[ https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761248#comment-13761248 ]
Maysam Yabandeh commented on HADOOP-9639:
-----------------------------------------

The design looks great. Just a couple of minor questions/comments:

1) I am wondering about the feasibility of using ZooKeeper's (ZK) ephemeral znodes for maintaining the .cleaner_lock files. This should address the problem of a dangling .cleaner_lock, since an ephemeral znode is removed automatically when the session that created it ends. Moreover, it shifts some of the read traffic from the NameNode to ZK. The volume of data that ZK needs to maintain is also small, assuming that the cleaner runs a limited number of concurrent threads.

2) The latest design relies on one isAppActive query to the ResourceManager (RM) per existing read lock. If this load turns out not to be negligible for some particular setting/workload, the cleaner can load the list of active apps in a single query to the RM and reuse that list for a predefined STALENESS period: i.e., a jar is subject to removal if (i) no app that is using it is in the list and (ii) the creation dates of all its read locks are older than the STALENESS period.

3) When a client loses the uploading race and determines that the winner's version is bad, there is a possibility (although very small) that a software/hardware bug led the loser to the wrong judgement about the correctness of the uploaded version. In this case, deleting the jar file can break the (correct) winner application. If we instead let the presumably incorrect version stay there, it will cause no harm and will eventually be deleted by the cleaner. Admittedly, in the rare case that the uploaded jar is actually incorrect, the cache entry for that jar becomes essentially useless (until it is removed by the cleaner), but one might prefer that over mistakenly breaking correct applications.
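To make the removal rule in point 2 concrete, here is a minimal sketch in Java. It is not taken from the design document: the class name StaleJarCheck, the ReadLock type, and the isRemovable method are hypothetical names used only for illustration, and the set of active app ids stands in for whatever a single bulk query to the RM would return. It also assumes that a jar with no read locks at all is removable.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these names come from the shared cache design.
class StaleJarCheck {

    /** Hypothetical read-lock record: the app holding it and its creation time. */
    static final class ReadLock {
        final String appId;
        final Instant createdAt;

        ReadLock(String appId, Instant createdAt) {
            this.appId = appId;
            this.createdAt = createdAt;
        }
    }

    /**
     * Decides whether a jar may be removed, given a possibly stale snapshot
     * of active app ids that is reused for at most `staleness`.
     *
     * (i)  no app holding a read lock appears in the snapshot, and
     * (ii) every read lock was created more than `staleness` ago, so an app
     *      that started after the snapshot was taken cannot be missed.
     *
     * Assumption: a jar with no read locks is treated as removable.
     */
    static boolean isRemovable(List<ReadLock> locks,
                               Set<String> activeAppsSnapshot,
                               Instant now,
                               Duration staleness) {
        Instant cutoff = now.minus(staleness);
        for (ReadLock lock : locks) {
            if (activeAppsSnapshot.contains(lock.appId)) {
                return false; // condition (i) violated
            }
            if (!lock.createdAt.isBefore(cutoff)) {
                return false; // condition (ii) violated: lock is too recent
            }
        }
        return true;
    }
}
```

Condition (ii) is what makes reusing the list safe: a lock younger than the STALENESS period may belong to an app that started after the list was fetched, so the jar is kept until a fresher list can confirm the app's state.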
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: HADOOP-9639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9639
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: filecache
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and
> files so that attempts from the same job can reuse them. However, sharing is
> limited with the distributed cache because it is normally on a per-job basis.
> On a large cluster, sometimes copying of jobjars and libjars becomes so
> prevalent that it consumes a large portion of the network bandwidth, not to
> speak of defeating the purpose of "bringing compute to where data is". This
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared
> cache so that multiple jobs from multiple users can share and cache jars.
> This JIRA is to open the discussion.