[jira] [Commented] (HADOOP-9639) truly shared cache for jars (jobjar/libjar)

Maysam Yabandeh (JIRA) Sun, 08 Sep 2013 03:44:35 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761248#comment-13761248 ]


Maysam Yabandeh commented on HADOOP-9639:
-----------------------------------------

The design looks great. Just a couple of minor questions/comments:

1) I am wondering about the feasibility of using ZooKeeper's (ZK) ephemeral 
znodes for maintaining the .cleaner_locks. It should address the problem of 
dangling .cleaner_lock. Moreover, it shifts some of the read traffic from the 
NameNode to ZK. The volume of data that ZK needs to maintain is also not much, 
assuming that the cleaner is running a limited number of concurrent threads.

2) The latest design relies on an isAppActive query to the ResourceManager (RM) 
per existing read lock. If this load turns out to be non-negligible for some 
particular setting or workload, the cleaner can instead load the list of active 
apps in a single query to the RM and reuse that list for a predefined STALENESS 
period: i.e., a jar is subject to removal if (i) no app that is using it is in 
the list and (ii) the creation dates of all its read locks are older than the 
STALENESS period.
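
A rough sketch of the removal check in (2), only to make the two conditions 
concrete. The names ReadLock, StaleJarCheck and isRemovable are hypothetical and 
do not come from the design document; the set of active app ids is assumed to 
have been fetched from the RM in one query no longer than STALENESS ago.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Set;

// Hypothetical representation of one read lock on a cached jar.
final class ReadLock {
    final String appId;       // id of the application holding the lock
    final Instant createdAt;  // creation time of the lock

    ReadLock(String appId, Instant createdAt) {
        this.appId = appId;
        this.createdAt = createdAt;
    }
}

final class StaleJarCheck {
    /**
     * Returns true only if (i) no lock belongs to an app in the cached
     * active-app list and (ii) every lock was created more than
     * `staleness` before `now`. Condition (ii) guards against apps that
     * started after the list was fetched. A jar with no read locks
     * passes both conditions trivially.
     */
    static boolean isRemovable(List<ReadLock> locks,
                               Set<String> activeApps,
                               Instant now,
                               Duration staleness) {
        Instant cutoff = now.minus(staleness);
        for (ReadLock lock : locks) {
            if (activeApps.contains(lock.appId)) {
                return false;  // (i) violated: a known-active app uses the jar
            }
            if (!lock.createdAt.isBefore(cutoff)) {
                return false;  // (ii) violated: lock is too recent to trust the list
            }
        }
        return true;
    }
}
```

With this shape the cleaner issues one RM query per STALENESS period instead of 
one per read lock, at the cost of keeping a jar for up to one extra period.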

3) When a client loses the uploading race and determines that the winner's 
version is bad, there is a possibility (although very small) that a 
software/hardware bug led the loser to the wrong judgement about the 
correctness of the uploaded version. In this case, deleting the jar file can 
break the (correct) winner application. If we instead let the presumably 
incorrect version stay there, it will cause no harm and will eventually be 
deleted by the cleaner. Admittedly, in the rare case that the uploaded jar is 
actually incorrect, the cache of the jar becomes essentially useless (until it 
is removed by the cleaner), but one might prefer that over mistakenly breaking 
correct applications.
                
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: HADOOP-9639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9639
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: filecache
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
