[jira] [Commented] (HADOOP-9639) truly shared cache for jars (jobjar/libjar)

Jason Lowe (JIRA) Wed, 12 Jun 2013 17:17:55 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681790#comment-13681790
 ]


Jason Lowe commented on HADOOP-9639:
------------------------------------

bq. The way MR currently uses it is that it sets up these files under each 
job's staging directory, so it's not ready for sharing.

Job jars are not required to be uploaded to the staging directory; see 
JobSubmitter.copyAndConfigureFiles.  If the job jar being submitted looks like 
it's already on the same filesystem as the staging directory, then it will not 
be uploaded to staging, since it's already accessible by the nodes for 
localization.  Yes, many jobs simply call setJarByClass, which will always find 
a local filesystem jar that therefore always gets uploaded.  However, a client 
can configure the job jar to point to a jar already in HDFS (or even specify no 
jar, implying the code is already sitting in the classpath on all the nodes).
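
As a rough illustration of that check, here is a simplified sketch. It is not 
the actual JobSubmitter code: the class and method names (JobJarUploadCheck, 
sameFileSystem, needsUpload) and the example URIs are made up, and it assumes 
that "same filesystem" means a matching URI scheme and authority.

```java
import java.net.URI;

// Simplified sketch of the "does the job jar need uploading?" decision.
// JobJarUploadCheck, sameFileSystem and needsUpload are hypothetical names,
// not Hadoop APIs.
final class JobJarUploadCheck {

    // Assumption: two URIs name the same filesystem only when both carry a
    // scheme and an authority, and those match.
    static boolean sameFileSystem(URI a, URI b) {
        if (a.getScheme() == null || a.getAuthority() == null
                || b.getScheme() == null || b.getAuthority() == null) {
            return false;
        }
        return a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getAuthority().equalsIgnoreCase(b.getAuthority());
    }

    // A null jar stands for "no jar specified": nothing to upload, the code
    // is assumed to be on the classpath of every node already.
    static boolean needsUpload(URI jobJar, URI stagingDir) {
        if (jobJar == null) {
            return false;
        }
        return !sameFileSystem(jobJar, stagingDir);
    }

    public static void main(String[] args) {
        URI staging = URI.create("hdfs://nn1:8020/user/alice/.staging");
        // Jar already in the same HDFS instance as the staging directory.
        System.out.println(needsUpload(
                URI.create("hdfs://nn1:8020/apps/myjob.jar"), staging));
        // Local jar, as found by setJarByClass.
        System.out.println(needsUpload(
                URI.create("file:///home/alice/myjob.jar"), staging));
    }
}
```

Under these assumptions, only the jar whose scheme and authority match the 
staging directory skips the upload; a local jar or a jar on a different 
namenode would still be copied.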

bq. However, the reality is that our MR jobs are a combination of all types of 
different MR apps (pig, scalding, handwritten MR apps, ...). Every framework 
would need to be retrofitted to do this for it to be useful. While it is not 
impossible, it may not be the most effective way.

Yes, it's not a turnkey solution, given most jobs want to submit local files by 
default. However, I think there will be a number of issues trying to make this 
a generalized solution via a couple of config options for a job.  As you 
mention, it will be easy to trash public files with the wrong versions (think 
two or more versions of pig running on the cluster, for example).  It seems a 
lot simpler to target some main offenders (e.g., frameworks like pig, scalding, 
etc. and large custom jobs with big dependencies shipped with the job).  The 
changes there are probably going to be very straightforward -- e.g., add a 
pig-specific config to tell pig where it can pick up the pig jar from HDFS 
rather than always shipping it with each job, etc.  Then we can place the pig 
jar in HDFS and set that config to avoid shipping the same pig jar for every 
job.  Another pig version?  Point that config to a different pig jar in HDFS.
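
A minimal sketch of what such a framework-side lookup could look like. 
Everything here is hypothetical: the config key "pig.shared.jar.path", the 
FrameworkJarLocator class and the example paths are invented for illustration, 
and none of it is an existing pig or Hadoop setting.

```java
import java.util.Properties;

// Hypothetical helper: prefer a shared jar location from config and fall
// back to shipping the local jar with the job.
final class FrameworkJarLocator {

    // Invented key for illustration; not an existing pig or Hadoop property.
    static final String SHARED_JAR_KEY = "pig.shared.jar.path";

    static String resolveJar(Properties conf, String localJarPath) {
        String shared = conf.getProperty(SHARED_JAR_KEY);
        if (shared != null && !shared.trim().isEmpty()) {
            // Shared copy configured: reference it instead of uploading.
            return shared.trim();
        }
        // No shared copy configured: keep shipping the local jar.
        return localJarPath;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(SHARED_JAR_KEY, "hdfs://nn1:8020/share/pig/pig.jar");
        System.out.println(resolveJar(conf, "/opt/pig/pig.jar"));
    }
}
```

Running a second pig version would then only mean pointing the same key at a 
different jar path for those jobs.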

There might be a way to solve the problems you mentioned above in some general 
way, but I think it will be complicated.  Discerning the proper public/private 
intent from local files submitted to the dist cache is going to be tricky to do 
automatically, if it can be done at all, and that is compounded by the 
management issue of when to clobber/purge the retention area in HDFS.

So while we're hashing out if and how this could be done at the MR framework 
level, your custom jobs with large jars should be able to reap the desired 
performance benefits with relatively small changes to their job setup code.
                
> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: HADOOP-9639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9639
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: filecache
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
