Re: Too large class path for map reduce jobs

Henning Blohm Wed, 06 Oct 2010 04:56:56 -0700

Hi Alejandro,

   yes, it can of course be done right (sorry if my wording seemed to
imply otherwise). Just saying that I think that Hadoop M/R should not go
into that class loader / module separation business. It's one Job, one
VM, right? So the problem is to assign just the stuff needed to let the
Job do its business without becoming an obstacle.


  Must admit I didn't understand your proposal 2. How would that remove
(e.g.) jetty libs from the job's classpath?

Thanks,
  Henning

On Wednesday, 06.10.2010, at 18:28 +0800, Alejandro Abdelnur wrote:

> 1. Classloader business can be done right. Actually it could be done
> as spec-ed for servlet web-apps. 
> 
> 
> 
> 2. If the issue is strictly 'too large classpath', then a simpler
> solution would be to soft-link all JARs to the current directory and
> create the classpath with the JAR names only (no path). Note that the
> soft-linking business is already supported by the DistributedCache. So
> the changes would be mostly in the TT to create the JAR names only
> classpath before starting the child.
> 
> 
> Alejandro
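
A minimal sketch of what proposal 2 seems to describe, written in Python for brevity rather than as TaskTracker code. The function name and the collision check are assumptions made for illustration; nothing here is taken from Hadoop itself. It soft-links each JAR into a working directory and returns a classpath made of bare file names:

```python
import os


def link_jars_and_build_classpath(jar_paths, work_dir):
    """Soft-link each JAR into work_dir and return a names-only classpath.

    Hypothetical helper: it only illustrates the idea from the thread.
    """
    names = []
    for jar in jar_paths:
        target = os.path.abspath(jar)
        name = os.path.basename(target)
        link = os.path.join(work_dir, name)
        if os.path.lexists(link):
            # Two different JARs with the same file name cannot share one
            # directory, so refuse rather than silently pick one of them.
            if os.path.realpath(link) != os.path.realpath(target):
                raise ValueError("JAR name collision: " + name)
        else:
            os.symlink(target, link)
        if name not in names:
            names.append(name)
    # Relative entries are resolved against the working directory of the
    # process that uses the classpath.
    return os.pathsep.join(names)
```

The child VM would have to be started with work_dir as its current directory for the relative names to resolve. Note that this only shortens the classpath string; every JAR that was linked is still visible to the job.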
> 
> 
> 
> On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm
> <henning.bl...@zfabrik.de> wrote:
> 
>         Hi Tom,
>         
>           that's exactly it. Thanks! I don't think that I can comment
>         on the issues in Jira so I will do it here.
>         
>           Playing tricks with class paths and deviating from the default
>         class loading delegation has never been anything but a
>         short-term relief. Fixing things by imposing a "better" order of
>         stuff on the class path will not work when people do actually
>         use child loaders (as the parent wins) - like we do. Also it
>         may easily lead to very confusing situations because the
>         former part of the class path is not complete and gets other
>         stuff from a later part etc. etc.... no good.
>         
>           Child loaders are good for module separation but should not
>         be used to "hide" type visibility from the parent. That almost
>         certainly leads to a Class Loader Constraint Violation - once
>         you lose control (which is usually earlier than expected).
>         
>           The suggestion to reduce the Job class path to the required
>         minimum is the most practical approach. There is some gray
>         area there of course and it will not be feasible to reach the
>         absolute minimal set of types there - but something
>         reasonable, i.e. the hadoop core that suffices to run the job.
>         Certainly jetty & co are not required for job execution (btw.
>         I "hacked" 0.20.2 to remove anything in "server/" from the
>         classpath before setting the job class path).
>         
>           I would suggest to
>         
>           a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is
>         the additional classpath, added to the "core" classpath (as
>         described above). If not set, for compatibility, preserve
>         today's behavior.
>           b) not get into custom child loaders for jobs as part of
>         hadoop M/R. It's non-trivial to get right and feels
>         out of scope.
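
To make suggestion a) concrete, here is a minimal sketch of the selection logic, in Python for brevity. HADOOP_JOB_CLASSPATH is the variable proposed in this thread, not an existing Hadoop setting, and the function and parameter names are hypothetical:

```python
import os


def task_classpath(core_classpath, parent_classpath, env=None):
    """Choose the classpath for a child task VM.

    core_classpath:   the reduced "core" set that suffices to run a job
    parent_classpath: what the parent process runs with (today's behavior)
    env:              environment mapping; defaults to os.environ
    """
    env = os.environ if env is None else env
    job_cp = env.get("HADOOP_JOB_CLASSPATH")
    if not job_cp:
        # Variable unset or empty: keep today's behavior for compatibility.
        return parent_classpath
    # Variable set: core classpath plus the additional job classpath.
    return os.pathsep.join([core_classpath, job_cp])
```

With the variable unset, the task sees exactly what it sees today, so existing installations would not change unless they opt in.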
>         
>           I wouldn't mind helping btw.
>         
>         Thanks,
>           Henning
>         
>         
>         
>         
>         
>         On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote: 
>         
>         > Hi Henning,
>         > 
>         > I don't know if you've seen
>         > https://issues.apache.org/jira/browse/MAPREDUCE-1938 and
>         > https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have
>         > discussion about this issue.
>         > 
>         > Cheers
>         > Tom
>         > 
>         > On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm <henning.bl...@zfabrik.de> wrote:
>         > > Short update on the issue:
>         > >
>         > > I tried to find a way to separate class path configurations by
>         > > modifying the scripts in HADOOP_HOME/bin but found that TaskRunner
>         > > actually copies the class path setting from the parent process when
>         > > starting a local task, so I do not see a way of having less on a
>         > > job's classpath without modifying Hadoop.
>         > >
>         > > As that will present a real issue when running our jobs on Hadoop,
>         > > I would like to propose changing TaskRunner so that it sets a class
>         > > path specifically for M/R tasks. That class path could be defined in
>         > > the scripts (as for the other processes) using a particular
>         > > environment variable (e.g. HADOOP_JOB_CLASSPATH). It could default
>         > > to the current VM's class path, preserving today's behavior.
>         > >
>         > > Is it ok to enter this as an issue?
>         > >
>         > > Thanks,
>         > >   Henning
>         > >
>         > >
>         > > On Friday, 17.09.2010, at 16:01 +0000, Allen Wittenauer wrote:
>         > >
>         > > On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote:
>         > >
>         > >> When running map reduce tasks in Hadoop I run into classpath issues.
>         > >> Contrary to previous posts, my problem is not that I am missing
>         > >> classes on the Task's class path (we have a perfect solution for
>         > >> that) but rather find too many (e.g. ECJ classes or jetty).
>         > >
>         > > The fact that you mention:
>         > >
>         > >> The libs in HADOOP_HOME/lib seem to contain everything needed to run
>         > >> anything in Hadoop which is, I assume, much more than is needed to
>         > >> run a map reduce task.
>         > >
>         > > hints that your perfect solution is to throw all your custom stuff
>         > > in lib. If so, that's a huge mistake.  Use distributed cache instead.
>         > >
>         
>         
>         
> 
> 
>
