Re: hadoop jobs take long time to setup

Mikhail Bautin Sun, 28 Jun 2009 15:33:36 -0700

Marcus,

The code that needs to be patched is in the tasktracker, because the
tasktracker is what starts the child JVM that runs user code.
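
As an illustration only: the sketch below is not the attached patch and does
not use Hadoop's classes. It just shows the general idea of appending extra
jar paths, read from a configuration property, to the classpath handed to a
child JVM. The class name, the property name "example.child.extra.jars" and
the NFS paths are all hypothetical, and java.util.Properties stands in for
the job configuration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; not the code from the patch discussed in this thread.
class ChildClasspathSketch {
    // Hypothetical property name, not defined by Hadoop or by the patch.
    static final String EXTRA_JARS_KEY = "example.child.extra.jars";

    // Returns the base classpath entries followed by any extra jars listed
    // (comma-separated) in the configuration, joined with the platform's
    // path separator.
    static String buildChildClasspath(List<String> baseEntries, Properties conf) {
        List<String> entries = new ArrayList<String>(baseEntries);
        String extra = conf.getProperty(EXTRA_JARS_KEY, "");
        for (String jar : extra.split(",")) {
            String trimmed = jar.trim();
            if (!trimmed.isEmpty()) {
                // Already-local path (e.g. on an NFS mount): nothing is copied.
                entries.add(trimmed);
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(EXTRA_JARS_KEY, "/mnt/nfs/lib/a.jar, /mnt/nfs/lib/b.jar");
        List<String> base = Arrays.asList("job.jar", "classes");
        System.out.println(buildChildClasspath(base, conf));
    }
}
```

Because the jars are referenced by a path that every node can already read,
nothing has to be uploaded when a job starts.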


Thanks,
Mikhail

On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:

> Hi.
>
> Just to be clear: is it the jobtracker that needs the patched code, or is
> it the tasktrackers?
>
> Kindly
>
> //Marcus
>
> On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin <mbau...@gmail.com> wrote:
>
> > Marcus,
> >
> > We currently use 0.20.0 but this patch just inserts 8 lines of code into
> > TaskRunner.java, which could certainly be done with 0.18.3.
> >
> > Yes, this patch just appends additional jars to the child JVM classpath.
> >
> > I've never really used tmpjars myself, but if it involves uploading
> > multiple jar files into HDFS every time a job is started, I see how it
> > can be really slow. On our ~80-job workflow this would have really slowed
> > things down.
> >
> > Thanks,
> > Mikhail
> >
> > On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> >
> > > Makes sense... I will try both rsync and NFS. I think rsync will beat
> > > NFS, since NFS can be slow as hell sometimes, but what the heck, we
> > > already have our maven2 repo on NFS, so why not :)
> > >
> > > Are you saying that this patch lets the client configure which "extra"
> > > local jar files to add to the classpath when firing up the
> > > TaskTrackerChild?
> > >
> > > To be explicit: do you confirm that using tmpjars the way I do is a
> > > costly, slow operation?
> > >
> > > To what branch do you apply the patch (we use 0.18.3)?
> > >
> > > Cheers
> > >
> > > //Marcus
> > >
> > >
> > > On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin <mbau...@gmail.com> wrote:
> > >
> > > > This is the way we deal with this problem, too. We put our jar files
> > > > on NFS, and the attached patch makes it possible to add those jar
> > > > files to the tasktracker classpath through a configuration property.
> > > >
> > > > Thanks,
> > > > Mikhail
> > > >
> > > > On Sun, Jun 28, 2009 at 5:21 PM, Stuart White <stuart.whi...@gmail.com> wrote:
> > > >
> > > >> Although I've never done it, I believe you could manually copy your
> > > >> jar files out to your cluster somewhere in hadoop's classpath, and
> > > >> that would remove the need for you to copy them to your cluster at
> > > >> the start of each job.
> > > >>
> > > >> On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> > > >>
> > > >> > Hi.
> > > >> >
> > > >> > Running without a jobtracker makes the job start almost instantly.
> > > >> > I think it is due to something with the classloader. I use a huge
> > > >> > number of jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")...,
> > > >> > which need to be loaded every time, I guess.
> > > >> >
> > > >> > If I issue conf.setNumTasksToExecutePerJvm(-1);, will the
> > > >> > TaskTracker child live forever then?
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > //Marcus
> > > >> >
> > > >> > On Sun, Jun 28, 2009 at 9:54 PM, tim robertson <timrobertson...@gmail.com> wrote:
> > > >> >
> > > >> > > How long does it take to start the code locally in a single thread?
> > > >> > >
> > > >> > > Can you reuse the JVM so it only starts once per node per job?
> > > >> > > conf.setNumTasksToExecutePerJvm(-1)
> > > >> > >
> > > >> > > Cheers,
> > > >> > > Tim
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
> > > >> > > > Hi.
> > > >> > > >
> > > >> > > > I wonder how one should improve the startup times of a hadoop
> > > >> > > > job. Some of my jobs, which have a lot of dependencies in the
> > > >> > > > form of many jar files, take a long time to start in hadoop,
> > > >> > > > sometimes up to 2 minutes.
> > > >> > > > The data input amounts in these cases are negligible, so it
> > > >> > > > seems that Hadoop has a really high setup cost, which I can
> > > >> > > > live with, but this seems too much.
> > > >> > > >
> > > >> > > > Let's say a job takes 10 minutes to complete; then it is bad if
> > > >> > > > it takes 2 mins to set it up... 20-30 sec max would be a lot
> > > >> > > > more reasonable.
> > > >> > > >
> > > >> > > > Hints ?
> > > >> > > >
> > > >> > > > //Marcus
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > >> > > > +46702561312
> > > >> > > > marcus.he...@tailsweep.com
> > > >> > > > http://www.tailsweep.com/
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Marcus Herou CTO and co-founder Tailsweep AB
> > > >> > +46702561312
> > > >> > marcus.he...@tailsweep.com
> > > >> > http://www.tailsweep.com/
> > > >> >
> > > >>
> > > >
> > > >
> > >
> > >
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
>
