If you are running on EC2, you can use Elastic MapReduce. It has a startup
option where you specify the driver class in your jar, and it will run the
driver, I believe, on the namenode. That won't really add any overhead:
when the namenode is under stress, the driver is just sitting quietly
waiting for the job to complete, and vice versa.

Just an option.

On Thu, Sep 29, 2011 at 10:53 AM, Aaron Baff <aaron.b...@telescope.tv> wrote:

> Yea, we don't want it to sit there waiting for the Job to complete, even if
> it's just a few minutes.
>
> --Aaron
> -----Original Message-----
> From: turboc...@gmail.com [mailto:turboc...@gmail.com] On Behalf Of John
> Conwell
> Sent: Thursday, September 29, 2011 10:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Job's in sequence
>
> After you kick off a job, say JobA, your client doesn't need to sit and
> ping Hadoop to see if it finished before it starts JobB. You can have
> the client block until the job is complete with
> "Job.waitForCompletion(boolean verbose)". Using this you can create a
> "job driver" that chains jobs together easily.
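>
> For illustration, a minimal driver sketch along those lines (the class
> name, paths, and job setup are hypothetical placeholders, not something
> from your setup):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class JobDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>
>     Job jobA = new Job(conf, "JobA");
>     jobA.setJarByClass(JobDriver.class);
>     // ... set mapper/reducer/output classes for JobA here ...
>     FileInputFormat.addInputPath(jobA, new Path(args[0]));
>     FileOutputFormat.setOutputPath(jobA, new Path(args[1]));
>
>     // Block until JobA finishes; bail out if it failed.
>     if (!jobA.waitForCompletion(true)) {
>       System.exit(1);
>     }
>
>     Job jobB = new Job(conf, "JobB");
>     jobB.setJarByClass(JobDriver.class);
>     // ... JobB consumes whatever JobA wrote ...
>     FileInputFormat.addInputPath(jobB, new Path(args[1]));
>     FileOutputFormat.setOutputPath(jobB, new Path(args[2]));
>
>     System.exit(jobB.waitForCompletion(true) ? 0 : 1);
>   }
> }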
>
> Now, if your job takes 2 weeks to run, you can't kill your driver
> process. If you do, JobA will finish running, but JobB will never
> start.
>
> JohnC
>
> On Thu, Sep 29, 2011 at 9:51 AM, Aaron Baff <aaron.b...@telescope.tv>
> wrote:
>
> > I saw this, but wasn't sure if it was something that ran on the client
> > and just submitted the Jobs in sequence, or if it handed everything to
> > the JobTracker, which took care of submitting the Jobs in sequence
> > appropriately.
> >
> > Basically, I'm looking for a completely stateless client that doesn't
> > need to ping the JobTracker every now and then to see if a Job has
> > completed, and then submit the next one. The ideal flow would be: the
> > client gets a request to run the series of Jobs, preps and configures
> > them all, and then passes them off to the JobTracker, which runs them
> > all in order without the client application needing to do anything
> > further.
> >
> > Sounds like that doesn't really exist as part of the Hadoop framework,
> > and needs something like Oozie (or a home-built system) to do this.
> >
> > --Aaron
> > -----Original Message-----
> > From: Harsh J [mailto:ha...@cloudera.com]
> > Sent: Wednesday, September 28, 2011 9:37 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Running multiple MR Job's in sequence
> >
> > Within the Hadoop core project, there is JobControl you can utilize
> > for this. You can view its API at
> > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> > and it is fairly simple to use (create jobs with the regular Java API,
> > then build a dependency flow with JobControl atop those JobConf
> > objects).
> >
> > Apache Oozie and other such tools offer higher-level abstractions for
> > controlling a workflow, and can be considered when your needs get a
> > bit more complex than a simple series (easier handling of failure
> > scenarios between dependent jobs, minor fs operations in pre/post
> > processing, etc.).
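> >
> > For a concrete feel of JobControl, here is a rough old-API sketch (the
> > class name and job configuration details are made up; fill in the
> > JobConfs as you would for any ordinary job):
> >
> > import org.apache.hadoop.mapred.JobConf;
> > import org.apache.hadoop.mapred.jobcontrol.Job;
> > import org.apache.hadoop.mapred.jobcontrol.JobControl;
> >
> > public class ChainDriver {
> >   public static void main(String[] args) throws Exception {
> >     JobConf confA = new JobConf(); // configure JobA as usual
> >     JobConf confB = new JobConf(); // JobB reads JobA's output path
> >
> >     Job jobA = new Job(confA);
> >     Job jobB = new Job(confB);
> >     jobB.addDependingJob(jobA); // JobB won't start until JobA succeeds
> >
> >     JobControl control = new JobControl("chain");
> >     control.addJob(jobA);
> >     control.addJob(jobB);
> >
> >     // JobControl is a Runnable; it submits jobs as dependencies are met.
> >     Thread runner = new Thread(control);
> >     runner.start();
> >     while (!control.allFinished()) {
> >       Thread.sleep(5000);
> >     }
> >     control.stop();
> >   }
> > }
> >
> > Note that JobControl itself runs in the client JVM, so on its own it
> > doesn't give you the fire-and-forget behavior you described.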
> >
> > On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <aaron.b...@telescope.tv>
> > wrote:
> > > Is it possible to submit a series of MR Jobs to the JobTracker to
> > > run in sequence (one finishes, take its output if successful and
> > > feed it into the next, etc.), or does it need to run client-side
> > > using JobControl, something like Oozie, or something we roll
> > > ourselves? What I'm looking for is fire & forget, where we
> > > occasionally check back to see if it's done, so the client side
> > > doesn't really need to know or keep track of anything. Does
> > > something like that exist within the Hadoop framework?
> > >
> > > --Aaron
> > >
> >
> >
> >
> > --
> > Harsh J
> >
>
>
>
> --
>
> Thanks,
> John C
>



-- 

Thanks,
John C
