Re: Job Recoverability

Robert Kanter Tue, 06 Aug 2013 10:43:59 -0700

Mona,
As far as I'm aware, the "retry" that Oozie is doing is just retrying to
connect to the JT (which is why when the JT comes back up, Oozie
can continue monitoring the Hadoop job if it still has the same ID); it
doesn't try to submit the job again as part of the "retry".
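
To make the distinction concrete, here is a minimal Python sketch of "retry the connection, never resubmit". It is a toy model, not Oozie code: FakeJobTracker, poll_with_retry, and the job ID are hypothetical names made up for illustration, and it assumes the JT comes back knowing the same job ID.

```python
class FakeJobTracker:
    """Toy stand-in for a JT that is unreachable for the first few calls."""

    def __init__(self, down_for, jobs):
        self.down_for = down_for   # number of status calls that fail
        self.jobs = dict(jobs)     # job ID -> status
        self.submissions = 0       # polling never touches this counter

    def get_status(self, job_id):
        if self.down_for > 0:
            self.down_for -= 1
            raise ConnectionError("JT unavailable")
        if job_id not in self.jobs:
            # Roughly what happens when the JT forgot the ID after a restart.
            raise KeyError("Unknown Hadoop Job: " + job_id)
        return self.jobs[job_id]


def poll_with_retry(jt, job_id, max_retries):
    """Retry only the connection; the job itself is never resubmitted."""
    for _ in range(max_retries + 1):
        try:
            return jt.get_status(job_id)
        except ConnectionError:
            continue  # JT still down: try to connect again
    raise ConnectionError("JT did not come back within the retry limit")


jt = FakeJobTracker(down_for=2, jobs={"job_0001": "RUNNING"})
status = poll_with_retry(jt, "job_0001", max_retries=3)
```

In this sketch a JT that comes back without the job ID surfaces as a KeyError rather than triggering a new submission, which loosely mirrors the "Unknown Hadoop Job" case mentioned further down the thread.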


Mayank,
We can put the ID for the actual job in the Child IDs tab (like with Pig).


- Robert


On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]> wrote:

> I agree, we should handle these two scenarios. I am OK with changing the
> launcher behavior for MR; however, if we remove the ID swap, how do we
> navigate to MR jobs from the UI as we do right now?
>
> Thanks,
> Mayank
>
>
> On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]>
> wrote:
>
> > Suppose we leave the MR ID swap thing as is but set the launcher recover to
> > 0 and job to 1; then consider these two scenarios:
> >
> > 1. JT gets restarted during the launcher job but before the launcher job
> > actually launches the real job:
> >      - The launcher job won't be recovered because we told it not to
> >      - The real job was never launched
> >      ---> Action never completes and Oozie marks it as failed
> >
> > 2. Launcher job submits the real job, but JT gets restarted before the
> > Oozie server has a chance to swap IDs (it's not an atomic operation):
> >      - The launcher job won't be recovered because we told it not to
> >      - The real job will be recovered and finish successfully
> >      ---> Oozie marks the action as failed even though the actual job
> > succeeded because it didn't know about the ID swap
> >
> > It would only work for the case where the JT gets restarted after the ID
> > swap occurs.
> >
> >
> > - Robert
> >
> >
> > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]>
> > wrote:
> >
> > > Hi Robert,
> > >
> > > +1 for Oozie to set recovery to 1 for the launcher and 0 for the jobs in
> > > all cases except MR.
> > >
> > > After the ID swap, Oozie only knows about the MR job, doesn't it? Then
> > > there should not be any problem.
> > >
> > > If we set MR launcher recover to 0 and job to 1, then the job will
> > > succeed in case of a JT restart.
> > >
> > > Am I missing something?
> > >
> > > Thanks,
> > > Mayank
> > >
> > >
> > >
> > >
> > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]>
> > > wrote:
> > >
> > > > I think you usually just get the "Unknown Hadoop Job" error message because
> > > > Oozie tries to look up the Hadoop Job ID it already has, but the JT no
> > > > longer has that ID because it was restarted.  With JT Recoverability turned
> > > > on, it will restart the job using the same ID, so Oozie continues just
> > > > fine.
> > > >
> > > > - Robert
> > > >
> > > >
> > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> > > > <[email protected]> wrote:
> > > >
> > > > > Wouldn't Oozie poll for the job status, decide that it has failed, and,
> > > > > when the JT comes up, launch another one if retry is configured?
> > > > >
> > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We looked into how to support Job Recoverability (i.e. the JT is restarted
> > > > > > and it wants to restart the jobs that were running; similarly for YARN) and
> > > > > > have a pretty simple solution for all of the action types except for
> > > > > > MapReduce.  If we set mapreduce.job.restart.recover=true for the launcher
> > > > > > job and mapreduce.job.restart.recover=false for the jobs launched by the
> > > > > > launcher, then when the JT restarts, it will recover the launcher job but
> > > > > > not the child jobs -- the launcher job will then take care of relaunching
> > > > > > the child jobs.
> > > > > >
> > > > > > For MapReduce, because of the optimization with the ID swap, this won't
> > > > > > work.  It would be very tricky, if it's even practical, to do something
> > > > > > similar for the MR action.  Instead, we think it would be best if we simply
> > > > > > remove the MR optimization and make it just like the other action types.  I
> > > > > > know we normally don't want to remove optimizations, but there are many
> > > > > > advantages in this case, and it only saves a single Map slot, and only
> > > > > > for MR jobs.
> > > > > >
> > > > > > I've created OOZIE-1483
> > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with
> > > > > > more details and should have a patch soon.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > >
> > > > > > thanks
> > > > > > - Robert
> > > > > >
> > > > >
> > > >
> > >
> >
>
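
The scenarios discussed in the thread (ID swap left as is, launcher recover set to 0, job recover set to 1) can be sketched as a small Python model. This is only an illustration of the reasoning above, not Oozie or Hadoop code: mr_action_outcome, the restart-point labels, and the result strings are hypothetical names chosen for this example.

```python
LAUNCHER_RECOVER = False  # launcher recover set to 0
JOB_RECOVER = True        # real job recover set to 1

RESTART_POINTS = ("before_launch", "after_launch_before_swap", "after_swap")


def mr_action_outcome(restart_point):
    """Outcome of an MR action when the JT restarts at the given point."""
    if restart_point not in RESTART_POINTS:
        raise ValueError("unknown restart point: " + restart_point)

    real_job_launched = restart_point != "before_launch"
    # Oozie only tracks the real job's ID once the (non-atomic) swap is done.
    oozie_tracks_real_id = restart_point == "after_swap"

    real_job_recovered = real_job_launched and JOB_RECOVER
    if oozie_tracks_real_id:
        tracked_job_recovered = real_job_recovered
    else:
        tracked_job_recovered = LAUNCHER_RECOVER

    if not real_job_launched:
        real_job = "never launched"
    elif real_job_recovered:
        real_job = "succeeded"
    else:
        real_job = "lost"

    return {
        "real_job": real_job,
        "oozie_action": "ok" if tracked_job_recovered else "failed",
    }


outcomes = {point: mr_action_outcome(point) for point in RESTART_POINTS}
```

Under these assumptions only the "after_swap" case ends with both the real job and the Oozie action in a good state, which matches the observation that the setting only works when the JT restarts after the ID swap.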
