Re: Job Recoverability

Mayank Bansal Tue, 06 Aug 2013 10:49:07 -0700

Robert,

That's a break in backward compatibility. Until now, users have been used
to clicking on the link to go to the MR page.


Is there a better way to handle this?

Thanks,
Mayank




On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]> wrote:

> Mona,
> As far as I'm aware, the "retry" that Oozie is doing is just retrying to
> connect to the JT (which is why when the JT comes back up, Oozie
> can continue monitoring the hadoop job if it still has the same ID); it
> doesn't try to submit the job again as part of the "retry".
>
> Mayank,
> We can put the ID for the actual job in the Child IDs tab (like with Pig).
>
>
> - Robert
>
>
> On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]> wrote:
>
> > I agree, we should handle these two scenarios. I am OK with changing the
> > launcher behavior for MR; however, if we remove the ID swap, how do we
> > navigate to MR jobs from the UI as we do right now?
> >
> > Thanks,
> > Mayank
> >
> >
> > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]>
> > wrote:
> >
> > > Suppose we leave the MR ID swap thing as is but set the launcher recover
> > > to 0 and job to 1; then consider these two scenarios:
> > >
> > > 1. JT gets restarted during the launcher job but before the launcher job
> > > actually launches the real job:
> > >      - The launcher job won't be recovered because we told it not to
> > >      - The real job was never launched
> > >      ---> Action never completes and Oozie marks it as failed
> > >
> > > 2. Launcher job submits the real job, but JT gets restarted before the
> > > Oozie server has a chance to swap IDs (it's not an atomic operation):
> > >      - The launcher job won't be recovered because we told it not to
> > >      - The real job will be recovered and finish successfully
> > >      ---> Oozie marks the action as failed even though the actual job
> > > succeeded because it didn't know about the ID swap
> > >
> > > It would only work for the case where the JT gets restarted after the ID
> > > swap occurs.
> > >
> > >
> > > - Robert
> > >
> > >
> > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]>
> > > wrote:
> > >
> > > > Hi Robert,
> > > >
> > > > +1 for Oozie to set recovery to 1 for the launcher and 0 for the jobs
> > > > in all cases except MR.
> > > >
> > > > After the ID swap, Oozie only knows about the MR job, doesn't it? Then
> > > > there should not be any problem.
> > > >
> > > > If we set MR launcher recover to 0 and job to 1, then the job will
> > > > succeed in case of a JT restart.
> > > >
> > > > Am I missing something?
> > > >
> > > > Thanks,
> > > > Mayank
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]>
> > > > wrote:
> > > >
> > > > > I think you usually just get the "Unknown Hadoop Job" error message
> > > > > because Oozie tries to look up the Hadoop Job ID it already has, but
> > > > > the JT no longer has that ID because it was restarted.  With JT
> > > > > Recoverability turned on, it will restart the job using the same ID,
> > > > > so Oozie continues just fine.
> > > > >
> > > > > - Robert
> > > > >
> > > > >
> > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Wouldn't Oozie poll for the job status, decide that it has failed,
> > > > > > and, when the JT comes up, launch another one if retry is
> > > > > > configured?
> > > > > >
> > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We looked into how to support Job Recoverability (i.e. the JT is
> > > > > > > restarted and it wants to restart the jobs that were running;
> > > > > > > similarly for YARN) and have a pretty simple solution for all of
> > > > > > > the action types except for MapReduce.  If we set
> > > > > > > mapreduce.job.restart.recover=true for the launcher job and
> > > > > > > mapreduce.job.restart.recover=false for the jobs launched by the
> > > > > > > launcher, then when the JT restarts, it will recover the launcher
> > > > > > > job but not the child jobs -- the launcher job will then take care
> > > > > > > of relaunching the child jobs.
> > > > > > >
> > > > > > > For MapReduce, because of the optimization with the id swap, this
> > > > > > > won't work.  It would be very tricky, if it's even practical, to
> > > > > > > do something similar for the MR action.  Instead, we think it
> > > > > > > would be best if we simply remove the MR optimization and make it
> > > > > > > just like the other action types.  I know we normally don't want
> > > > > > > to remove optimizations, but there are many advantages in this
> > > > > > > case, and it's only saving a single Map slot for MR jobs only.
> > > > > > >
> > > > > > > I've created OOZIE-1483
> > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with more
> > > > > > > details and should have a patch soon.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > >
> > > > > > > thanks
> > > > > > > - Robert
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
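
The two failure scenarios Robert describes (plus the one case that works)
can be sketched as a tiny state model. This is only an illustration of the
reasoning in the thread, assuming the MR ID swap is kept, the launcher is
set to recover = 0 and the real job to recover = 1. The function name, the
restart-point labels, and the dictionary layout are all hypothetical; none
of this is Oozie or Hadoop code.

```python
# Illustrative model only; none of this is Oozie or Hadoop code.
# Assumption: MR ID swap kept, launcher recover = 0, real job recover = 1.

# Hypothetical labels for when the JT restart happens.
RESTART_POINTS = ("before_submit", "after_submit_before_swap", "after_swap")


def simulate(restart_point):
    """Return (real_job_succeeded, oozie_marks_action_ok) for one JT restart."""
    if restart_point not in RESTART_POINTS:
        raise ValueError(f"unknown restart point: {restart_point}")

    # The real job exists only if the launcher submitted it before the restart.
    real_job_submitted = restart_point != "before_submit"

    # Oozie tracks the launcher ID until the (non-atomic) swap has happened.
    tracked = "real" if restart_point == "after_swap" else "launcher"

    # What the JT brings back after the restart under these settings.
    recovered = {
        "launcher": False,           # launcher recover = 0
        "real": real_job_submitted,  # recover = 1, if it was ever submitted
    }

    real_job_succeeded = recovered["real"]
    oozie_marks_action_ok = recovered[tracked]
    return real_job_succeeded, oozie_marks_action_ok


if __name__ == "__main__":
    for point in RESTART_POINTS:
        print(point, simulate(point))
```

In this model only the "after_swap" case ends with both the real job and
the Oozie action succeeding, which matches the conclusion in the thread
that the settings work only when the JT restarts after the ID swap.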
