Re: Job Recoverability

Robert Kanter Tue, 06 Aug 2013 13:13:29 -0700

Virag,
I just tested out killing the JT and waiting for the Checker service to
retry and give up: the action goes to START_MANUAL and the job gets
SUSPENDED.  I waited around long enough, but the RecoveryService didn't do
anything.  Does it kick in for you?  As a side note, looking at the code,
the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and
USER_RETRY, which all sound like things the user should be doing; is it
correct that RecoveryService is handling these?
The "Unknown Hadoop Job" error happens when the JT comes back in time, because
it won't know about the old ID if it's not recovering jobs.  So, Oozie tries
to ask it about a job that no longer exists.  I'm not sure that this should
be a transient error, because there's no way to determine whether it's because
the JT restarted and Oozie should resubmit the job, or whether something else
happened.


Mayank,
That is a good point.  We could either make a v3 API or add an oozie-site
config to turn on/off the id swap behavior and keep the v2 API.
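
The config-switch option could look roughly like the sketch below.  It is
only an illustration under stated assumptions: the property name
oozie.action.mapreduce.id.swap.enabled is made up (it is not an existing
Oozie setting), java.util.Properties stands in for the real configuration,
and the class and method names are hypothetical.  Defaulting to "true" would
keep the current behavior for v2 API clients.

```java
import java.util.Properties;

public class IdSwapSketch {
    // Hypothetical property name, invented for this sketch only.
    static final String ID_SWAP_ENABLED = "oozie.action.mapreduce.id.swap.enabled";

    // Defaults to true so that existing clients keep seeing the swapped ID.
    static boolean swapEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(ID_SWAP_ENABLED, "true"));
    }

    // The ID the action would expose as its external ID (the UI link target).
    static String externalId(Properties conf, String launcherId, String realJobId) {
        return swapEnabled(conf) ? realJobId : launcherId;
    }

    // The ID to list under Child IDs (as with Pig); null when the swap is on.
    static String childId(Properties conf, String realJobId) {
        return swapEnabled(conf) ? null : realJobId;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(ID_SWAP_ENABLED, "false");
        System.out.println(externalId(conf, "launcher-id", "real-job-id"));
        System.out.println(childId(conf, "real-job-id"));
    }
}
```

With the swap turned off, the launcher ID stays the external ID and the real
job's ID moves to Child IDs; with it on (the default here), nothing changes
for existing users.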

thanks
- Robert




On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <[email protected]> wrote:

> Robert,
>
> That's a break in backward compatibility. Until now, users have been used to
> clicking on the link to go to the MR page.
>
> Is there a better way to handle this?
>
> Thanks,
> Mayank
>
>
>
>
> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]>
> wrote:
>
> > Mona,
> > As far as I'm aware, the "retry" that Oozie is doing is just retrying to
> > connect to the JT (which is why when the JT comes back up, Oozie
> > can continue monitoring the hadoop job if it still has the same ID); it
> > doesn't try to submit the job again as part of the "retry".
> >
> > Mayank,
> > We can put the ID for the actual job in the Child IDs tab (like with
> > Pig).
> >
> >
> > - Robert
> >
> >
> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]>
> > wrote:
> >
> > > I agree, we should handle these two scenarios. I am OK with changing the
> > > launcher behavior for MR; however, if we remove the ID swap, then how do
> > > we navigate to MR jobs from the UI as we do right now?
> > >
> > > Thanks,
> > > Mayank
> > >
> > >
> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]>
> > > wrote:
> > >
> > > > Suppose we leave the MR ID swap as is but set the launcher recover to 0
> > > > and the job to 1; then consider these two scenarios:
> > > >
> > > > 1. JT gets restarted during the launcher job but before the launcher job
> > > > actually launches the real job:
> > > >      - The launcher job won't be recovered because we told it not to
> > > >      - The real job was never launched
> > > >      ---> Action never completes and Oozie marks it as failed
> > > >
> > > > 2. Launcher job submits the real job, but JT gets restarted before the
> > > > Oozie server has a chance to swap IDs (it's not an atomic operation):
> > > >      - The launcher job won't be recovered because we told it not to
> > > >      - The real job will be recovered and finish successfully
> > > >      ---> Oozie marks the action as failed even though the actual job
> > > > succeeded because it didn't know about the ID swap
> > > >
> > > > It would only work for the case where the JT gets restarted after the ID
> > > > swap occurs.
> > > >
> > > >
> > > > - Robert
> > > >
> > > >
> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Robert,
> > > > >
> > > > > +1 for Oozie to set recovery to 1 for the launcher and 0 for the jobs
> > > > > in all cases except MR.
> > > > >
> > > > > After the ID is swapped, Oozie only knows about the MR job, doesn't it?
> > > > > Then there should not be any problem.
> > > > >
> > > > > If we set the MR launcher recover to 0 and the job to 1, then the job
> > > > > will succeed in case of a JT restart.
> > > > >
> > > > > Am I missing something?
> > > > >
> > > > > Thanks,
> > > > > Mayank
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I think you usually just get the "Unknown Hadoop Job" error message
> > > > > > because Oozie tries to look up the Hadoop Job ID it already has, but
> > > > > > the JT no longer has that ID because it was restarted.  With JT
> > > > > > Recoverability turned on, it will restart the job using the same ID,
> > > > > > so Oozie continues just fine.
> > > > > >
> > > > > > - Robert
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Wouldn't Oozie poll for the job status, decide that it has failed,
> > > > > > > and, when the JT comes up, launch another one if retry is
> > > > > > > configured?
> > > > > > >
> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We looked into how to support Job Recoverability (i.e. the JT is
> > > > > > > > restarted and it wants to restart the jobs that were running;
> > > > > > > > similarly for YARN) and have a pretty simple solution for all of
> > > > > > > > the action types except for MapReduce.  If we set
> > > > > > > > mapreduce.job.restart.recover=true for the launcher job and
> > > > > > > > mapreduce.job.restart.recover=false for the jobs launched by the
> > > > > > > > launcher, then when the JT restarts, it will recover the launcher
> > > > > > > > job but not the child jobs -- the launcher job will then take care
> > > > > > > > of relaunching the child jobs.
> > > > > > > >
> > > > > > > > For MapReduce, because of the optimization with the id swap, this
> > > > > > > > won't work.  It would be very tricky, if it's even practical, to
> > > > > > > > do something similar for the MR action.  Instead, we think it
> > > > > > > > would be best if we simply remove the MR optimization and make it
> > > > > > > > just like the other action types.  I know we normally don't want
> > > > > > > > to remove optimizations, but there are many advantages in this
> > > > > > > > case, and it's only saving a single Map slot for MR jobs only.
> > > > > > > >
> > > > > > > > I've created OOZIE-1483
> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with more
> > > > > > > > details and should have a patch soon.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > >
> > > > > > > > thanks
> > > > > > > > - Robert
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
