Re: Job Recoverability

Rohini Palaniswamy Tue, 06 Aug 2013 15:24:45 -0700

Robert,
    You will not get an "Unknown Hadoop Job" error if the JT has retry
configured, right? What happens in that case? In particular, what happens
if the Oozie retry fires when the JT comes back up quickly?  Also, do you
know what the behaviour is with Hadoop 2.x?
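
To make sure I am reading the cases reported so far correctly, here is a
rough sketch of them. It is only a model of what is described further down
this thread, not Oozie code: the function names are made up, and the
10 min check interval and 3 x 1 min retries are simply the defaults Robert
quoted.

```python
# Hypothetical model of the behaviour described in this thread; not Oozie code.
# Timing values are the defaults quoted below (assumed, not verified here).
CHECK_INTERVAL_MIN = 10   # ActionCheckXCommand period
RETRIES = 3               # connection retries after a failed check
RETRY_INTERVAL_MIN = 1    # minutes between retries


def give_up_after_min(minutes_until_next_check):
    """Minutes after the JT goes down at which Oozie stops retrying."""
    return minutes_until_next_check + RETRIES * RETRY_INTERVAL_MIN


def outcome(jt_downtime_min, minutes_until_next_check, jt_recovers_jobs):
    """Return the state reported in the thread for one JT outage."""
    if jt_downtime_min >= give_up_after_min(minutes_until_next_check):
        # Check plus retries all failed before the JT came back.
        return "action START_MANUAL, workflow SUSPENDED"
    if jt_recovers_jobs:
        # JT restarts the job under the same ID, so the lookup succeeds.
        return "same Hadoop job ID, Oozie continues"
    # JT is back but no longer knows the old job ID.
    return "Unknown Hadoop Job error"


if __name__ == "__main__":
    for downtime, recovers in [(2, True), (2, False), (20, True)]:
        print(downtime, recovers, "->",
              outcome(downtime, CHECK_INTERVAL_MIN, recovers))
```

My question above is about the last branch: whether a configured retry
changes what happens there.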


Mayank,
  OOZIE-1231 already has the changes to show the MapReduce job id in the
Child job page, to be consistent with other job types. The v1 API keeps the
older behaviour, with the map job url in externalId, while the v2 API has it
in childjobids.  So there is a UI change, but the v1 REST API has not
changed. OOZIE-1231 has also not changed any code with respect to the id swap.
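
For reference, the MR restart scenarios Robert listed further down (id swap
left as is, launcher recover set to 0, job recover set to 1) can be
summarised in a small sketch. This is just my reading of his description;
the enum and function names are hypothetical and nothing here is Oozie code.

```python
from enum import Enum


class RestartPoint(Enum):
    """When the JT restart happens relative to the MR launcher (hypothetical names)."""
    BEFORE_REAL_JOB_LAUNCHED = 1
    AFTER_LAUNCH_BEFORE_ID_SWAP = 2
    AFTER_ID_SWAP = 3


def mr_action_result(restart_point):
    """Return (real job result, Oozie action result).

    Assumes launcher recover = 0 and job recover = 1, as in the scenarios
    quoted in this thread.
    """
    if restart_point is RestartPoint.BEFORE_REAL_JOB_LAUNCHED:
        # Launcher is not recovered, so the real job is never submitted.
        return ("never launched", "failed")
    if restart_point is RestartPoint.AFTER_LAUNCH_BEFORE_ID_SWAP:
        # Real job is recovered, but Oozie still tracks the launcher ID.
        return ("recovered, succeeded", "failed")
    # Oozie already tracks the real job ID, which the JT recovers.
    return ("recovered, succeeded", "succeeded")


if __name__ == "__main__":
    for point in RestartPoint:
        print(point.name, "->", mr_action_result(point))
```

Only the last case ends with the action marked as succeeded.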

Regards,
Rohini

On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <[email protected]> wrote:

> Ya, I saw a precondition failed message.
>
> I just tried out what happens when the job is SUSPENDED, the action is
> START_MANUAL, and the JT recovers the hadoop job: It doesn't continue the
> workflow.  It fails the eagerVerifyPrecondition from
> CompletedActionXCommand because the action isn't RUNNING.  Perhaps we
> should make the CallbackService change the status in this situation?
>
> Just to clarify, the above only happens when the JT has been down long
> enough that the ActionCheckXCommand (every 10min by default) + the retries
> (3 x 1min) happen.  If it comes back sooner than that, everything works
> fine.
>
> thanks
> - Robert
>
>
>
>
>
>
> On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <[email protected]> wrote:
>
> > Oh, okay. It seems the RecoveryService queues the StartX command, but
> > verifyPrecondition() fails because the wf job is SUSPENDED (please verify
> > this from the logs).
> >
> > In that case, if Oozie is not auto-retrying and resubmitting, then it
> > seems fair to have the JT recover the job.
> > But if JT recovers the job, can we make sure that the workflow job
> > transits to RUNNING from SUSPENDED and wf action from START_MANUAL to
> > RUNNING?
> > It should not happen that the user resumes the job which makes Oozie
> > submit a new hadoop job while the JT is also recovering the same job.
> > Also, I think the error can still be considered transient from Oozie
> > perspective as it is temporary depending on state of JT.
> >
> > Thanks,
> > Virag
> >
> >
> > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]> wrote:
> >
> > >Virag,
> > >I just tested out killing the JT and waiting for the Checker service to
> > >retry and give up: the action goes to START_MANUAL and the job gets
> > >SUSPENDED.  I waited around long enough, but the RecoveryService didn't do
> > >anything.  Does it kick in for you?  As a side note, looking at the code,
> > >the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and
> > >USER_RETRY, which all sound like things the user should be doing; is it
> > >correct that RecoveryService is handling these?
> > >The Unknown Hadoop Job error happens when the JT comes back in time,
> > >because it won't know about the old ID if it's not recovering jobs.  So,
> > >Oozie tries to ask it about a job that no longer exists.  I'm not sure that
> > >this should be a transient error, because there's no way to determine
> > >whether it's because the JT restarted and Oozie should resubmit the job or
> > >because something else happened.
> > >
> > >Mayank,
> > >That is a good point.  We could either make a v3 API or add an oozie-site
> > >config to turn on/off the id swap behavior and keep the v2 API.
> > >
> > >thanks
> > >- Robert
> > >
> > >
> > >
> > >
> > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <[email protected]>
> wrote:
> > >
> > >> Robert,
> > >>
> > >> That's a break in backward compatibility. Until now, users have been
> > >> used to clicking on the link to go to the MR page.
> > >>
> > >> Is there a better way to handle this?
> > >>
> > >> Thanks,
> > >> Mayank
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]>
> > >> wrote:
> > >>
> > >> > Mona,
> > >> > As far as I'm aware, the "retry" that Oozie is doing is just retrying to
> > >> > connect to the JT (which is why when the JT comes back up, Oozie
> > >> > can continue monitoring the hadoop job if it still has the same ID); it
> > >> > doesn't try to submit the job again as part of the "retry".
> > >> >
> > >> > Mayank,
> > >> > We can put the ID for the actual job in the Child IDs tab (like with Pig).
> > >> >
> > >> >
> > >> > - Robert
> > >> >
> > >> >
> > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]>
> > >> wrote:
> > >> >
> > >> > > I agree, we should handle these two scenarios. I am OK with changing
> > >> > > the launcher behavior for MR. However, if we remove the id swap, then
> > >> > > how do we navigate to MR jobs from the UI as we do right now?
> > >> > >
> > >> > > Thanks,
> > >> > > Mayank
> > >> > >
> > >> > >
> > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter
> > >><[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Suppose we leave the MR ID swap thing as is but set the launcher recover
> > >> > > > to 0 and job to 1; then consider these two scenarios:
> > >> > > >
> > >> > > > 1. JT gets restarted during the launcher job but before the launcher job
> > >> > > > actually launches the real job:
> > >> > > >      - The launcher job won't be recovered because we told it not to
> > >> > > >      - The real job was never launched
> > >> > > >      ---> Action never completes and Oozie marks it as failed
> > >> > > >
> > >> > > > 2. Launcher job submits the real job, but JT gets restarted before the
> > >> > > > Oozie server has a chance to swap IDs (it's not an atomic operation):
> > >> > > >      - The launcher job won't be recovered because we told it not to
> > >> > > >      - The real job will be recovered and finish successfully
> > >> > > >      ---> Oozie marks the action as failed even though the actual job
> > >> > > >      succeeded because it didn't know about the ID swap
> > >> > > >
> > >> > > > It would only work for the case where the JT gets restarted after the ID
> > >> > > > swap occurs.
> > >> > > >
> > >> > > >
> > >> > > > - Robert
> > >> > > >
> > >> > > >
> > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <
> [email protected]
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > Hi Robert,
> > >> > > > >
> > >> > > > > +1 for Oozie to set recovery to 1 for the launcher and 0 for the jobs
> > >> > > > > in all cases except MR.
> > >> > > > >
> > >> > > > > After the ID swap, Oozie only knows about the MR job, doesn't it? Then
> > >> > > > > there should not be any problem.
> > >> > > > >
> > >> > > > > If we set MR launcher recover to 0 and job to 1, then the job will
> > >> > > > > succeed in case of a JT restart.
> > >> > > > >
> > >> > > > > Am I missing something?
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Mayank
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <
> > >> [email protected]>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > I think you usually just get the "Unknown Hadoop Job" error message
> > >> > > > > > because Oozie tries to look up the Hadoop Job ID it already has, but
> > >> > > > > > the JT no longer has that ID because it was restarted.  With JT
> > >> > > > > > Recoverability turned on, it will restart the job using the same ID,
> > >> > > > > > so Oozie continues just fine.
> > >> > > > > >
> > >> > > > > > - Robert
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> > >> > > > > > <[email protected]>wrote:
> > >> > > > > >
> > >> > > > > > > Wouldn't Oozie poll for the job status, decide that it has failed,
> > >> > > > > > > and, when the JT comes up, launch another one if retry is configured?
> > >> > > > > > >
> > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <
> > >> > > [email protected]>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi,
> > >> > > > > > > >
> > >> > > > > > > > We looked into how to support Job Recoverability (i.e. the JT is
> > >> > > > > > > > restarted and it wants to restart the jobs that were running;
> > >> > > > > > > > similarly for YARN) and have a pretty simple solution for all of the
> > >> > > > > > > > action types except for MapReduce.  If we set
> > >> > > > > > > > mapreduce.job.restart.recover=true for the launcher job and
> > >> > > > > > > > mapreduce.job.restart.recover=false for the jobs launched by the
> > >> > > > > > > > launcher, then when the JT restarts, it will recover the launcher job
> > >> > > > > > > > but not the child jobs -- the launcher job will then take care of
> > >> > > > > > > > relaunching the child jobs.
> > >> > > > > > > >
> > >> > > > > > > > For MapReduce, because of the optimization with the id swap, this
> > >> > > > > > > > won't work.  It would be very tricky, if it's even practical, to do
> > >> > > > > > > > something similar for the MR action.  Instead, we think it would be
> > >> > > > > > > > best if we simply remove the MR optimization and make it just like the
> > >> > > > > > > > other action types.  I know we normally don't want to remove
> > >> > > > > > > > optimizations, but there are many advantages in this case, and it's
> > >> > > > > > > > only saving a single Map slot, and for MR jobs only.
> > >> > > > > > > >
> > >> > > > > > > > I've created OOZIE-1483
> > >> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with more details
> > >> > > > > > > > and should have a patch soon.
> > >> > > > > > > >
> > >> > > > > > > > Thoughts?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > thanks
> > >> > > > > > > > - Robert
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> >
>
