Re: Job Recoverability

Alejandro Abdelnur Wed, 07 Aug 2013 14:50:02 -0700

[joining the party a bit late]

I just had an offline call with RobertK, who brought me up to speed.


By design, Oozie will retry starting a workflow action ONLY if it couldn't
start the WF action before. If Oozie started the WF action successfully,
the WF action state goes into RUNNING, and from then on it is the
responsibility of the external system running the action to recover it.
Oozie will not attempt any recovery after that point.
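
A minimal sketch of that rule, assuming a toy model rather than Oozie's
actual classes (Action, submit and start_with_retry are hypothetical names
used only for illustration): the start is retried only while the action has
never reached RUNNING, and nothing is resubmitted afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """Toy stand-in for a workflow action; not Oozie's real data model."""
    status: str = "PREP"
    external_id: Optional[str] = None


def start_with_retry(action: Action, submit: Callable[[], str],
                     max_attempts: int = 3) -> bool:
    """Try to start the action; retry only while it has never started."""
    if action.status == "RUNNING":
        # Already started: recovery belongs to the external system.
        return False
    for _ in range(max_attempts):
        try:
            action.external_id = submit()
        except ConnectionError:
            continue  # the start itself failed, so retrying is safe
        action.status = "RUNNING"
        return True
    return False
```

Once start_with_retry() has moved the action to RUNNING, a second call
returns without calling submit() again.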

This means that with Hadoop (JT or YARN) job recovery, the launcher job
will be recovered by Hadoop without any intervention from Oozie.

It is clear that to have recovery for the MR action we need to get rid of
the ID swap and just hold onto the MR launcher job as we do for the other
actions.

Now, on the whole discussion of the ActionCheckXCommand retries. We have a
bug in ActionCheckXCommand: in handleNonTransient() we should not change
the status of the WF action to START_MANUAL; we should leave it in RUNNING.
handleNonTransient() will suspend the WF job, thus switching off action
checks. On WF job resume, the action checks will start working again, and
if Hadoop has job recovery, things will work fine. Otherwise the WF action
will fail because the launcher job is not known (the external system does
not know how to recover jobs). Because we are resetting the status to
START_MANUAL, we are dialing back the lifecycle of the action; that is
incorrect, and it creates the race condition that introduces 2 jobs.
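
To make the race concrete, here is a small sketch under stated assumptions:
ToyCluster, ToyAction, handle_non_transient, resume and jobs_after_resume
are hypothetical names, and the model only counts job submissions. It is
not the ActionCheckXCommand code, and the WF job suspension itself is not
modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCluster:
    """Hypothetical stand-in for Hadoop; records every submitted job."""
    jobs: List[str] = field(default_factory=list)

    def submit(self) -> str:
        job_id = f"job_{len(self.jobs) + 1}"
        self.jobs.append(job_id)
        return job_id


@dataclass
class ToyAction:
    status: str
    external_id: str


def handle_non_transient(action: ToyAction, reset_status: bool) -> None:
    """Model of the error handler for a non-transient check failure."""
    if reset_status:
        action.status = "START_MANUAL"  # behavior described as the bug
    # Otherwise (proposed behavior) the action stays in RUNNING.


def resume(action: ToyAction, cluster: ToyCluster) -> None:
    """On resume, START_MANUAL is started again; RUNNING is only checked."""
    if action.status == "START_MANUAL":
        action.external_id = cluster.submit()
        action.status = "RUNNING"


def jobs_after_resume(reset_status: bool) -> int:
    """Count submissions for one action across a suspend/resume cycle."""
    cluster = ToyCluster()
    action = ToyAction(status="RUNNING", external_id=cluster.submit())
    handle_non_transient(action, reset_status)
    resume(action, cluster)  # Hadoop may be recovering job_1 meanwhile
    return len(cluster.jobs)
```

In this model, jobs_after_resume(True) counts two submissions for the one
action, while jobs_after_resume(False) counts a single one.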

So again, Oozie is not responsible for recovering actions. With that
assumption, fixing handleNonTransient() to leave the status in RUNNING
and getting rid of the MR ID swap logic, we should be good.

Thoughts?




On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <[email protected]> wrote:

> Robert,
>
> I have been thinking about this for a while and have a few more concerns
> if the job retries are not streamlined through Oozie.
>
> 1) Till the JT finishes recovering the job, the wf job/wf action status
> will be SUSPENDED/START_MANUAL.
> Isn't it misleading that the hadoop job is RUNNING while oozie incorrectly
> shows it as SUSPENDED? Even if we allow this, after the job completes, what
> if the callback is lost or oozie is down?
> To prevent the job being in SUSPENDED forever, we need to hack our
> services to pull SUSPENDED/START_MANUAL jobs from db and update their
> status.
>
> 2) Should we allow failing of the user RESUME command if the action is in
> START_MANUAL to prevent the race condition we were discussing?
> This would mean changing the semantics of the states.
>
> 3) Confused on mapred.job.restart.recover. Reading
> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says
> that the default value of this is true. So,
> if mapred.jobtracker.restart.recover (system config) is already enabled,
> is job recovery on by default? Also, does recover mean the job will start
> where it left from or is it just plain restart?
>
> In summary, IMO allowing hadoop to recover jobs independently, bypassing
> Oozie, isn't trivial. It would have helped if the JT produced a notification
> when it comes online, so Oozie could retry after consuming it. But
> currently, notification only happens when a task completes.
>
> An alternate approach is to modify the semantics of START_MANUAL.
> Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and expects
> the user to resume it. We can change this and make Oozie retry the
> START_MANUAL actions at configurable interval (~30 mins or some scheme
> like exponential backoff). Of course, this is bad, as oozie will keep
> polling hadoop at some interval, but manual resume of jobs that have faced
> transient errors will no longer be mandatory.
>
> --Virag
>
>
> On 8/6/13 4:38 PM, "Robert Kanter" <[email protected]> wrote:
>
> >If ActionCheckX is trying to retry, and the JT recovers the job, that
> >should be fine.  The "retry" is to simply try connecting to the JT to get
> >the status for the job.  If the user issues a "RESUME" for a START_MANUAL
> >job, then yes, Oozie will try to resubmit a new job for that action and
> >we'd have two of them if the JT also recovers it.
> >
> >What if we modified the ActionStartXCommand/ResumeActionXCommand
> >precondition to check if the action already has a Job ID that is valid
> >(i.e. not unknown to the JT), then it fails the precondition check or
> >something similar?
> >
> >- Robert
> >
> >
> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <[email protected]>
> wrote:
> >
> >> ActionCheckx first retries for a configurable amount of time and then
> >> makes the status as START_MANUAL.
> >> So, the problem might happen when JT recovers the job during the same
> >>time
> >> when 1) ActionCheckX is trying to retry or the 2) user issues a "RESUME"
> >> for a start_manual job.
> >> We have to ensure that this doesn't happen otherwise we will have two
> >> hadoop jobs for the same action.
> >> The callback happens only when the task is completed which might be too
> >> late. During that time, Oozie might have already submitted a new hadoop
> >> job for that wf action.
> >> So it doesn't seem straightforward to prevent Oozie to submit a new job
> >>if
> >> the JT is already recovering the older one.
> >>
> >>
> >>
> >> On 8/6/13 4:01 PM, "Robert Kanter" <[email protected]> wrote:
> >>
> >> >Yes, if JT recovers the job, it uses the same ID.  If the JT comes up
> >> >quickly and recovers the job, Oozie continues working just fine
> >>(without
> >> >the ID swap issues discussed earlier).  When the JT takes longer than
> >>the
> >> >10min ActionCheck interval, and the action is START_MANUAL, that still
> >> >needs to be figured out.
> >> >
> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it should
> >>have
> >> >the same behavior.  The only differences are that the name of the
> >>property
> >> >to enable recoverability on the server (not the job-level one) is
> >> >different
> >> >obviously because it doesn't have "jobtracker" in it and it can also
> >> >recover the completed tasks, which shouldn't be a problem because the
> >> >launcher jar has the one task.  I'll of course double check this
> >>though.
> >> >
> >> >
> >> >- Robert
> >> >
> >> >
> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
> >> ><[email protected]>wrote:
> >> >
> >> >> Robert,
> >> >>     You will not get a unknown hadoop job if JT has retry configured
> >> >>right?
> >> >> What happens in that case? Especially what happens when Oozie retry
> >> >>happens
> >> >> when JT comes up quickly?  Also do you know what is the behaviour
> >>with
> >> >> Hadoop 2.x ?
> >> >>
> >> >> Mayank,
> >> >>   OOZIE-1231 already has the changes to show Mapreduce job id in the
> >> >>Child
> >> >> job page to be consistent with other job types. The v1 API has the
> >>older
> >> >> behaviour with map job url in externalId, while v2 API has it in
> >> >> childjobids.  So there is a UI change but v1 REST API has not
> >>changed.
> >> >>But
> >> >> OOZIE-1231 has not changed any code with respect to id swap.
> >> >>
> >> >> Regards,
> >> >> Rohini
> >> >>
> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Ya, I saw a precondition failed message.
> >> >> >
> >> >> > I just tried out what happens when the job is SUSPENDED, the
> >>action is
> >> >> > START_MANUAL, and the JT recovers the hadoop job: It doesn't
> >>continue
> >> >>the
> >> >> > workflow.  It fails the eagerVerifyPrecondition from
> >> >> > CompletedActionXCommand because the action isn't RUNNING.  Perhaps
> >>we
> >> >> > should make the CallbackService change the status in this
> >>situation?
> >> >> >
> >> >> > Just to clarify, the above only happens when the JT has been down
> >>long
> >> >> > enough that the ActionCheckXCommand (every 10min by default) + the
> >> >> retries
> >> >> > (3 x 1min) happen.  If it comes back sooner than that, everything
> >> >>works
> >> >> > fine.
> >> >> >
> >> >> > thanks
> >> >> > - Robert
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <[email protected]
> >
> >> >> wrote:
> >> >> >
> >> >> > > Oh..okay. Seems like RecoveryService queues the StartX command
> >>but
> >> >>the
> >> >> > > verifyPrecondition() fails as the wf job is
> >> >> > > Suspended (Plz verify this from logs).
> >> >> > >
> >> >> > > In that case, if Oozie is not auto-retrying and resubmitting,
> >>then
> >> >>it
> >> >> > > seems fair to have the JT recover the job.
> >> >> > > But if JT recovers the job, can we make sure that the workflow
> >>job
> >> >> > > transits to RUNNING from SUSPENDED and wf action from
> >>START_MANUAL
> >> >>to
> >> >> > > RUNNING?
> >> >> > > It should not happen that the user resumes the job which makes
> >>Oozie
> >> >> > > submit a new hadoop job while the JT is also recovering the same
> >> >>job.
> >> >> > > Also, I think the error can still be considered transient from
> >>Oozie
> >> >> > > perspective as it is temporary depending on state of JT.
> >> >> > >
> >> >> > > Thanks,
> >> >> > > Virag
> >> >> > >
> >> >> > >
> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]> wrote:
> >> >> > >
> >> >> > > >Virag,
> >> >> > > >I just tested out killing the JT and waiting for the Checker
> >> >>service
> >> >> to
> >> >> > > >retry and give up: the action goes to START_MANUAL and the job
> >>gets
> >> >> > > >SUSPENDED.  I waited around long enough, but the RecoveryService
> >> >> didn't
> >> >> > do
> >> >> > > >anything.  Does it kick in for you?  As a side note, looking at
> >>the
> >> >> > code,
> >> >> > > >the RecoveryService looks like it can handle START_MANUAL,
> >> >>END_MANUAL,
> >> >> > and
> >> >> > > >USER_RETRY, which all sound like things the user should be
> >>doing;
> >> >>is
> >> >> it
> >> >> > > >correct that RecoveryService is handling these?
> >> >> > > >The Unknown Hadoop Job error happens when the JT comes back in
> >>time
> >> >> > > >because
> >> >> > > >it won't know about the old ID if its not recovering jobs.  So,
> >> >>Oozie
> >> >> > > >tries
> >> >> > > >to ask it about a job that no longer exists.  I'm not sure that
> >> >>this
> >> >> > > >should
> >> >> > > >be a transient error because there's no way to determine if its
> >> >> because
> >> >> > > >the
> >> >> > > >JT restarted and Oozie should resubmit the job or if something
> >>else
> >> >> > > >happened.
> >> >> > > >
> >> >> > > >Mayank,
> >> >> > > >That is a good point.  We could either make a v3 API or add an
> >> >> > oozie-site
> >> >> > > >config to turn on/off the id swap behavior and keep the v2 API.
> >> >> > > >
> >> >> > > >thanks
> >> >> > > >- Robert
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal
> >><[email protected]>
> >> >> > wrote:
> >> >> > > >
> >> >> > > >> Robert,
> >> >> > > >>
> >> >> > > >> Thats a break in backward compatibility. Till now user are
> >>used
> >> >>to
> >> >> > > >>click on
> >> >> > > >> to link to go to MR page.
> >> >> > > >>
> >> >> > > >> Is there a better way to handle this?
> >> >> > > >>
> >> >> > > >> Thanks,
> >> >> > > >> Mayank
> >> >> > > >>
> >> >> > > >>
> >> >> > > >>
> >> >> > > >>
> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <
> >> >> [email protected]>
> >> >> > > >> wrote:
> >> >> > > >>
> >> >> > > >> > Mona,
> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just
> >> >> > retrying
> >> >> > > >>to
> >> >> > > >> > connect to the JT (which is why when the JT comes back up,
> >> >>Oozie
> >> >> > > >> > can continue monitoring the hadoop job if it still has the
> >>same
> >> >> ID);
> >> >> > > >>it
> >> >> > > >> > doesn't try to submit the job again as part of the "retry".
> >> >> > > >> >
> >> >> > > >> > Mayank,
> >> >> > > >> > We can put the ID for the actual job in the Child IDs tab
> >>(like
> >> >> with
> >> >> > > >> Pig).
> >> >> > > >> >
> >> >> > > >> >
> >> >> > > >> > - Robert
> >> >> > > >> >
> >> >> > > >> >
> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal
> >> >><[email protected]
> >> >> >
> >> >> > > >> wrote:
> >> >> > > >> >
> >> >> > > >> > > I agree , we should handle these two scenarios, I am ok
> >>with
> >> >> > > >>changing
> >> >> > > >> the
> >> >> > > >> > > launcher behavior for MR however if we remove the id swap
> >> >>then
> >> >> how
> >> >> > > >>we
> >> >> > > >> > > navigate to MR jobs from UI as we do right now?
> >> >> > > >> > >
> >> >> > > >> > > Thanks,
> >> >> > > >> > > Mayank
> >> >> > > >> > >
> >> >> > > >> > >
> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter
> >> >> > > >><[email protected]>
> >> >> > > >> > > wrote:
> >> >> > > >> > >
> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the
> >> >> launcher
> >> >> > > >> > recover
> >> >> > > >> > > to
> >> >> > > >> > > > 0 and job to 1; then consider these two scenarios:
> >> >> > > >> > > >
> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before
> >>the
> >> >> > > >>launcher
> >> >> > > >> > job
> >> >> > > >> > > > actually launches the real job:
> >> >> > > >> > > >      - The launcher job won't be recovered because we
> >>told
> >> >>it
> >> >> > not
> >> >> > > >>to
> >> >> > > >> > > >      - The real job was never launched
> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as
> >> >>failed
> >> >> > > >> > > >
> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets
> >>restarted
> >> >> > before
> >> >> > > >> the
> >> >> > > >> > > > Oozie server has a chance to swap IDs (its not an atomic
> >> >> > > >>operation):
> >> >> > > >> > > >      - The launcher job won't be recovered because we
> >>told
> >> >>it
> >> >> > not
> >> >> > > >>to
> >> >> > > >> > > >      - The real job will be recovered and finish
> >> >>successfully
> >> >> > > >> > > >      ---> Oozie marks the action as failed even though
> >>the
> >> >> > actual
> >> >> > > >>job
> >> >> > > >> > > > succeeded because it didn't know about the ID swap
> >> >> > > >> > > >
> >> >> > > >> > > > It would only work for the case where the JT gets
> >>restarted
> >> >> > after
> >> >> > > >>the
> >> >> > > >> > ID
> >> >> > > >> > > > swap occurs.
> >> >> > > >> > > >
> >> >> > > >> > > >
> >> >> > > >> > > > - Robert
> >> >> > > >> > > >
> >> >> > > >> > > >
> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <
> >> >> > [email protected]
> >> >> > > >
> >> >> > > >> > > wrote:
> >> >> > > >> > > >
> >> >> > > >> > > > > Hi Robert,
> >> >> > > >> > > > >
> >> >> > > >> > > > > +1 for oozie to set launcher to 1 and 0 to jobs for
> >> >>recovery
> >> >> > in
> >> >> > > >>all
> >> >> > > >> > the
> >> >> > > >> > > > > cases except MR.
> >> >> > > >> > > > >
> >> >> > > >> > > > > As after Id swapped Oozie only know about MR job isn't
> >> >>it?
> >> >> > then
> >> >> > > >> there
> >> >> > > >> > > > > should not be any problem.
> >> >> > > >> > > > >
> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1 then
> >>job
> >> >> will
> >> >> > be
> >> >> > > >> > > succeeded
> >> >> > > >> > > > > in case of JT restart.
> >> >> > > >> > > > >
> >> >> > > >> > > > > AM I missing something?
> >> >> > > >> > > > >
> >> >> > > >> > > > > Thanks,
> >> >> > > >> > > > > Mayank
> >> >> > > >> > > > >
> >> >> > > >> > > > >
> >> >> > > >> > > > >
> >> >> > > >> > > > >
> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <
> >> >> > > >> [email protected]>
> >> >> > > >> > > > > wrote:
> >> >> > > >> > > > >
> >> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop
> >>Job"
> >> >> error
> >> >> > > >> message
> >> >> > > >> > > > > because
> >> >> > > >> > > > > > Oozie tries to look up the Hadoop Job ID it already
> >> >>has,
> >> >> but
> >> >> > > >>the
> >> >> > > >> JT
> >> >> > > >> > > no
> >> >> > > >> > > > > > longer has that ID because it was restarted.  With
> >>JT
> >> >> > > >> > Recoverability
> >> >> > > >> > > > > turned
> >> >> > > >> > > > > > on, it will restart the job using the same ID, so
> >>Oozie
> >> >> > > >>continues
> >> >> > > >> > > just
> >> >> > > >> > > > > > fine.
> >> >> > > >> > > > > >
> >> >> > > >> > > > > > - Robert
> >> >> > > >> > > > > >
> >> >> > > >> > > > > >
> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> >> >> > > >> > > > > > <[email protected]>wrote:
> >> >> > > >> > > > > >
> >> >> > > >> > > > > > > Wouldn't oozie poll for the job status and decide
> >> >>that
> >> >> it
> >> >> > > >>has
> >> >> > > >> > > failed
> >> >> > > >> > > > > and
> >> >> > > >> > > > > > > when JT comes up launch another one if retry is
> >> >> > configured?
> >> >> > > >> > > > > > >
> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <
> >> >> > > >> > > [email protected]>
> >> >> > > >> > > > > > > wrote:
> >> >> > > >> > > > > > >
> >> >> > > >> > > > > > > > Hi,
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > > We looked into how to support Job Recoverability
> >> >>(i.e.
> >> >> > > >>the JT
> >> >> > > >> > is
> >> >> > > >> > > > > > > restarted
> >> >> > > >> > > > > > > > and it wants to restart the jobs that were
> >>running;
> >> >> > > >>similarly
> >> >> > > >> > for
> >> >> > > >> > > > > YARN)
> >> >> > > >> > > > > > > and
> >> >> > > >> > > > > > > > have a pretty simple solution for all of the
> >>action
> >> >> > types
> >> >> > > >> > except
> >> >> > > >> > > > for
> >> >> > > >> > > > > > > > MapReduce.  If we set
> >> >> mapreduce.job.restart.recover=true
> >> >> > > >>for
> >> >> > > >> > the
> >> >> > > >> > > > > > launcher
> >> >> > > >> > > > > > > > job and mapreduce.job.restart.recover=false for
> >>the
> >> >> jobs
> >> >> > > >> > launched
> >> >> > > >> > > > by
> >> >> > > >> > > > > > the
> >> >> > > >> > > > > > > > launcher, then when the JT restarts, it will
> >> >>recover
> >> >> the
> >> >> > > >> > launcher
> >> >> > > >> > > > job
> >> >> > > >> > > > > > but
> >> >> > > >> > > > > > > > not the child jobs -- the launcher job will then
> >> >>take
> >> >> > > >>care of
> >> >> > > >> > > > > > relaunching
> >> >> > > >> > > > > > > > the child jobs.
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > > For MapReduce, because of the optimization with
> >> >>the id
> >> >> > > >>swap,
> >> >> > > >> > this
> >> >> > > >> > > > > won't
> >> >> > > >> > > > > > > > work.  It would be very tricky, if it's even
> >> >> practical,
> >> >> > > >>to do
> >> >> > > >> > > > > something
> >> >> > > >> > > > > > > > similar for the MR action.  Instead, we think it
> >> >>would
> >> >> > be
> >> >> > > >> best
> >> >> > > >> > if
> >> >> > > >> > > > we
> >> >> > > >> > > > > > > simply
> >> >> > > >> > > > > > > > remove the MR optimization and make it just like
> >> >>the
> >> >> > other
> >> >> > > >> > action
> >> >> > > >> > > > > > types.
> >> >> > > >> > > > > > >  I
> >> >> > > >> > > > > > > > know we normally don't want to remove
> >> >>optimizations,
> >> >> but
> >> >> > > >> there
> >> >> > > >> > > are
> >> >> > > >> > > > > many
> >> >> > > >> > > > > > > > advantages in this case, and it's only saving a
> >> >>single
> >> >> > Map
> >> >> > > >> slot
> >> >> > > >> > > for
> >> >> > > >> > > > > MR
> >> >> > > >> > > > > > > jobs
> >> >> > > >> > > > > > > > only.
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > > I've created OOZIE-1483 <
> >> >> > > >> > > > > > > https://issues.apache.org/jira/browse/OOZIE-1483>
> >> >> > > >> > > > > > > > with
> >> >> > > >> > > > > > > > more details and should have a patch soon.
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > > Thoughts?
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > > > thanks
> >> >> > > >> > > > > > > > - Robert
> >> >> > > >> > > > > > > >
> >> >> > > >> > > > > > >
> >> >> > > >> > > > > >
> >> >> > > >> > > > >
> >> >> > > >> > > >
> >> >> > > >> > >
> >> >> > > >> >
> >> >> > > >>
> >> >> > >
> >> >> > >
> >> >> >
> >> >>
> >>
> >>
>
>


-- 
Alejandro
