Re: Job Recoverability

Alejandro Abdelnur Thu, 08 Aug 2013 13:32:49 -0700

The behavior that the change in 1) fixes is a bug, a nasty one. It is a problem
whether JT recovery is turned ON or OFF, and with any version of Hadoop.


It has to be fixed.
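
To make the bug concrete, here is a minimal sketch of the two behaviors. It is
Python rather than Oozie's Java, and every class and function name in it is
hypothetical; it only models the state changes discussed in this thread
(RUNNING vs. START_MANUAL on the action, SUSPENDED on the workflow job), not
the real ActionCheckXCommand or ResumeXCommand code.

```python
from enum import Enum


class ActionStatus(Enum):
    RUNNING = "RUNNING"
    START_MANUAL = "START_MANUAL"


class JobStatus(Enum):
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"


class Action:
    """Hypothetical stand-in for a workflow action with one launcher job."""

    def __init__(self, external_id):
        self.external_id = external_id
        self.status = ActionStatus.RUNNING
        self.submissions = 1  # the launcher job was already submitted once


class WorkflowJob:
    """Hypothetical stand-in for the workflow job that owns the action."""

    def __init__(self, action):
        self.status = JobStatus.RUNNING
        self.action = action


def handle_non_transient(job, reset_to_start_manual):
    """Model of what happens once the action check gives up reaching the JT."""
    job.status = JobStatus.SUSPENDED  # suspending switches off action checks
    if reset_to_start_manual:
        # Behavior called a bug in this thread: the action lifecycle is dialed back.
        job.action.status = ActionStatus.START_MANUAL


def resume(job):
    """Model of a user RESUME: only a START_MANUAL action gets resubmitted."""
    job.status = JobStatus.RUNNING
    if job.action.status is ActionStatus.START_MANUAL:
        job.action.submissions += 1  # a second Hadoop job for the same action
        job.action.status = ActionStatus.RUNNING
```

With the reset, a RESUME resubmits the action even though Hadoop may be
recovering the original job; without it, the action stays RUNNING and the
outcome depends only on whether the external system recovered the job.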

Also, Hadoop 1 JT job recovery is stable and works as expected.

Thanks.


On Thu, Aug 8, 2013 at 10:56 AM, Rohini Palaniswamy <[email protected]> wrote:

> Haven't gone through the whole thread in detail yet. But looking at the
> change mentioned in 1), the first thing that comes to my mind is that it
> might not work as expected if job recoverability is not turned on. We need
> to consider that case. We cannot expect everyone to be on the latest
> version of Hadoop and have recoverability turned on. Job recoverability in
> hadoop is not fully mature yet and not tested well.
>
> On Thu, Aug 8, 2013 at 10:17 AM, Robert Kanter <[email protected]>
> wrote:
>
> > So, does this sound good?
> >
> > 1) Create a JIRA to make the ActionCheckXCommand leave the action RUNNING
> > instead of START_MANUAL, and to make ResumeXCommand not resubmit the job
> > 2) OOZIE-1483 to remove the MR optimization and set the launcher job to
> > recover but not the real job
> >
> > The property to set a job to not recover wasn't added until Hadoop 1.2.0
> > and we're using 1.1.1, so we'll also need:
> > 3) Create a JIRA to bump up the Hadoop version to 1.2.x
> >
> > There's also a problem with the DistCp action: DistCp doesn't actually
> > read the jobconf that Oozie prepares, and recoverability is enabled by
> > default on all jobs. We can't disable it for the DistCp action until
> > DistCp is updated accordingly and we switch to a Hadoop release with that
> > fix, so we'll also need:
> > 4) A MAPREDUCE JIRA to make DistCp accept a jobconf
> > In the meantime, this will have to be a known issue: if the JT is
> > restarted with recoverability, you'll end up with two hadoop jobs running
> > DistCp
> >
> > And what should we do about the external id being the launcher job instead
> > of the real job after removing the MR optimization?
> >
> >
> > thanks
> > - Robert
> >
> >
> >
> >
> > On Wed, Aug 7, 2013 at 8:45 PM, Virag Kothari <[email protected]> wrote:
> >
> > > Ahh..I forgot about Oozie-994. My bad, I suggested that change. Everything
> > > makes sense now. Thanks!
> > >
> > > On 8/7/13 7:38 PM, "Robert Kanter" <[email protected]> wrote:
> > >
> > > >The behavior where the ActionCheckXCommand calls handleNonTransient() with
> > > >START_MANUAL when the JT can't be reached after the retries, and where the
> > > >RESUME command then resubmits the job, was something I did for OOZIE-994.
> > > >In hindsight, we shouldn't have done it that way.
> > > >
> > > >Yes, it will fail if job recovery is not enabled in the JT/RM; but I think
> > > >this is the more correct behavior as this is something that the external
> > > >system should be taking care of.
> > > >
> > > >- Robert
> > > >
> > > >
> > > >On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <[email protected]> wrote:
> > > >
> > > >> Alejandro, I agree that functionality would be preserved if the action is
> > > >> left in RUNNING during a transient error.
> > > >>
> > > >> A few questions:
> > > >>
> > > >> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a
> > > >> bug, do you know for what purpose it was introduced?
> > > >>    I thought having START_MANUAL was a way to distinguish between Oozie
> > > >> suspending the job due to a transient error and a user manually suspending
> > > >> the job.
> > > >>
> > > >> 2) With no Oozie retry on 'RESUME', jobs will fail if JT/RM recovery is
> > > >> not enabled. And it seems that YARN recovery is still not there, as
> > > >> YARN-128 is not yet committed (not sure if I'm looking at the right JIRA).
> > > >>   It's a concern for us as we ask users to RESUME their jobs after a Hadoop
> > > >> upgrade. Now they have to resume the wf and rerun the failed actions.
> > > >>
> > > >> Thanks,
> > > >> Virag
> > > >>
> > > >>
> > > >>
> > > >> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <[email protected]> wrote:
> > > >>
> > > >> >[joining the party a bit late]
> > > >> >
> > > >> >I just had an offline call with RobertK who brought me up to speed.
> > > >> >
> > > >> >By design, Oozie will retry starting a workflow action ONLY if it couldn't
> > > >> >start the WF action before. If Oozie started the WF action successfully,
> > > >> >the WF action state goes into RUNNING, and from then on it is the
> > > >> >responsibility of the external system running the action to recover it.
> > > >> >Oozie will not attempt any recovery after that point.
> > > >> >
> > > >> >This means that with Hadoop (JT or YARN) job recovery, the launcher job
> > > >> >will be recovered by Hadoop without any intervention from Oozie.
> > > >> >
> > > >> >It is clear that to have recovery for the MR action we need to get rid of
> > > >> >the swap and just hold onto the MR launcher job as we do for the other
> > > >> >actions.
> > > >> >
> > > >> >Now, on the whole discussion of the ActionCheckXCommand retries. We have a
> > > >> >bug in the ActionCheckXCommand: in handleNonTransient() we should not
> > > >> >change the status of the WF action to START_MANUAL, we should leave it in
> > > >> >RUNNING. handleNonTransient() will suspend the WF job, thus switching off
> > > >> >action checks. On WF job resume, the action checks will start working
> > > >> >again, and if Hadoop has job recovery, things will work fine. Otherwise the
> > > >> >WF action will fail because the launcher job is not known (the external
> > > >> >system does not know how to recover jobs). Because we are resetting the
> > > >> >status to START_MANUAL we are dialing back the lifecycle of the action;
> > > >> >that is incorrect, and it creates the race condition that introduces 2 jobs.
> > > >> >
> > > >> >So again, Oozie is not responsible for recovering actions. With that
> > > >> >assumption, by fixing handleNonTransient() to leave the status in RUNNING
> > > >> >and getting rid of the MR swap logic we should be good.
> > > >> >
> > > >> >Thoughts?
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <[email protected]>
> > > >> >wrote:
> > > >> >
> > > >> >> Robert,
> > > >> >>
> > > >> >> I have been thinking about this for a while and have a few more concerns
> > > >> >> if the job retries are not streamlined through Oozie.
> > > >> >>
> > > >> >> 1) Till the JT finishes recovering the job, the wf job/wf action status
> > > >> >> will be SUSPENDED/START_MANUAL.
> > > >> >> Isn't it misleading, as the hadoop job is RUNNING while Oozie incorrectly
> > > >> >> shows it as SUSPENDED? Even if we allow this, after the job completes,
> > > >> >> what if the callback is lost or Oozie is down?
> > > >> >> To prevent the job being in SUSPENDED forever, we need to hack our
> > > >> >> services to pull SUSPENDED/START_MANUAL jobs from the db and update their
> > > >> >> status.
> > > >> >>
> > > >> >> 2) Should we fail the user RESUME command if the action is in
> > > >> >> START_MANUAL, to prevent the race condition we were discussing?
> > > >> >> This would mean changing the semantics of the states.
> > > >> >>
> > > >> >> 3) I'm confused about mapred.job.restart.recover. Reading
> > > >> >> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says
> > > >> >> that the default value of this is true. So,
> > > >> >> if mapred.jobtracker.restart.recover (system config) is already enabled,
> > > >> >> is job recovery on by default? Also, does recover mean the job will start
> > > >> >> where it left off or is it just a plain restart?
> > > >> >>
> > > >> >> In summary, IMO allowing hadoop to recover jobs independently, bypassing
> > > >> >> Oozie, isn't trivial. It would have helped if the JT produced a
> > > >> >> notification when it comes online, so Oozie could retry after consuming
> > > >> >> it. But currently, notification only happens when a task completes.
> > > >> >>
> > > >> >> An alternate approach is to modify the semantics of START_MANUAL.
> > > >> >> Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and expects
> > > >> >> the user to resume it. We can change this and make Oozie retry the
> > > >> >> START_MANUAL actions at a configurable interval (~30 mins or some scheme
> > > >> >> like exponential backoff). Of course, this is bad as Oozie will keep
> > > >> >> polling hadoop at some interval, but manual resume of jobs that have
> > > >> >> faced transient errors will no longer be mandatory.
> > > >> >>
> > > >> >> --Virag
> > > >> >>
> > > >> >>
> > > >> >> On 8/6/13 4:38 PM, "Robert Kanter" <[email protected]> wrote:
> > > >> >>
> > > >> >> >If ActionCheckX is trying to retry, and the JT recovers the job, that
> > > >> >> >should be fine.  The "retry" is to simply try connecting to the JT to get
> > > >> >> >the status for the job.  If the user issues a "RESUME" for a START_MANUAL
> > > >> >> >job, then yes, Oozie will try to resubmit a new job for that action and
> > > >> >> >we'd have two of them if the JT also recovers it.
> > > >> >> >
> > > >> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand
> > > >> >> >precondition to check if the action already has a Job ID that is valid
> > > >> >> >(i.e. not unknown to the JT), and if so, fail the precondition check or
> > > >> >> >something similar?
> > > >> >> >
> > > >> >> >- Robert
> > > >> >> >
> > > >> >> >
> > > >> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <[email protected]>
> > > >> >> >wrote:
> > > >> >> >
> > > >> >> >> ActionCheckX first retries for a configurable amount of time and then
> > > >> >> >> sets the status to START_MANUAL.
> > > >> >> >> So, the problem might happen when the JT recovers the job at the same
> > > >> >> >> time that 1) ActionCheckX is trying to retry or 2) the user issues a
> > > >> >> >> "RESUME" for a START_MANUAL job.
> > > >> >> >> We have to ensure that this doesn't happen; otherwise we will have two
> > > >> >> >> hadoop jobs for the same action.
> > > >> >> >> The callback happens only when the task is completed, which might be too
> > > >> >> >> late. During that time, Oozie might have already submitted a new hadoop
> > > >> >> >> job for that wf action.
> > > >> >> >> So it doesn't seem straightforward to prevent Oozie from submitting a
> > > >> >> >> new job if the JT is already recovering the older one.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <[email protected]> wrote:
> > > >> >> >>
> > > >> >> >> >Yes, if the JT recovers the job, it uses the same ID.  If the JT comes
> > > >> >> >> >up quickly and recovers the job, Oozie continues working just fine
> > > >> >> >> >(without the ID swap issues discussed earlier).  When the JT takes
> > > >> >> >> >longer than the 10min ActionCheck interval, and the action is
> > > >> >> >> >START_MANUAL, that still needs to be figured out.
> > > >> >> >> >
> > > >> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it should
> > > >> >> >> >have the same behavior.  The only differences are that the name of the
> > > >> >> >> >property to enable recoverability on the server (not the job-level one)
> > > >> >> >> >is different, obviously because it doesn't have "jobtracker" in it, and
> > > >> >> >> >that it can also recover the completed tasks, which shouldn't be a
> > > >> >> >> >problem because the launcher jar has the one task.  I'll of course
> > > >> >> >> >double check this though.
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> >- Robert
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
> > > >> >> >> ><[email protected]> wrote:
> > > >> >> >> >
> > > >> >> >> >> Robert,
> > > >> >> >> >>     You will not get an unknown hadoop job if the JT has retry
> > > >> >> >> >> configured, right? What happens in that case? Especially, what happens
> > > >> >> >> >> when the Oozie retry happens when the JT comes up quickly?  Also, do
> > > >> >> >> >> you know what the behaviour is with Hadoop 2.x?
> > > >> >> >> >>
> > > >> >> >> >> Mayank,
> > > >> >> >> >>   OOZIE-1231 already has the changes to show the MapReduce job id in
> > > >> >> >> >> the Child job page to be consistent with other job types. The v1 API
> > > >> >> >> >> has the older behaviour with the map job url in externalId, while the
> > > >> >> >> >> v2 API has it in childjobids.  So there is a UI change but the v1 REST
> > > >> >> >> >> API has not changed. But OOZIE-1231 has not changed any code with
> > > >> >> >> >> respect to the id swap.
> > > >> >> >> >>
> > > >> >> >> >> Regards,
> > > >> >> >> >> Rohini
> > > >> >> >> >>
> > > >> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <[email protected]>
> > > >> >> >> >> wrote:
> > > >> >> >> >>
> > > >> >> >> >> > Ya, I saw a precondition failed message.
> > > >> >> >> >> >
> > > >> >> >> >> > I just tried out what happens when the job is SUSPENDED, the
> > > >> >> >> >> > action is START_MANUAL, and the JT recovers the hadoop job: it
> > > >> >> >> >> > doesn't continue the workflow.  It fails the
> > > >> >> >> >> > eagerVerifyPrecondition from CompletedActionXCommand because the
> > > >> >> >> >> > action isn't RUNNING.  Perhaps we should make the CallbackService
> > > >> >> >> >> > change the status in this situation?
> > > >> >> >> >> >
> > > >> >> >> >> > Just to clarify, the above only happens when the JT has been down
> > > >> >> >> >> > long enough that the ActionCheckXCommand (every 10min by default) +
> > > >> >> >> >> > the retries (3 x 1min) happen.  If it comes back sooner than that,
> > > >> >> >> >> > everything works fine.
> > > >> >> >> >> >
> > > >> >> >> >> > thanks
> > > >> >> >> >> > - Robert
> > > >> >> >> >> >
> > > >> >> >> >> >
> > > >> >> >> >> >
> > > >> >> >> >> >
> > > >> >> >> >> >
> > > >> >> >> >> >
> > > >> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <[email protected]>
> > > >> >> >> >> > wrote:
> > > >> >> >> >> >
> > > >> >> >> >> > > Oh..okay. Seems like RecoveryService queues the StartX command
> > > >> >> >> >> > > but the verifyPrecondition() fails as the wf job is
> > > >> >> >> >> > > SUSPENDED (please verify this from the logs).
> > > >> >> >> >> > >
> > > >> >> >> >> > > In that case, if Oozie is not auto-retrying and resubmitting,
> > > >> >> >> >> > > then it seems fair to have the JT recover the job.
> > > >> >> >> >> > > But if the JT recovers the job, can we make sure that the
> > > >> >> >> >> > > workflow job transitions to RUNNING from SUSPENDED and the wf
> > > >> >> >> >> > > action from START_MANUAL to RUNNING?
> > > >> >> >> >> > > It should not happen that the user resumes the job, which makes
> > > >> >> >> >> > > Oozie submit a new hadoop job, while the JT is also recovering
> > > >> >> >> >> > > the same job.
> > > >> >> >> >> > > Also, I think the error can still be considered transient from
> > > >> >> >> >> > > Oozie's perspective as it is temporary, depending on the state
> > > >> >> >> >> > > of the JT.
> > > >> >> >> >> > >
> > > >> >> >> >> > > Thanks,
> > > >> >> >> >> > > Virag
> > > >> >> >> >> > >
> > > >> >> >> >> > >
> > > >> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]> wrote:
> > > >> >> >> >> > >
> > > >> >> >> >> > > >Virag,
> > > >> >> >> >> > > >I just tested out killing the JT and waiting for the Checker
> > > >> >> >> >> > > >service to retry and give up: the action goes to START_MANUAL
> > > >> >> >> >> > > >and the job gets SUSPENDED.  I waited around long enough, but
> > > >> >> >> >> > > >the RecoveryService didn't do anything.  Does it kick in for
> > > >> >> >> >> > > >you?  As a side note, looking at the code, the RecoveryService
> > > >> >> >> >> > > >looks like it can handle START_MANUAL, END_MANUAL, and
> > > >> >> >> >> > > >USER_RETRY, which all sound like things the user should be
> > > >> >> >> >> > > >doing; is it correct that RecoveryService is handling these?
> > > >> >> >> >> > > >The Unknown Hadoop Job error happens when the JT comes back in
> > > >> >> >> >> > > >time because it won't know about the old ID if it's not
> > > >> >> >> >> > > >recovering jobs.  So, Oozie tries to ask it about a job that no
> > > >> >> >> >> > > >longer exists.  I'm not sure that this should be a transient
> > > >> >> >> >> > > >error because there's no way to determine if it's because the
> > > >> >> >> >> > > >JT restarted and Oozie should resubmit the job or if something
> > > >> >> >> >> > > >else happened.
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >Mayank,
> > > >> >> >> >> > > >That is a good point.  We could either make a v3 API or add an
> > > >> >> >> >> > > >oozie-site config to turn on/off the id swap behavior and keep
> > > >> >> >> >> > > >the v2 API.
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >thanks
> > > >> >> >> >> > > >- Robert
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <[email protected]>
> > > >> >> >> >> > > >wrote:
> > > >> >> >> >> > > >
> > > >> >> >> >> > > >> Robert,
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >> That's a break in backward compatibility. Until now, users
> > > >> >> >> >> > > >> have been used to clicking on the link to go to the MR page.
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >> Is there a better way to handle this?
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >> Thanks,
> > > >> >> >> >> > > >> Mayank
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]>
> > > >> >> >> >> > > >> wrote:
> > > >> >> >> >> > > >>
> > > >> >> >> >> > > >> > Mona,
> > > >> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just
> > > >> >> >> >> > > >> > retrying to connect to the JT (which is why when the JT
> > > >> >> >> >> > > >> > comes back up, Oozie can continue monitoring the hadoop job
> > > >> >> >> >> > > >> > if it still has the same ID); it doesn't try to submit the
> > > >> >> >> >> > > >> > job again as part of the "retry".
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> > Mayank,
> > > >> >> >> >> > > >> > We can put the ID for the actual job in the Child IDs tab
> > > >> >> >> >> > > >> > (like with Pig).
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> > - Robert
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]>
> > > >> >> >> >> > > >> > wrote:
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >> > > I agree, we should handle these two scenarios. I am ok
> > > >> >> >> >> > > >> > > with changing the launcher behavior for MR; however, if we
> > > >> >> >> >> > > >> > > remove the id swap, then how do we navigate to MR jobs
> > > >> >> >> >> > > >> > > from the UI as we do right now?
> > > >> >> >> >> > > >> > >
> > > >> >> >> >> > > >> > > Thanks,
> > > >> >> >> >> > > >> > > Mayank
> > > >> >> >> >> > > >> > >
> > > >> >> >> >> > > >> > >
> > > >> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]>
> > > >> >> >> >> > > >> > > wrote:
> > > >> >> >> >> > > >> > >
> > > >> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the
> > > >> >> >> >> > > >> > > > launcher recover to 0 and job to 1; then consider these
> > > >> >> >> >> > > >> > > > two scenarios:
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before
> > > >> >> >> >> > > >> > > > the launcher job actually launches the real job:
> > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we
> > > >> >> >> >> > > >> > > >        told it not to
> > > >> >> >> >> > > >> > > >      - The real job was never launched
> > > >> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as
> > > >> >> >> >> > > >> > > >        failed
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets
> > > >> >> >> >> > > >> > > > restarted before the Oozie server has a chance to swap
> > > >> >> >> >> > > >> > > > IDs (it's not an atomic operation):
> > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we
> > > >> >> >> >> > > >> > > >        told it not to
> > > >> >> >> >> > > >> > > >      - The real job will be recovered and finish
> > > >> >> >> >> > > >> > > >        successfully
> > > >> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even though
> > > >> >> >> >> > > >> > > >        the actual job succeeded because it didn't know
> > > >> >> >> >> > > >> > > >        about the ID swap
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > It would only work for the case where the JT gets
> > > >> >> >> >> > > >> > > > restarted after the ID swap occurs.
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > - Robert
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]>
> > > >> >> >> >> > > >> > > > wrote:
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > > > > Hi Robert,
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > +1 for Oozie to set the launcher to 1 and the jobs to
> > > >> >> >> >> > > >> > > > > 0 for recovery in all the cases except MR.
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > After the ID is swapped, Oozie only knows about the MR
> > > >> >> >> >> > > >> > > > > job, doesn't it? Then there should not be any problem.
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1, then
> > > >> >> >> >> > > >> > > > > the job will succeed in case of a JT restart.
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > Am I missing something?
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > Thanks,
> > > >> >> >> >> > > >> > > > > Mayank
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]>
> > > >> >> >> >> > > >> > > > > wrote:
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop Job"
> > > >> >> >> >> > > >> > > > > > error message because Oozie tries to look up the
> > > >> >> >> >> > > >> > > > > > Hadoop Job ID it already has, but the JT no longer
> > > >> >> >> >> > > >> > > > > > has that ID because it was restarted.  With JT
> > > >> >> >> >> > > >> > > > > > Recoverability turned on, it will restart the job
> > > >> >> >> >> > > >> > > > > > using the same ID, so Oozie continues just fine.
> > > >> >> >> >> > > >> > > > > >
> > > >> >> >> >> > > >> > > > > > - Robert
> > > >> >> >> >> > > >> > > > > >
> > > >> >> >> >> > > >> > > > > >
> > > >> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> > > >> >> >> >> > > >> > > > > > <[email protected]> wrote:
> > > >> >> >> >> > > >> > > > > >
> > > >> >> >> >> > > >> > > > > > > Wouldn't Oozie poll for the job status and decide
> > > >> >> >> >> > > >> > > > > > > that it has failed, and when the JT comes up launch
> > > >> >> >> >> > > >> > > > > > > another one if retry is configured?
> > > >> >> >> >> > > >> > > > > > >
> > > >> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
> > > >> >> >> >> > > >> > > > > > > wrote:
> > > >> >> >> >> > > >> > > > > > >
> > > >> >> >> >> > > >> > > > > > > > Hi,
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > > We looked into how to support Job Recoverability
> > > >> >> >> >> > > >> > > > > > > > (i.e. the JT is restarted and it wants to restart
> > > >> >> >> >> > > >> > > > > > > > the jobs that were running; similarly for YARN)
> > > >> >> >> >> > > >> > > > > > > > and have a pretty simple solution for all of the
> > > >> >> >> >> > > >> > > > > > > > action types except for MapReduce.  If we set
> > > >> >> >> >> > > >> > > > > > > > mapreduce.job.restart.recover=true for the
> > > >> >> >> >> > > >> > > > > > > > launcher job and
> > > >> >> >> >> > > >> > > > > > > > mapreduce.job.restart.recover=false for the jobs
> > > >> >> >> >> > > >> > > > > > > > launched by the launcher, then when the JT
> > > >> >> >> >> > > >> > > > > > > > restarts, it will recover the launcher job but
> > > >> >> >> >> > > >> > > > > > > > not the child jobs -- the launcher job will then
> > > >> >> >> >> > > >> > > > > > > > take care of relaunching the child jobs.
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > > For MapReduce, because of the optimization with
> > > >> >> >> >> > > >> > > > > > > > the id swap, this won't work.  It would be very
> > > >> >> >> >> > > >> > > > > > > > tricky, if it's even practical, to do something
> > > >> >> >> >> > > >> > > > > > > > similar for the MR action.  Instead, we think it
> > > >> >> >> >> > > >> > > > > > > > would be best if we simply remove the MR
> > > >> >> >> >> > > >> > > > > > > > optimization and make it just like the other
> > > >> >> >> >> > > >> > > > > > > > action types.  I know we normally don't want to
> > > >> >> >> >> > > >> > > > > > > > remove optimizations, but there are many
> > > >> >> >> >> > > >> > > > > > > > advantages in this case, and it's only saving a
> > > >> >> >> >> > > >> > > > > > > > single Map slot for MR jobs only.
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > > I've created OOZIE-1483
> > > >> >> >> >> > > >> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483>
> > > >> >> >> >> > > >> > > > > > > > with more details and should have a patch soon.
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > > Thoughts?
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > > > thanks
> > > >> >> >> >> > > >> > > > > > > > - Robert
> > > >> >> >> >> > > >> > > > > > > >
> > > >> >> >> >> > > >> > > > > > >
> > > >> >> >> >> > > >> > > > > >
> > > >> >> >> >> > > >> > > > >
> > > >> >> >> >> > > >> > > >
> > > >> >> >> >> > > >> > >
> > > >> >> >> >> > > >> >
> > > >> >> >> >> > > >>
> > > >> >> >> >> > >
> > > >> >> >> >> > >
> > > >> >> >> >> >
> > > >> >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >>
> > > >> >>
> > > >> >
> > > >> >
> > > >> >--
> > > >> >Alejandro
> > > >>
> > > >>
> > >
> > >
> >
>



-- 
Alejandro
