Re: Job Recoverability

Rohini Palaniswamy Thu, 08 Aug 2013 10:59:45 -0700

I haven't gone through the whole thread in detail yet, but looking at the
change mentioned in 1), the first thing that comes to mind is that it
might not work as expected if job recoverability is not turned on. We need
to consider that case. We cannot expect everyone to be on the latest
version of Hadoop with recoverability turned on. Job recoverability in
Hadoop is not fully mature yet and has not been tested well.
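
To make the case concrete, here is a toy model of the four combinations
discussed further down in this thread. It is only a sketch and not Oozie
code: the names resume_outcome and Outcome, and the status strings, are made
up for illustration. It assumes the JT was down long enough for the action
check retries to give up, and that the user then issues a RESUME.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    hadoop_jobs: int    # Hadoop jobs alive for the single workflow action
    action_status: str  # toy label, not an actual Oozie status constant


def resume_outcome(leave_running: bool, jt_recovers_jobs: bool) -> Outcome:
    """Toy model of what a user RESUME leads to after a long JT outage."""
    recovered = 1 if jt_recovers_jobs else 0
    if leave_running:
        # Proposed change 1): the action stays RUNNING and RESUME only
        # resumes polling of the old job ID; nothing is resubmitted.
        if jt_recovers_jobs:
            return Outcome(hadoop_jobs=1, action_status="RUNNING")
        # The JT no longer knows the old ID, so the action fails.
        return Outcome(hadoop_jobs=0, action_status="FAILED")
    # Current behaviour: the action is START_MANUAL and RESUME submits a
    # fresh job, in addition to whatever the JT recovered on its own.
    return Outcome(hadoop_jobs=recovered + 1, action_status="RUNNING")


for leave_running in (True, False):
    for jt_recovers_jobs in (True, False):
        print(leave_running, jt_recovers_jobs,
              resume_outcome(leave_running, jt_recovers_jobs))
```

The case I am worried about is leave_running=True with
jt_recovers_jobs=False: no duplicate job, but the action fails and the user
has to rerun it.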


On Thu, Aug 8, 2013 at 10:17 AM, Robert Kanter <[email protected]> wrote:

> So, does this sound good?
>
> 1) Create a JIRA to make the ActionCheckXCommand leave the action RUNNING
> instead of START_MANUAL and ResumeXCommand shouldn't resubmit the job
> 2) OOZIE-1483 to remove the MR optimization and set the launcher job to
> recover but not the real job
>
> The property to set a job to not recover wasn't added until Hadoop 1.2.0
> and we're using 1.1.1, so we'll also need:
> 3) Create a JIRA to bump up the Hadoop version to 1.2.x
>
> There's also a problem with the DistCp action where DistCp doesn't actually
> read the jobconf that Oozie prepares, and recoverability is enabled by
> default on all jobs, so we can't disable it for the DistCp action until
> DistCp is updated accordingly and we switch to a Hadoop release with that
> fix, so we'll also need:
> 4) A MAPREDUCE JIRA to make DistCp accept a jobconf
> In the meantime, this will have to be a known issue where if the JT is
> restarted with recoverability, you'll end up with two hadoop jobs running
> DistCp
>
> And what should we do about the external id being the launcher job instead
> of the real job after removing the MR optimization?
>
>
> thanks
> - Robert
>
>
>
>
> On Wed, Aug 7, 2013 at 8:45 PM, Virag Kothari <[email protected]> wrote:
>
> > Ahh, I forgot about OOZIE-994. My bad, I suggested that change. Everything
> > makes sense now. Thanks!
> >
> > On 8/7/13 7:38 PM, "Robert Kanter" <[email protected]> wrote:
> >
> > >The behavior where the ActionCheckXCommand calls handleNonTransient() with
> > >START_MANUAL when the JT can't be reached after the retries, and where a
> > >RESUME command resubmits the job, was something I did for OOZIE-994.  In
> > >hindsight, we shouldn't have done it that way.
> > >
> > >Yes, it will fail if job recovery is not enabled in the JT/RM; but I think
> > >this is the more correct behavior as this is something that the external
> > >system should be taking care of.
> > >
> > >- Robert
> > >
> > >
> > >On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <[email protected]>
> > wrote:
> > >
> > >> Alejandro, I agree that functionality would be preserved if the action is
> > >> left in RUNNING during a transient error.
> > >>
> > >> A few questions:
> > >>
> > >> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a
> > >> bug, do you know for what purpose it was introduced?
> > >>    I thought having START_MANUAL is a way to distinguish between Oozie
> > >> suspending the job due to a transient error and a user manually suspending
> > >> the job.
> > >>
> > >> 2) With no oozie retry on 'RESUME', jobs will fail if JT/RM recovery is
> > >> not enabled. And it seems that YARN recovery is still not there as
> > >> YARN-128 is not yet committed (not sure if I'm looking at the right JIRA).
> > >>   It's a concern for us as we ask users to RESUME their jobs after a hadoop
> > >> upgrade. Now they have to resume the wf and rerun the failed actions.
> > >>
> > >> Thanks,
> > >> Virag
> > >>
> > >>
> > >>
> > >> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <[email protected]> wrote:
> > >>
> > >> >[joining the party a bit late]
> > >> >
> > >> >I just had an offline call with RobertK who brought me up to speed.
> > >> >
> > >> >By design, Oozie will retry starting a workflow action ONLY if it couldn't
> > >> >start the WF action before. If Oozie started the WF action successfully,
> > >> >the WF action state goes into RUNNING, and from then on it is the
> > >> >responsibility of the external system running the action to recover it.
> > >> >Oozie will not attempt any recovery after that point.
> > >> >
> > >> >This means that with Hadoop (JT or YARN) job recovery, the launcher job
> > >> >will be recovered by Hadoop without any intervention from Oozie.
> > >> >
> > >> >It is clear that to have recovery for the MR action we need to get rid of
> > >> >the swap and just hold onto the MR launcher job as we do for the other
> > >> >actions.
> > >> >
> > >> >Now, on the whole discussion on the ActionCheckXCommand retries. We have a
> > >> >bug in the ActionCheckXCommand: in handleNonTransient() we should not
> > >> >change the status of the WF action to START_MANUAL, we should leave it in
> > >> >RUNNING. handleNonTransient() will suspend the WF job thus switching off
> > >> >action checks. On WF job resume, the action checks will start working
> > >> >again, and if Hadoop has job recovery, things will work fine. Else the WF
> > >> >action will fail because the launcher job is not known (the external
> > >> >system does not know how to recover jobs). Because we are resetting the
> > >> >status to START_MANUAL we are dialing back on the lifecycle of the action;
> > >> >that is incorrect and it creates the race condition that introduces 2 jobs.
> > >> >
> > >> >So again, Oozie is not responsible for recovering actions. With that
> > >> >assumption, fixing handleNonTransient() to leave the status in RUNNING
> > >> >and getting rid of the MR swap logic we should be good.
> > >> >
> > >> >Thoughts?
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <[email protected]>
> > >> >wrote:
> > >> >
> > >> >> Robert,
> > >> >>
> > >> >> I have been thinking on this for a while and have few more concerns
> > >>if
> > >> >>the
> > >> >> job retries are not streamlined through Oozie.
> > >> >>
> > >> >> 1) Till the JT finishes recovering the job, the wf job/wf action
> > >>status
> > >> >> will be SUSPENDED/START_MANUAL.
> > >> >> Isn't it misleading as the hadoop job is RUNNING while oozie
> > >>incorrectly
> > >> >> shows as SUSPENDED? Even if allow this, after the job completes,
> > >>what if
> > >> >> the callback is lost or oozie is down?
> > >> >> To prevent the job being in SUSPENDED forever, we need to hack our
> > >> >> services to pull SUSPENDED/START_MANUAL jobs from db and update
> their
> > >> >> status.
> > >> >>
> > >> >> 2) Should we allow failing of the user RESUME command if the action
> > >>is
> > >> >>in
> > >> >> START_MANUAL to prevent the race condition we were discussing?
> > >> >> This would mean changing the semantics of the states.
> > >> >>
> > >> >> 3) Confused on mapred.job.restart.recover. Reading
> > >> >> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it
> > >>says
> > >> >> that the default value of this is true. So,
> > >> >> if mapred.jobtracker.restart.recover (system config) is already
> > >>enabled,
> > >> >> is job recovery on by default? Also, does recover mean the job will
> > >> >>start
> > >> >> where it left from or is it just plain restart?
> > >> >>
> > >> >> In summary, IMO allowing hadoop to recover jobs independently
> > >>bypassing
> > >> >> Oozie isn't trivial. It would have helped if the JT produced
> > >> >>notification
> > >> >> when it comes online, so Oozie could retry after consuming those.
> But
> > >> >> currently, notification only happens when task completes.
> > >> >>
> > >> >> An alternate approach is to modify the semantics of START_MANUAL.
> > >> >> Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and
> > >> >>expects
> > >> >> the user to resume it. We can change this and make Oozie retry the
> > >> >> START_MANUAL actions at configurable interval (~30 mins or some
> > >>scheme
> > >> >> like exp back off). Of course, this is bad as oozie will keep
> > >> >>polling
> > >> >> hadoop at some interval but manual resume of jobs who have faced
> > >> >>transient
> > >> >> errors will no longer be mandatory.
> > >> >>
> > >> >> --Virag
> > >> >>
> > >> >>
> > >> >> On 8/6/13 4:38 PM, "Robert Kanter" <[email protected]> wrote:
> > >> >>
> > >> >> >If ActionCheckX is trying to retry, and the JT recovers the job,
> > >>that
> > >> >> >should be fine.  The "retry" is to simply try connecting to the JT
> > >>to
> > >> >>get
> > >> >> >the status for the job.  If the user issues a "RESUME" for a
> > >> >>START_MANUAL
> > >> >> >job, then yes, Oozie will try to resubmit a new job for that
> action
> > >>and
> > >> >> >we'd have two of them if the JT also recovers it.
> > >> >> >
> > >> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand
> > >> >> >precondition to check if the action already has a Job ID that is
> > >>valid
> > >> >> >(i.e. not unknown to the JT), then it fails the precondition check
> > >>or
> > >> >> >something similar?
> > >> >> >
> > >> >> >- Robert
> > >> >> >
> > >> >> >
> > >> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <
> [email protected]>
> > >> >> wrote:
> > >> >> >
> > >> >> >> ActionCheckX first retries for a configurable amount of time and
> > >>then
> > >> >> >> makes the status as START_MANUAL.
> > >> >> >> So, the problem might happen when JT recovers the job during the
> > >>same
> > >> >> >>time
> > >> >> >> when 1) ActionCheckX is trying to retry or the 2) user issues a
> > >> >>"RESUME"
> > >> >> >> for a start_manual job.
> > >> >> >> We have to ensure that this doesn't happen otherwise we will
> have
> > >>two
> > >> >> >> hadoop jobs for the same action.
> > >> >> >> The callback happens only when the task is completed which might
> > >>be
> > >> >>too
> > >> >> >> late. During that time, Oozie might have already submitted a new
> > >> >>hadoop
> > >> >> >> job for that wf action.
> > >> >> >> So it doesn't seem straightforward to prevent Oozie to submit a
> > >>new
> > >> >>job
> > >> >> >>if
> > >> >> >> the JT is already recovering the older one.
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <[email protected]>
> wrote:
> > >> >> >>
> > >> >> >> >Yes, if JT recovers the job, it uses the same ID.  If the JT
> > >>comes
> > >> >>up
> > >> >> >> >quickly and recovers the job, Oozie continues working just fine
> > >> >> >>(without
> > >> >> >> >the ID swap issues discussed earlier).  When the JT takes
> longer
> > >> >>than
> > >> >> >>the
> > >> >> >> >10min ActionCheck interval, and the action is START_MANUAL,
> that
> > >> >>still
> > >> >> >> >needs to be figured out.
> > >> >> >> >
> > >> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it
> > >> >>should
> > >> >> >>have
> > >> >> >> >the same behavior.  The only differences are that the name of
> the
> > >> >> >>property
> > >> >> >> >to enable recoverability on the server (not the job-level one)
> is
> > >> >> >> >different
> > >> >> >> >obviously because it doesn't have "jobtracker" in it and it can
> > >>also
> > >> >> >> >recover the completed tasks, which shouldn't be a problem
> because
> > >> >>the
> > >> >> >> >launcher jar has the one task.  I'll of course double check
> this
> > >> >> >>though.
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >- Robert
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
> > >> >> >> ><[email protected]>wrote:
> > >> >> >> >
> > >> >> >> >> Robert,
> > >> >> >> >>     You will not get an unknown hadoop job if JT has retry
> > >> >>configured
> > >> >> >> >>right?
> > >> >> >> >> What happens in that case? Especially what happens when Oozie
> > >> >>retry
> > >> >> >> >>happens
> > >> >> >> >> when JT comes up quickly?  Also do you know what is the
> > >>behaviour
> > >> >> >>with
> > >> >> >> >> Hadoop 2.x ?
> > >> >> >> >>
> > >> >> >> >> Mayank,
> > >> >> >> >>   OOZIE-1231 already has the changes to show Mapreduce job id
> > >>in
> > >> >>the
> > >> >> >> >>Child
> > >> >> >> >> job page to be consistent with other job types. The v1 API
> has
> > >>the
> > >> >> >>older
> > >> >> >> >> behaviour with map job url in externalId, while v2 API has it
> > >>in
> > >> >> >> >> childjobids.  So there is a UI change but v1 REST API has not
> > >> >> >>changed.
> > >> >> >> >>But
> > >> >> >> >> OOZIE-1231 has not changed any code with respect to id swap.
> > >> >> >> >>
> > >> >> >> >> Regards,
> > >> >> >> >> Rohini
> > >> >> >> >>
> > >> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter
> > >> >><[email protected]>
> > >> >> >> >> wrote:
> > >> >> >> >>
> > >> >> >> >> > Ya, I saw a precondition failed message.
> > >> >> >> >> >
> > >> >> >> >> > I just tried out what happens when the job is SUSPENDED,
> the
> > >> >> >>action is
> > >> >> >> >> > START_MANUAL, and the JT recovers the hadoop job: It
> doesn't
> > >> >> >>continue
> > >> >> >> >>the
> > >> >> >> >> > workflow.  It fails the eagerVerifyPrecondition from
> > >> >> >> >> > CompletedActionXCommand because the action isn't RUNNING.
> > >> >>Perhaps
> > >> >> >>we
> > >> >> >> >> > should make the CallbackService change the status in this
> > >> >> >>situation?
> > >> >> >> >> >
> > >> >> >> >> > Just to clarify, the above only happens when the JT has
> been
> > >> >>down
> > >> >> >>long
> > >> >> >> >> > enough that the ActionCheckXCommand (every 10min by
> default)
> > >>+
> > >> >>the
> > >> >> >> >> retries
> > >> >> >> >> > (3 x 1min) happen.  If it comes back sooner than that,
> > >> >>everything
> > >> >> >> >>works
> > >> >> >> >> > fine.
> > >> >> >> >> >
> > >> >> >> >> > thanks
> > >> >> >> >> > - Robert
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari
> > >> >><[email protected]
> > >> >> >
> > >> >> >> >> wrote:
> > >> >> >> >> >
> > >> >> >> >> > > Oh..okay. Seems like RecoveryService queues the StartX
> > >>command
> > >> >> >>but
> > >> >> >> >>the
> > >> >> >> >> > > verifyPrecondition() fails as the wf job is
> > >> >> >> >> > > Suspended (Plz verify this from logs).
> > >> >> >> >> > >
> > >> >> >> >> > > In that case, if Oozie is not auto-retrying and
> > >>resubmitting,
> > >> >> >>then
> > >> >> >> >>it
> > >> >> >> >> > > seems fair to have the JT recover the job.
> > >> >> >> >> > > But if JT recovers the job, can we make sure that the
> > >>workflow
> > >> >> >>job
> > >> >> >> >> > > transits to RUNNING from SUSPENDED and wf action from
> > >> >> >>START_MANUAL
> > >> >> >> >>to
> > >> >> >> >> > > RUNNING?
> > >> >> >> >> > > It should not happen that the user resumes the job which
> > >>makes
> > >> >> >>Oozie
> > >> >> >> >> > > submit a new hadoop job while the JT is also recovering
> the
> > >> >>same
> > >> >> >> >>job.
> > >> >> >> >> > > Also, I think the error can still be considered transient
> > >>from
> > >> >> >>Oozie
> > >> >> >> >> > > perspective as it is temporary depending on state of JT.
> > >> >> >> >> > >
> > >> >> >> >> > > Thanks,
> > >> >> >> >> > > Virag
> > >> >> >> >> > >
> > >> >> >> >> > >
> > >> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]
> >
> > >> >>wrote:
> > >> >> >> >> > >
> > >> >> >> >> > > >Virag,
> > >> >> >> >> > > >I just tested out killing the JT and waiting for the
> > >>Checker
> > >> >> >> >>service
> > >> >> >> >> to
> > >> >> >> >> > > >retry and give up: the action goes to START_MANUAL and
> the
> > >> >>job
> > >> >> >>gets
> > >> >> >> >> > > >SUSPENDED.  I waited around long enough, but the
> > >> >>RecoveryService
> > >> >> >> >> didn't
> > >> >> >> >> > do
> > >> >> >> >> > > >anything.  Does it kick in for you?  As a side note,
> > >>looking
> > >> >>at
> > >> >> >>the
> > >> >> >> >> > code,
> > >> >> >> >> > > >the RecoveryService looks like it can handle
> START_MANUAL,
> > >> >> >> >>END_MANUAL,
> > >> >> >> >> > and
> > >> >> >> >> > > >USER_RETRY, which all sound like things the user should
> be
> > >> >> >>doing;
> > >> >> >> >>is
> > >> >> >> >> it
> > >> >> >> >> > > >correct that RecoveryService is handling these?
> > >> >> >> >> > > >The Unknown Hadoop Job error happens when the JT comes
> > >>back
> > >> >>in
> > >> >> >>time
> > >> >> >> >> > > >because
> > >> >> >> >> > > >it won't know about the old ID if it's not recovering
> jobs.
> > >> >>So,
> > >> >> >> >>Oozie
> > >> >> >> >> > > >tries
> > >> >> >> >> > > >to ask it about a job that no longer exists.  I'm not
> sure
> > >> >>that
> > >> >> >> >>this
> > >> >> >> >> > > >should
> > >> >> >> >> > > >be a transient error because there's no way to determine
> > >>if
> > >> >>its
> > >> >> >> >> because
> > >> >> >> >> > > >the
> > >> >> >> >> > > >JT restarted and Oozie should resubmit the job or if
> > >> >>something
> > >> >> >>else
> > >> >> >> >> > > >happened.
> > >> >> >> >> > > >
> > >> >> >> >> > > >Mayank,
> > >> >> >> >> > > >That is a good point.  We could either make a v3 API or
> > >>add
> > >> >>an
> > >> >> >> >> > oozie-site
> > >> >> >> >> > > >config to turn on/off the id swap behavior and keep the
> v2
> > >> >>API.
> > >> >> >> >> > > >
> > >> >> >> >> > > >thanks
> > >> >> >> >> > > >- Robert
> > >> >> >> >> > > >
> > >> >> >> >> > > >
> > >> >> >> >> > > >
> > >> >> >> >> > > >
> > >> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal
> > >> >> >><[email protected]>
> > >> >> >> >> > wrote:
> > >> >> >> >> > > >
> > >> >> >> >> > > >> Robert,
> > >> >> >> >> > > >>
> > >> >> >> >> > > >> That's a break in backward compatibility. Till now user
> > >>are
> > >> >> >>used
> > >> >> >> >>to
> > >> >> >> >> > > >>click on
> > >> >> >> >> > > >> to link to go to MR page.
> > >> >> >> >> > > >>
> > >> >> >> >> > > >> Is there a better way to handle this?
> > >> >> >> >> > > >>
> > >> >> >> >> > > >> Thanks,
> > >> >> >> >> > > >> Mayank
> > >> >> >> >> > > >>
> > >> >> >> >> > > >>
> > >> >> >> >> > > >>
> > >> >> >> >> > > >>
> > >> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <
> > >> >> >> >> [email protected]>
> > >> >> >> >> > > >> wrote:
> > >> >> >> >> > > >>
> > >> >> >> >> > > >> > Mona,
> > >> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing
> > >>is
> > >> >>just
> > >> >> >> >> > retrying
> > >> >> >> >> > > >>to
> > >> >> >> >> > > >> > connect to the JT (which is why when the JT comes
> back
> > >> >>up,
> > >> >> >> >>Oozie
> > >> >> >> >> > > >> > can continue monitoring the hadoop job if it still
> has
> > >> >>the
> > >> >> >>same
> > >> >> >> >> ID);
> > >> >> >> >> > > >>it
> > >> >> >> >> > > >> > doesn't try to submit the job again as part of the
> > >> >>"retry".
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> > Mayank,
> > >> >> >> >> > > >> > We can put the ID for the actual job in the Child
> IDs
> > >>tab
> > >> >> >>(like
> > >> >> >> >> with
> > >> >> >> >> > > >> Pig).
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> > - Robert
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal
> > >> >> >> >><[email protected]
> > >> >> >> >> >
> > >> >> >> >> > > >> wrote:
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >> > > I agree , we should handle these two scenarios, I
> > >>am ok
> > >> >> >>with
> > >> >> >> >> > > >>changing
> > >> >> >> >> > > >> the
> > >> >> >> >> > > >> > > launcher behavior for MR however if we remove the
> id
> > >> >>swap
> > >> >> >> >>then
> > >> >> >> >> how
> > >> >> >> >> > > >>we
> > >> >> >> >> > > >> > > navigate to MR jobs from UI as we do right now?
> > >> >> >> >> > > >> > >
> > >> >> >> >> > > >> > > Thanks,
> > >> >> >> >> > > >> > > Mayank
> > >> >> >> >> > > >> > >
> > >> >> >> >> > > >> > >
> > >> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter
> > >> >> >> >> > > >><[email protected]>
> > >> >> >> >> > > >> > > wrote:
> > >> >> >> >> > > >> > >
> > >> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but
> > >>set
> > >> >>the
> > >> >> >> >> launcher
> > >> >> >> >> > > >> > recover
> > >> >> >> >> > > >> > > to
> > >> >> >> >> > > >> > > > 0 and job to 1; then consider these two
> scenarios:
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but
> > >> >>before
> > >> >> >>the
> > >> >> >> >> > > >>launcher
> > >> >> >> >> > > >> > job
> > >> >> >> >> > > >> > > > actually launches the real job:
> > >> >> >> >> > > >> > > >      - The launcher job won't be recovered
> > >>because we
> > >> >> >>told
> > >> >> >> >>it
> > >> >> >> >> > not
> > >> >> >> >> > > >>to
> > >> >> >> >> > > >> > > >      - The real job was never launched
> > >> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks
> > >>it
> > >> >>as
> > >> >> >> >>failed
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT
> gets
> > >> >> >>restarted
> > >> >> >> >> > before
> > >> >> >> >> > > >> the
> > >> >> >> >> > > >> > > > Oozie server has a chance to swap IDs (it's not
> an
> > >> >>atomic
> > >> >> >> >> > > >>operation):
> > >> >> >> >> > > >> > > >      - The launcher job won't be recovered
> > >>because we
> > >> >> >>told
> > >> >> >> >>it
> > >> >> >> >> > not
> > >> >> >> >> > > >>to
> > >> >> >> >> > > >> > > >      - The real job will be recovered and finish
> > >> >> >> >>successfully
> > >> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even
> > >> >>though
> > >> >> >>the
> > >> >> >> >> > actual
> > >> >> >> >> > > >>job
> > >> >> >> >> > > >> > > > succeeded because it didn't know about the ID
> swap
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > It would only work for the case where the JT
> gets
> > >> >> >>restarted
> > >> >> >> >> > after
> > >> >> >> >> > > >>the
> > >> >> >> >> > > >> > ID
> > >> >> >> >> > > >> > > > swap occurs.
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > - Robert
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <
> > >> >> >> >> > [email protected]
> > >> >> >> >> > > >
> > >> >> >> >> > > >> > > wrote:
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > > > > Hi Robert,
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > +1 for oozie to set launcher to 1 and 0 to
> jobs
> > >>for
> > >> >> >> >>recovery
> > >> >> >> >> > in
> > >> >> >> >> > > >>all
> > >> >> >> >> > > >> > the
> > >> >> >> >> > > >> > > > > cases except MR.
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > As after Id swapped Oozie only know about MR
> job
> > >> >>isn't
> > >> >> >> >>it?
> > >> >> >> >> > then
> > >> >> >> >> > > >> there
> > >> >> >> >> > > >> > > > > should not be any problem.
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to
> 1
> > >> >>then
> > >> >> >>job
> > >> >> >> >> will
> > >> >> >> >> > be
> > >> >> >> >> > > >> > > succeeded
> > >> >> >> >> > > >> > > > > in case of JT restart.
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > Am I missing something?
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > Thanks,
> > >> >> >> >> > > >> > > > > Mayank
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter
> <
> > >> >> >> >> > > >> [email protected]>
> > >> >> >> >> > > >> > > > > wrote:
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > > > > I think you usually just get the "Unknown
> > >>Hadoop
> > >> >> >>Job"
> > >> >> >> >> error
> > >> >> >> >> > > >> message
> > >> >> >> >> > > >> > > > > because
> > >> >> >> >> > > >> > > > > > Oozie tries to look up the Hadoop Job ID it
> > >> >>already
> > >> >> >> >>has,
> > >> >> >> >> but
> > >> >> >> >> > > >>the
> > >> >> >> >> > > >> JT
> > >> >> >> >> > > >> > > no
> > >> >> >> >> > > >> > > > > > longer has that ID because it was restarted.
> > >> >>With
> > >> >> >>JT
> > >> >> >> >> > > >> > Recoverability
> > >> >> >> >> > > >> > > > > turned
> > >> >> >> >> > > >> > > > > > on, it will restart the job using the same
> > >>ID, so
> > >> >> >>Oozie
> > >> >> >> >> > > >>continues
> > >> >> >> >> > > >> > > just
> > >> >> >> >> > > >> > > > > > fine.
> > >> >> >> >> > > >> > > > > >
> > >> >> >> >> > > >> > > > > > - Robert
> > >> >> >> >> > > >> > > > > >
> > >> >> >> >> > > >> > > > > >
> > >> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini
> > >> >>Palaniswamy
> > >> >> >> >> > > >> > > > > > <[email protected]>wrote:
> > >> >> >> >> > > >> > > > > >
> > >> >> >> >> > > >> > > > > > > Wouldn't oozie poll for the job status and
> > >> >>decide
> > >> >> >> >>that
> > >> >> >> >> it
> > >> >> >> >> > > >>has
> > >> >> >> >> > > >> > > failed
> > >> >> >> >> > > >> > > > > and
> > >> >> >> >> > > >> > > > > > > when JT comes up launch another one if
> > >>retry is
> > >> >> >> >> > configured?
> > >> >> >> >> > > >> > > > > > >
> > >> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert
> > >>Kanter <
> > >> >> >> >> > > >> > > [email protected]>
> > >> >> >> >> > > >> > > > > > > wrote:
> > >> >> >> >> > > >> > > > > > >
> > >> >> >> >> > > >> > > > > > > > Hi,
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > > We looked into how to support Job
> > >> >>Recoverability
> > >> >> >> >>(i.e.
> > >> >> >> >> > > >>the JT
> > >> >> >> >> > > >> > is
> > >> >> >> >> > > >> > > > > > > restarted
> > >> >> >> >> > > >> > > > > > > > and it wants to restart the jobs that
> were
> > >> >> >>running;
> > >> >> >> >> > > >>similarly
> > >> >> >> >> > > >> > for
> > >> >> >> >> > > >> > > > > YARN)
> > >> >> >> >> > > >> > > > > > > and
> > >> >> >> >> > > >> > > > > > > > have a pretty simple solution for all of
> > >>the
> > >> >> >>action
> > >> >> >> >> > types
> > >> >> >> >> > > >> > except
> > >> >> >> >> > > >> > > > for
> > >> >> >> >> > > >> > > > > > > > MapReduce.  If we set
> > >> >> >> >> mapreduce.job.restart.recover=true
> > >> >> >> >> > > >>for
> > >> >> >> >> > > >> > the
> > >> >> >> >> > > >> > > > > > launcher
> > >> >> >> >> > > >> > > > > > > > job and
> > >>mapreduce.job.restart.recover=false
> > >> >>for
> > >> >> >>the
> > >> >> >> >> jobs
> > >> >> >> >> > > >> > launched
> > >> >> >> >> > > >> > > > by
> > >> >> >> >> > > >> > > > > > the
> > >> >> >> >> > > >> > > > > > > > launcher, then when the JT restarts, it
> > >>will
> > >> >> >> >>recover
> > >> >> >> >> the
> > >> >> >> >> > > >> > launcher
> > >> >> >> >> > > >> > > > job
> > >> >> >> >> > > >> > > > > > but
> > >> >> >> >> > > >> > > > > > > > not the child jobs -- the launcher job
> > >>will
> > >> >>then
> > >> >> >> >>take
> > >> >> >> >> > > >>care of
> > >> >> >> >> > > >> > > > > > relaunching
> > >> >> >> >> > > >> > > > > > > > the child jobs.
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > > For MapReduce, because of the
> optimization
> > >> >>with
> > >> >> >> >>the id
> > >> >> >> >> > > >>swap,
> > >> >> >> >> > > >> > this
> > >> >> >> >> > > >> > > > > won't
> > >> >> >> >> > > >> > > > > > > > work.  It would be very tricky, if it's
> > >>even
> > >> >> >> >> practical,
> > >> >> >> >> > > >>to do
> > >> >> >> >> > > >> > > > > something
> > >> >> >> >> > > >> > > > > > > > similar for the MR action.  Instead, we
> > >> >>think it
> > >> >> >> >>would
> > >> >> >> >> > be
> > >> >> >> >> > > >> best
> > >> >> >> >> > > >> > if
> > >> >> >> >> > > >> > > > we
> > >> >> >> >> > > >> > > > > > > simply
> > >> >> >> >> > > >> > > > > > > > remove the MR optimization and make it
> > >>just
> > >> >>like
> > >> >> >> >>the
> > >> >> >> >> > other
> > >> >> >> >> > > >> > action
> > >> >> >> >> > > >> > > > > > types.
> > >> >> >> >> > > >> > > > > > >  I
> > >> >> >> >> > > >> > > > > > > > know we normally don't want to remove
> > >> >> >> >>optimizations,
> > >> >> >> >> but
> > >> >> >> >> > > >> there
> > >> >> >> >> > > >> > > are
> > >> >> >> >> > > >> > > > > many
> > >> >> >> >> > > >> > > > > > > > advantages in this case, and it's only
> > >> >>saving a
> > >> >> >> >>single
> > >> >> >> >> > Map
> > >> >> >> >> > > >> slot
> > >> >> >> >> > > >> > > for
> > >> >> >> >> > > >> > > > > MR
> > >> >> >> >> > > >> > > > > > > jobs
> > >> >> >> >> > > >> > > > > > > > only.
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > > I've created OOZIE-1483 <
> > >> >> >> >> > > >> > > > > > >
> > >> >>https://issues.apache.org/jira/browse/OOZIE-1483>
> > >> >> >> >> > > >> > > > > > > > with
> > >> >> >> >> > > >> > > > > > > > more details and should have a patch
> soon.
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > > Thoughts?
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > > > thanks
> > >> >> >> >> > > >> > > > > > > > - Robert
> > >> >> >> >> > > >> > > > > > > >
> > >> >> >> >> > > >> > > > > > >
> > >> >> >> >> > > >> > > > > >
> > >> >> >> >> > > >> > > > >
> > >> >> >> >> > > >> > > >
> > >> >> >> >> > > >> > >
> > >> >> >> >> > > >> >
> > >> >> >> >> > > >>
> > >> >> >> >> > >
> > >> >> >> >> > >
> > >> >> >> >> >
> > >> >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> >--
> > >> >Alejandro
> > >>
> > >>
> >
> >
>
