Re: Job Recoverability

Virag Kothari Wed, 07 Aug 2013 20:47:27 -0700

Ahh, I forgot about OOZIE-994. My bad, I suggested that change. Everything
makes sense now. Thanks!


On 8/7/13 7:38 PM, "Robert Kanter" <[email protected]> wrote:

>The behavior where ActionCheckXCommand calls handleNonTransient() with
>START_MANUAL when the JT can't be reached after the retries, so that a
>RESUME command resubmits the job, was something I did for OOZIE-994.  In
>hindsight, we shouldn't have done it that way.
>
>Yes, it will fail if job recovery is not enabled in the JT/RM; but I think
>this is the more correct behavior as this is something that the external
>system should be taking care of.
>
>- Robert
>
>
>On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <[email protected]> wrote:
>
>> Alejandro, I agree that functionality would be preserved if the action is
>> left in RUNNING during a transient error.
>>
>> A few questions:
>>
>> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a
>> bug, do you know for what purpose it was introduced?
>>    I thought having START_MANUAL was a way to distinguish between Oozie
>> suspending a job due to a transient error and a user manually suspending
>> the job.
>>
>> 2) With no Oozie retry on 'RESUME', jobs will fail if JT/RM recovery is
>> not enabled. And it seems that YARN recovery is still not there, as
>> YARN-128 is not yet committed (not sure if I'm looking at the right JIRA).
>>   It's a concern for us, as we ask users to RESUME their jobs after a
>> Hadoop upgrade. Now they have to resume the wf and rerun the failed actions.
>>
>> Thanks,
>> Virag
>>
>>
>>
>> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <[email protected]> wrote:
>>
>> >[joining the party a bit late]
>> >
>> >I just had an offline call with RobertK who brought me up to speed.
>> >
>> >By design, Oozie will retry starting a workflow action ONLY if it couldn't
>> >start the WF action before. If Oozie started the WF action successfully,
>> >the WF action state goes into RUNNING, and from then on it is the
>> >responsibility of the external system running the action to recover it.
>> >Oozie will not attempt any recovery after that point.
>> >
>> >This means that with Hadoop (JT or YARN) job recovery, the launcher job
>> >will be recovered by Hadoop without any intervention from Oozie.
>> >
>> >It is clear that to have recovery for the MR action we need to get rid of
>> >the swap and just hold onto the MR launcher job as we do for the other
>> >actions.
>> >
>> >Now, on the whole discussion on the ActionCheckXCommand retries. We have a
>> >bug in the ActionCheckXCommand: in handleNonTransient() we should not
>> >change the status of the WF action to START_MANUAL, we should leave it in
>> >RUNNING. handleNonTransient() will suspend the WF job, thus switching off
>> >action checks. On WF job resume, the action checks will start working
>> >again, and if Hadoop has job recovery, things will work fine. Otherwise the
>> >WF action will fail because the launcher job is not known (the external
>> >system does not know how to recover jobs). Because we are resetting the
>> >status to START_MANUAL we are dialing back the lifecycle of the action;
>> >that is incorrect, and it creates the race condition that introduces 2 jobs.
>> >
>> >So again, Oozie is not responsible for recovering actions. With that
>> >assumption, fixing handleNonTransient() to leave the status in RUNNING
>> >and getting rid of the RM swap logic, we should be good.
>> >
>> >Thoughts?
>> >
>> >
>> >
>> >
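
A minimal sketch of the lifecycle argument made in the message above, written as a toy model rather than Oozie code. Every name here (ResumeSketch, ToyJobTracker, jobsAfterResume) is hypothetical, and the model assumes that a RESUME on a START_MANUAL action resubmits the launcher while a RESUME on a RUNNING action only looks up the existing job ID.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only; none of these names are real Oozie or Hadoop classes. */
final class ResumeSketch {
    enum ActionStatus { RUNNING, START_MANUAL }

    /** Hypothetical stand-in for the JT: the set of job IDs it currently knows. */
    static final class ToyJobTracker {
        final Set<String> knownJobs = new HashSet<>();
        private int counter = 0;

        String submit() {
            String id = "job_" + (++counter);
            knownJobs.add(id);
            return id;
        }
    }

    /**
     * Simulates: launcher submitted, JT restarts, WF job is suspended with the
     * given action status, user issues RESUME. Returns the number of Hadoop
     * jobs that exist for the action afterwards (0 means the action fails).
     */
    static int jobsAfterResume(ActionStatus statusOnSuspend, boolean jtRecoversJobs) {
        ToyJobTracker jt = new ToyJobTracker();
        String launcherId = jt.submit();

        // JT restart: the old job survives only if job recovery is enabled.
        if (!jtRecoversJobs) {
            jt.knownJobs.clear();
        }

        if (statusOnSuspend == ActionStatus.START_MANUAL) {
            // Assumed behavior: the action is started again, so a new job is submitted.
            jt.submit();
        } else if (!jt.knownJobs.contains(launcherId)) {
            // RUNNING: only the existing ID is checked; an unknown ID fails the action.
            return 0;
        }
        return jt.knownJobs.size();
    }

    public static void main(String[] args) {
        for (ActionStatus s : ActionStatus.values()) {
            for (boolean recover : new boolean[] {true, false}) {
                System.out.println(s + ", JT recovery=" + recover
                        + " -> jobs=" + jobsAfterResume(s, recover));
            }
        }
    }
}
```

Under these assumptions, only the START_MANUAL case combined with JT recovery ends up with two jobs for one action, which is the race described above; leaving the action in RUNNING gives either one job or a failed action.
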
>> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <[email protected]>
>> >wrote:
>> >
>> >> Robert,
>> >>
>> >> I have been thinking on this for a while and have a few more concerns if
>> >> the job retries are not streamlined through Oozie.
>> >>
>> >> 1) Till the JT finishes recovering the job, the wf job/wf action status
>> >> will be SUSPENDED/START_MANUAL.
>> >> Isn't it misleading, as the hadoop job is RUNNING while oozie incorrectly
>> >> shows it as SUSPENDED? Even if we allow this, after the job completes,
>> >> what if the callback is lost or oozie is down?
>> >> To prevent the job being in SUSPENDED forever, we need to hack our
>> >> services to pull SUSPENDED/START_MANUAL jobs from db and update their
>> >> status.
>> >>
>> >> 2) Should we allow failing of the user RESUME command if the action is
>> >> in START_MANUAL, to prevent the race condition we were discussing?
>> >> This would mean changing the semantics of the states.
>> >>
>> >> 3) Confused on mapred.job.restart.recover. Reading
>> >> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says
>> >> that the default value of this is true. So,
>> >> if mapred.jobtracker.restart.recover (system config) is already enabled,
>> >> is job recovery on by default? Also, does recover mean the job will
>> >> start where it left off, or is it just a plain restart?
>> >>
>> >> In summary, IMO allowing hadoop to recover jobs independently, bypassing
>> >> Oozie, isn't trivial. It would have helped if the JT produced a
>> >> notification when it comes online, so Oozie could retry after consuming
>> >> those. But currently, notification only happens when a task completes.
>> >>
>> >> An alternate approach is to modify the semantics of START_MANUAL.
>> >> Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and
>> >> expects the user to resume it. We can change this and make Oozie retry the
>> >> START_MANUAL actions at a configurable interval (~30 mins or some scheme
>> >> like exponential backoff). Of course, this is bad as oozie will keep
>> >> polling hadoop at some interval, but manual resume of jobs that have faced
>> >> transient errors will no longer be mandatory.
>> >>
>> >> --Virag
>> >>
>> >>
>> >> On 8/6/13 4:38 PM, "Robert Kanter" <[email protected]> wrote:
>> >>
>> >> >If ActionCheckX is trying to retry, and the JT recovers the job, that
>> >> >should be fine.  The "retry" is to simply try connecting to the JT to
>> >> >get the status for the job.  If the user issues a "RESUME" for a
>> >> >START_MANUAL job, then yes, Oozie will try to resubmit a new job for that
>> >> >action and we'd have two of them if the JT also recovers it.
>> >> >
>> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand
>> >> >precondition to check if the action already has a Job ID that is valid
>> >> >(i.e. not unknown to the JT), and in that case fail the precondition
>> >> >check or something similar?
>> >> >
>> >> >- Robert
>> >> >
>> >> >
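
The precondition idea floated in the message above could look roughly like the following. This is only a sketch: ResubmitPreconditionSketch and maySubmitNewJob are hypothetical names, and the set of job IDs known to the JT is passed in directly instead of being queried from a real JobTracker.

```java
import java.util.Set;

/** Hypothetical helper; not part of ActionStartXCommand or ResumeActionXCommand. */
final class ResubmitPreconditionSketch {

    /**
     * Returns true when submitting a new Hadoop job for the action is safe:
     * either the action never received a job ID, or the JT no longer knows
     * the ID it had. A job ID that the JT still knows fails the check.
     */
    static boolean maySubmitNewJob(String existingJobId, Set<String> jobsKnownToJt) {
        if (existingJobId == null) {
            return true; // the action was never started
        }
        return !jobsKnownToJt.contains(existingJobId);
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("job_1");
        System.out.println(maySubmitNewJob("job_1", known)); // recovered job: do not resubmit
        System.out.println(maySubmitNewJob("job_9", known)); // unknown to the JT: resubmit allowed
    }
}
```

A check like this does not close the window in which the JT is still in the middle of recovering the job, so the timing concern raised in this thread would remain.
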
>> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <[email protected]>
>> >> wrote:
>> >> >
>> >> >> ActionCheckX first retries for a configurable amount of time and then
>> >> >> sets the status to START_MANUAL.
>> >> >> So, the problem might happen when the JT recovers the job at the same
>> >> >> time as 1) ActionCheckX is trying to retry or 2) the user issues a
>> >> >> "RESUME" for a START_MANUAL job.
>> >> >> We have to ensure that this doesn't happen, otherwise we will have two
>> >> >> hadoop jobs for the same action.
>> >> >> The callback happens only when the task is completed, which might be
>> >> >> too late. During that time, Oozie might have already submitted a new
>> >> >> hadoop job for that wf action.
>> >> >> So it doesn't seem straightforward to prevent Oozie from submitting a
>> >> >> new job if the JT is already recovering the older one.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <[email protected]> wrote:
>> >> >>
>> >> >> >Yes, if JT recovers the job, it uses the same ID.  If the JT comes up
>> >> >> >quickly and recovers the job, Oozie continues working just fine
>> >> >> >(without the ID swap issues discussed earlier).  When the JT takes
>> >> >> >longer than the 10min ActionCheck interval, and the action is
>> >> >> >START_MANUAL, that still needs to be figured out.
>> >> >> >
>> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it should
>> >> >> >have the same behavior.  The only differences are that the name of the
>> >> >> >property to enable recoverability on the server (not the job-level one)
>> >> >> >is different, obviously, because it doesn't have "jobtracker" in it, and
>> >> >> >it can also recover the completed tasks, which shouldn't be a problem
>> >> >> >because the launcher jar has the one task.  I'll of course double check
>> >> >> >this though.
>> >> >> >
>> >> >> >
>> >> >> >- Robert
>> >> >> >
>> >> >> >
>> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
>> >> >> ><[email protected]>wrote:
>> >> >> >
>> >> >> >> Robert,
>> >> >> >>     You will not get an unknown hadoop job if the JT has retry
>> >> >> >> configured, right? What happens in that case? Especially, what happens
>> >> >> >> when the Oozie retry happens when the JT comes up quickly?  Also, do
>> >> >> >> you know what the behaviour is with Hadoop 2.x?
>> >> >> >>
>> >> >> >> Mayank,
>> >> >> >>   OOZIE-1231 already has the changes to show the MapReduce job id in
>> >> >> >> the Child job page to be consistent with other job types. The v1 API
>> >> >> >> has the older behaviour with the map job url in externalId, while the
>> >> >> >> v2 API has it in childjobids.  So there is a UI change but the v1 REST
>> >> >> >> API has not changed. But OOZIE-1231 has not changed any code with
>> >> >> >> respect to id swap.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >> Rohini
>> >> >> >>
>> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <[email protected]> wrote:
>> >> >> >>
>> >> >> >> > Ya, I saw a precondition failed message.
>> >> >> >> >
>> >> >> >> > I just tried out what happens when the job is SUSPENDED, the action
>> >> >> >> > is START_MANUAL, and the JT recovers the hadoop job: It doesn't
>> >> >> >> > continue the workflow.  It fails the eagerVerifyPrecondition from
>> >> >> >> > CompletedActionXCommand because the action isn't RUNNING.  Perhaps
>> >> >> >> > we should make the CallbackService change the status in this
>> >> >> >> > situation?
>> >> >> >> >
>> >> >> >> > Just to clarify, the above only happens when the JT has been down
>> >> >> >> > long enough that the ActionCheckXCommand (every 10min by default) +
>> >> >> >> > the retries (3 x 1min) happen.  If it comes back sooner than that,
>> >> >> >> > everything works fine.
>> >> >> >> >
>> >> >> >> > thanks
>> >> >> >> > - Robert
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
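
A back-of-the-envelope version of the timing in the message above, using the figures quoted there (check every 10min, 3 retries of 1min). The class and method names are hypothetical, and the bounds rest on a simplifying assumption: the check fires somewhere between immediately and one full interval after the JT goes down, and gives up once all retries have failed.

```java
import java.time.Duration;

/** Hypothetical arithmetic helper; the inputs mirror the figures quoted in the thread. */
final class CheckWindowSketch {

    /** Earliest give-up: a check fires right after the JT goes down, then all retries fail. */
    static Duration earliestGiveUp(int retries, Duration retryDelay) {
        return retryDelay.multipliedBy(retries);
    }

    /** Latest give-up: a full check interval passes first, then all retries fail. */
    static Duration latestGiveUp(Duration checkInterval, int retries, Duration retryDelay) {
        return checkInterval.plus(retryDelay.multipliedBy(retries));
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10);
        Duration retryDelay = Duration.ofMinutes(1);
        System.out.println("earliest: " + earliestGiveUp(3, retryDelay).toMinutes() + " min");
        System.out.println("latest:   " + latestGiveUp(interval, 3, retryDelay).toMinutes() + " min");
    }
}
```

With those inputs the window is roughly 3 to 13 minutes of JT downtime before the action would be moved to START_MANUAL, under the stated assumption.
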
>> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > > Oh, okay. Seems like RecoveryService queues the StartX command but
>> >> >> >> > > the verifyPrecondition() fails as the wf job is
>> >> >> >> > > SUSPENDED (please verify this from the logs).
>> >> >> >> > >
>> >> >> >> > > In that case, if Oozie is not auto-retrying and resubmitting, then
>> >> >> >> > > it seems fair to have the JT recover the job.
>> >> >> >> > > But if the JT recovers the job, can we make sure that the workflow
>> >> >> >> > > job transitions to RUNNING from SUSPENDED and the wf action from
>> >> >> >> > > START_MANUAL to RUNNING?
>> >> >> >> > > It should not happen that the user resumes the job, which makes
>> >> >> >> > > Oozie submit a new hadoop job, while the JT is also recovering the
>> >> >> >> > > same job.
>> >> >> >> > > Also, I think the error can still be considered transient from
>> >> >> >> > > Oozie's perspective as it is temporary depending on the state of
>> >> >> >> > > the JT.
>> >> >> >> > >
>> >> >> >> > > Thanks,
>> >> >> >> > > Virag
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]>
>> >>wrote:
>> >> >> >> > >
>> >> >> >> > > >Virag,
>> >> >> >> > > >I just tested out killing the JT and waiting for the Checker
>> >> >> >> > > >service to retry and give up: the action goes to START_MANUAL and
>> >> >> >> > > >the job gets SUSPENDED.  I waited around long enough, but the
>> >> >> >> > > >RecoveryService didn't do anything.  Does it kick in for you?  As
>> >> >> >> > > >a side note, looking at the code, the RecoveryService looks like
>> >> >> >> > > >it can handle START_MANUAL, END_MANUAL, and USER_RETRY, which all
>> >> >> >> > > >sound like things the user should be doing; is it correct that
>> >> >> >> > > >RecoveryService is handling these?
>> >> >> >> > > >The Unknown Hadoop Job error happens when the JT comes back in
>> >> >> >> > > >time because it won't know about the old ID if it's not recovering
>> >> >> >> > > >jobs.  So, Oozie tries to ask it about a job that no longer
>> >> >> >> > > >exists.  I'm not sure that this should be a transient error
>> >> >> >> > > >because there's no way to determine if it's because the JT
>> >> >> >> > > >restarted and Oozie should resubmit the job or if something else
>> >> >> >> > > >happened.
>> >> >> >> > > >
>> >> >> >> > > >Mayank,
>> >> >> >> > > >That is a good point.  We could either make a v3 API or add an
>> >> >> >> > > >oozie-site config to turn on/off the id swap behavior and keep the
>> >> >> >> > > >v2 API.
>> >> >> >> > > >
>> >> >> >> > > >thanks
>> >> >> >> > > >- Robert
>> >> >> >> > > >
>> >> >> >> > > >
>> >> >> >> > > >
>> >> >> >> > > >
>> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal
>> >> >><[email protected]>
>> >> >> >> > wrote:
>> >> >> >> > > >
>> >> >> >> > > >> Robert,
>> >> >> >> > > >>
>> >> >> >> > > >> That's a break in backward compatibility. Till now users are
>> >> >> >> > > >> used to clicking on the link to go to the MR page.
>> >> >> >> > > >>
>> >> >> >> > > >> Is there a better way to handle this?
>> >> >> >> > > >>
>> >> >> >> > > >> Thanks,
>> >> >> >> > > >> Mayank
>> >> >> >> > > >>
>> >> >> >> > > >>
>> >> >> >> > > >>
>> >> >> >> > > >>
>> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]> wrote:
>> >> >> >> > > >>
>> >> >> >> > > >> > Mona,
>> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just
>> >> >> >> > > >> > retrying to connect to the JT (which is why when the JT comes
>> >> >> >> > > >> > back up, Oozie can continue monitoring the hadoop job if it
>> >> >> >> > > >> > still has the same ID); it doesn't try to submit the job again
>> >> >> >> > > >> > as part of the "retry".
>> >> >> >> > > >> >
>> >> >> >> > > >> > Mayank,
>> >> >> >> > > >> > We can put the ID for the actual job in the Child IDs tab (like
>> >> >> >> > > >> > with Pig).
>> >> >> >> > > >> >
>> >> >> >> > > >> >
>> >> >> >> > > >> > - Robert
>> >> >> >> > > >> >
>> >> >> >> > > >> >
>> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]> wrote:
>> >> >> >> > > >> >
>> >> >> >> > > >> > > I agree, we should handle these two scenarios. I am ok with
>> >> >> >> > > >> > > changing the launcher behavior for MR; however, if we remove
>> >> >> >> > > >> > > the id swap, then how do we navigate to MR jobs from the UI
>> >> >> >> > > >> > > as we do right now?
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > Thanks,
>> >> >> >> > > >> > > Mayank
>> >> >> >> > > >> > >
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]> wrote:
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the
>> >> >> >> > > >> > > > launcher recover to 0 and job to 1; then consider these two
>> >> >> >> > > >> > > > scenarios:
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before the
>> >> >> >> > > >> > > > launcher job actually launches the real job:
>> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told
>> >> >> >> > > >> > > >        it not to
>> >> >> >> > > >> > > >      - The real job was never launched
>> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as failed
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets restarted
>> >> >> >> > > >> > > > before the Oozie server has a chance to swap IDs (it's not
>> >> >> >> > > >> > > > an atomic operation):
>> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told
>> >> >> >> > > >> > > >        it not to
>> >> >> >> > > >> > > >      - The real job will be recovered and finish successfully
>> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even though the
>> >> >> >> > > >> > > >        actual job succeeded because it didn't know about the
>> >> >> >> > > >> > > >        ID swap
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > It would only work for the case where the JT gets restarted
>> >> >> >> > > >> > > > after the ID swap occurs.
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > - Robert
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]> wrote:
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > > Hi Robert,
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > +1 for oozie to set launcher to 1 and jobs to 0 for
>> >> >> >> > > >> > > > > recovery in all the cases except MR.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > After the ID is swapped, Oozie only knows about the MR job,
>> >> >> >> > > >> > > > > doesn't it? Then there should not be any problem.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1 then the
>> >> >> >> > > >> > > > > job will succeed in case of a JT restart.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > Am I missing something?
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > Thanks,
>> >> >> >> > > >> > > > > Mayank
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]> wrote:
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop Job"
>> >> >> >> > > >> > > > > > error message because Oozie tries to look up the Hadoop
>> >> >> >> > > >> > > > > > Job ID it already has, but the JT no longer has that ID
>> >> >> >> > > >> > > > > > because it was restarted.  With JT Recoverability turned
>> >> >> >> > > >> > > > > > on, it will restart the job using the same ID, so Oozie
>> >> >> >> > > >> > > > > > continues just fine.
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > - Robert
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini
>> >>Palaniswamy
>> >> >> >> > > >> > > > > > <[email protected]>wrote:
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > > Wouldn't oozie poll for the job status and decide that
>> >> >> >> > > >> > > > > > > it has failed, and when the JT comes up launch another
>> >> >> >> > > >> > > > > > > one if retry is configured?
>> >> >> >> > > >> > > > > > >
>> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]> wrote:
>> >> >> >> > > >> > > > > > >
>> >> >> >> > > >> > > > > > > > Hi,
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > We looked into how to support Job Recoverability (i.e.
>> >> >> >> > > >> > > > > > > > the JT is restarted and it wants to restart the jobs
>> >> >> >> > > >> > > > > > > > that were running; similarly for YARN) and have a
>> >> >> >> > > >> > > > > > > > pretty simple solution for all of the action types
>> >> >> >> > > >> > > > > > > > except for MapReduce.  If we set
>> >> >> >> > > >> > > > > > > > mapreduce.job.restart.recover=true for the launcher
>> >> >> >> > > >> > > > > > > > job and mapreduce.job.restart.recover=false for the
>> >> >> >> > > >> > > > > > > > jobs launched by the launcher, then when the JT
>> >> >> >> > > >> > > > > > > > restarts, it will recover the launcher job but not the
>> >> >> >> > > >> > > > > > > > child jobs -- the launcher job will then take care of
>> >> >> >> > > >> > > > > > > > relaunching the child jobs.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > For MapReduce, because of the optimization with the id
>> >> >> >> > > >> > > > > > > > swap, this won't work.  It would be very tricky, if
>> >> >> >> > > >> > > > > > > > it's even practical, to do something similar for the
>> >> >> >> > > >> > > > > > > > MR action.  Instead, we think it would be best if we
>> >> >> >> > > >> > > > > > > > simply remove the MR optimization and make it just
>> >> >> >> > > >> > > > > > > > like the other action types.  I know we normally don't
>> >> >> >> > > >> > > > > > > > want to remove optimizations, but there are many
>> >> >> >> > > >> > > > > > > > advantages in this case, and it's only saving a single
>> >> >> >> > > >> > > > > > > > Map slot for MR jobs only.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > I've created OOZIE-1483
>> >> >> >> > > >> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483>
>> >> >> >> > > >> > > > > > > > with more details and should have a patch soon.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > Thoughts?
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > thanks
>> >> >> >> > > >> > > > > > > > - Robert
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > >
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > >
>> >> >> >> > > >> >
>> >> >> >> > > >>
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >
>> >
>> >--
>> >Alejandro
>>
>>
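
The launcher/child settings proposed in the original message at the bottom of this thread can be pictured as two small configuration maps. This is a sketch only: the property key is spelled as in that message, while the class and method names are hypothetical and plain maps stand in for real Hadoop job configurations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; plain maps stand in for Hadoop job configurations. */
final class RecoverFlagsSketch {
    // Key spelled as in the original message of this thread.
    static final String RECOVER_KEY = "mapreduce.job.restart.recover";

    /** Launcher job: ask the JT to recover it after a restart. */
    static Map<String, String> launcherConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(RECOVER_KEY, "true");
        return conf;
    }

    /** Jobs started by the launcher: not recovered, the launcher relaunches them. */
    static Map<String, String> childConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(RECOVER_KEY, "false");
        return conf;
    }

    public static void main(String[] args) {
        System.out.println("launcher: " + launcherConf());
        System.out.println("child:    " + childConf());
    }
}
```

As the thread explains, this split does not carry over to the MR action while the ID swap is in place, because Oozie stops tracking the launcher after the swap.
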
