Re: Job Recoverability

Robert Kanter Tue, 06 Aug 2013 10:01:17 -0700

I think you usually just get the "Unknown Hadoop Job" error message because
Oozie tries to look up the Hadoop Job ID it already has, but the JT no
longer knows that ID after being restarted.  With JT Recoverability turned
on, the JT will restart the job using the same ID, so Oozie continues just
fine.
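
As a rough illustration of the lookup described above, here is a toy model.
It is not Oozie or Hadoop code: ToyJobTracker, poll, and the job ID are all
made up for the example, and the only behavior it assumes is that a restart
without recovery forgets job IDs while a restart with recovery keeps them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class JtRestartSketch {

    /** Hypothetical stand-in for the JobTracker: maps job ID to status. */
    static class ToyJobTracker {
        private final Map<String, String> jobs = new HashMap<>();

        void submit(String jobId) {
            jobs.put(jobId, "RUNNING");
        }

        /** Without recovery a restart forgets every job; with it, IDs survive. */
        void restart(boolean recover) {
            if (!recover) {
                jobs.clear();
            }
        }

        Optional<String> status(String jobId) {
            return Optional.ofNullable(jobs.get(jobId));
        }
    }

    /** What the polling side sees when it asks about the ID it stored earlier. */
    static String poll(ToyJobTracker jt, String storedJobId) {
        return jt.status(storedJobId).orElse("Unknown Hadoop Job");
    }

    public static void main(String[] args) {
        String id = "job_201308060001_0001"; // made-up ID for illustration

        // Restart without recovery: the stored ID no longer resolves.
        ToyJobTracker jt = new ToyJobTracker();
        jt.submit(id);
        jt.restart(false);
        System.out.println(poll(jt, id));

        // Restart with recovery: the same ID is still known.
        jt.submit(id);
        jt.restart(true);
        System.out.println(poll(jt, id));
    }
}
```

The first poll falls through to the "Unknown Hadoop Job" case and the second
still finds the job, which is the difference recoverability makes from
Oozie's point of view.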


- Robert


On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
<[email protected]> wrote:

> Wouldn't Oozie poll for the job status, decide that it has failed, and
> launch another one when the JT comes up, if retry is configured?
>
> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
> wrote:
>
> > Hi,
> >
> > We looked into how to support Job Recoverability (i.e. the JT is restarted
> > and it wants to restart the jobs that were running; similarly for YARN) and
> > have a pretty simple solution for all of the action types except for
> > MapReduce.  If we set mapreduce.job.restart.recover=true for the launcher
> > job and mapreduce.job.restart.recover=false for the jobs launched by the
> > launcher, then when the JT restarts, it will recover the launcher job but
> > not the child jobs -- the launcher job will then take care of relaunching
> > the child jobs.
> >
> > For MapReduce, because of the optimization with the id swap, this won't
> > work.  It would be very tricky, if it's even practical, to do something
> > similar for the MR action.  Instead, we think it would be best if we simply
> > remove the MR optimization and make it just like the other action types.  I
> > know we normally don't want to remove optimizations, but there are many
> > advantages in this case, and it only saves a single Map slot, and only for
> > MR jobs.
> >
> > I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483>
> > with more details and should have a patch soon.
> >
> > Thoughts?
> >
> >
> > thanks
> > - Robert
> >
>
