Job Recoverability

Robert Kanter Mon, 05 Aug 2013 15:12:50 -0700

Hi,

We looked into how to support Job Recoverability (i.e. when the JT restarts,
it should restart the jobs that were running; similarly for YARN) and
have a pretty simple solution for all of the action types except for
MapReduce.  If we set mapreduce.job.restart.recover=true for the launcher
job and mapreduce.job.restart.recover=false for the jobs launched by the
launcher, then when the JT restarts, it will recover the launcher job but
not the child jobs -- the launcher job will then take care of relaunching
the child jobs.
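To make the proposed split concrete, here is a minimal sketch that models only the two settings described above. It uses java.util.Properties as a stand-in for the job configuration. The class name RecoverySettings and the helpers launcherConf, childConf and isRecoverable are hypothetical illustrations, not Oozie code. In a real launcher the value would be set on the Hadoop job configuration instead.

```java
import java.util.Properties;

/**
 * Hypothetical sketch (not Oozie's actual implementation): models the
 * recovery settings proposed for the launcher job and its child jobs.
 */
public class RecoverySettings {

    // Property named in the proposal; controls whether the JT (or YARN)
    // recovers the job after a restart.
    static final String RECOVER_KEY = "mapreduce.job.restart.recover";

    // Launcher job: should be recovered when the JT restarts.
    static Properties launcherConf() {
        Properties conf = new Properties();
        conf.setProperty(RECOVER_KEY, "true");
        return conf;
    }

    // Child job launched by the launcher: not recovered; the recovered
    // launcher relaunches it instead.
    static Properties childConf() {
        Properties conf = new Properties();
        conf.setProperty(RECOVER_KEY, "false");
        return conf;
    }

    // For this sketch only, a missing value is treated as "not recoverable"
    // (Boolean.parseBoolean(null) returns false).
    static boolean isRecoverable(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(RECOVER_KEY));
    }

    public static void main(String[] args) {
        System.out.println("launcher recovered: " + isRecoverable(launcherConf()));
        System.out.println("child recovered:    " + isRecoverable(childConf()));
    }
}
```

Running main prints "launcher recovered: true" and "child recovered:    false". Only the launcher survives the restart, and it relaunches its children.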


For MapReduce, because of the optimization with the id swap, this won't
work.  It would be very tricky, if it's even practical, to do something
similar for the MR action.  Instead, we think it would be best if we simply
remove the MR optimization and make it just like the other action types.  I
know we normally don't want to remove optimizations, but there are many
advantages in this case, and it only saves a single map slot, and only for
MR actions.

I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483> with
more details and should have a patch soon.

Thoughts?


thanks
- Robert
