Hi,

We looked into how to support Job Recoverability (i.e. when the JT is restarted, it restarts the jobs that were running; similarly for YARN) and have a pretty simple solution for all of the action types except MapReduce. If we set mapreduce.job.restart.recover=true for the launcher job and mapreduce.job.restart.recover=false for the jobs launched by the launcher, then when the JT restarts, it will recover the launcher job but not the child jobs -- the launcher job will then take care of relaunching the child jobs.
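To make the idea concrete, here is a rough sketch of the two settings. It uses plain Java maps as stand-ins for the real job configurations, and the class and method names (RecoverySettingsSketch, launcherConf, childConf, recoveredOnRestart) are made up for illustration; only the property name comes from the proposal above. It is not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

public class RecoverySettingsSketch {
    // Property named in this thread.
    static final String RECOVER_KEY = "mapreduce.job.restart.recover";

    // Hypothetical helper: settings for the launcher job, which should be recovered.
    static Map<String, String> launcherConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put(RECOVER_KEY, "true");
        return conf;
    }

    // Hypothetical helper: settings for jobs started by the launcher, which
    // should NOT be recovered because the launcher relaunches them itself.
    static Map<String, String> childConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put(RECOVER_KEY, "false");
        return conf;
    }

    // Assumption: models the restart decision as a simple read of the flag;
    // a missing value is treated as "do not recover".
    static boolean recoveredOnRestart(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.get(RECOVER_KEY));
    }

    public static void main(String[] args) {
        System.out.println("launcher recovered: " + recoveredOnRestart(launcherConf()));
        System.out.println("child recovered: " + recoveredOnRestart(childConf()));
    }
}
```

With these values only the launcher comes back after a restart, and it is then responsible for starting its child jobs again.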
For MapReduce, this won't work because of the id swap optimization. It would be very tricky, if it's even practical, to do something similar for the MR action. Instead, we think it would be best to simply remove the MR optimization and make the MR action work just like the other action types. I know we normally don't want to remove optimizations, but there are many advantages in this case, and the optimization only saves a single Map slot, and only for MR jobs. I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483> with more details and should have a patch soon.

Thoughts?

thanks
- Robert
