Adding to Robert's comment: Oozie retry currently does not take into account whether the JT is down or restarting. It retries (up to the user-configurable max) in quick succession and then gives up. If the JT is expected to be down longer than roughly (retry interval x max retries), then recovery on the JT side will be an advantage. However, in the case of a transient error rather than a longer maintenance window, wouldn't both Oozie and the JT end up retrying the same job?
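To make the comparison concrete, here is a rough sketch of the retry-window arithmetic. It assumes a fixed interval between retries; the function and parameter names are hypothetical and are not Oozie's actual API or configuration keys.

```python
def retry_window_seconds(retry_interval_s: float, max_retries: int) -> float:
    """Approximate time span covered by retries, assuming a fixed interval.

    Both parameters are illustrative stand-ins for the user-configured
    retry interval and retry max, not real Oozie property names.
    """
    if retry_interval_s < 0 or max_retries < 0:
        raise ValueError("retry interval and max retries must be non-negative")
    return retry_interval_s * max_retries


def retries_outlast_outage(retry_interval_s: float,
                           max_retries: int,
                           expected_outage_s: float) -> bool:
    """True if the retry window is at least as long as the expected JT outage."""
    return retry_window_seconds(retry_interval_s, max_retries) >= expected_outage_s


if __name__ == "__main__":
    # Three retries, 60 seconds apart, against a 10-minute JT restart.
    print(retry_window_seconds(60, 3))
    print(retries_outlast_outage(60, 3, 600))
```

When the second call returns False, the retries would be exhausted before the JT is back, which is the case where JT-side recovery helps. When it returns True, both mechanisms could be active at once, which is the double-retry concern above.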

On 8/6/13 9:59 AM, "Robert Kanter" <[email protected]> wrote:

>I think you usually just get the "Unknown Hadoop Job" error message because
>Oozie tries to look up the Hadoop Job ID it already has, but the JT no
>longer has that ID because it was restarted. With JT Recoverability turned
>on, it will restart the job using the same ID, so Oozie continues just
>fine.
>
>- Robert
>
>
>On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy <[email protected]> wrote:
>
>> Wouldn't oozie poll for the job status and decide that it has failed and
>> when JT comes up launch another one if retry is configured?
>>
>> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > We looked into how to support Job Recoverability (i.e. the JT is restarted
>> > and it wants to restart the jobs that were running; similarly for YARN) and
>> > have a pretty simple solution for all of the action types except for
>> > MapReduce. If we set mapreduce.job.restart.recover=true for the launcher
>> > job and mapreduce.job.restart.recover=false for the jobs launched by the
>> > launcher, then when the JT restarts, it will recover the launcher job but
>> > not the child jobs -- the launcher job will then take care of relaunching
>> > the child jobs.
>> >
>> > For MapReduce, because of the optimization with the id swap, this won't
>> > work. It would be very tricky, if it's even practical, to do something
>> > similar for the MR action. Instead, we think it would be best if we simply
>> > remove the MR optimization and make it just like the other action types. I
>> > know we normally don't want to remove optimizations, but there are many
>> > advantages in this case, and it's only saving a single Map slot for MR jobs
>> > only.
>> >
>> > I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483>
>> > with more details and should have a patch soon.
>> >
>> > Thoughts?
>> >
>> >
>> > thanks
>> > - Robert
>> >
>>
