Adding to Robert's comment,

Oozie's retry currently does not take into account whether the JT is down
or undergoing a restart. It retries (up to the user-configurable maximum)
in quick succession and then gives up. If the JT is expected to be down
longer than roughly (retry interval x number of retries), then recovering
on the JT side will be an advantage. However, in the case of a transient
error rather than a larger maintenance window, wouldn't both Oozie and the
JT end up retrying the same job?
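
To make the comparison above concrete, here is a rough sketch in Python. The
function and parameter names are hypothetical, chosen for illustration; this
is not Oozie code and does not correspond to actual Oozie settings. It only
assumes retries are spaced by a fixed interval.

```python
def total_retry_window(retry_max: int, retry_interval_s: float) -> float:
    """Approximate time covered by retries before giving up.

    Assumes a fixed interval between attempts (hypothetical model).
    """
    return retry_max * retry_interval_s


def outage_outlasts_retries(expected_downtime_s: float,
                            retry_max: int,
                            retry_interval_s: float) -> bool:
    """True when the JT is expected to stay down longer than the retry
    window, i.e. the case where JT-side recovery is the only thing that
    can bring the job back."""
    return expected_downtime_s > total_retry_window(retry_max, retry_interval_s)


# Long maintenance window: retries are exhausted before the JT returns.
print(outage_outlasts_retries(600, retry_max=3, retry_interval_s=60))
# Transient blip: the JT is back within the retry window, which is the
# case where both sides might end up acting on the same job.
print(outage_outlasts_retries(30, retry_max=3, retry_interval_s=60))
```

The second case is the one the question is about: the retry window has not
expired when the JT comes back with recovery enabled.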


On 8/6/13 9:59 AM, "Robert Kanter" <[email protected]> wrote:

>I think you usually just get the "Unknown Hadoop Job" error message because
>Oozie tries to look up the Hadoop Job ID it already has, but the JT no
>longer has that ID because it was restarted.  With JT Recoverability turned
>on, it will restart the job using the same ID, so Oozie continues just
>fine.
>
>- Robert
>
>
>On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
><[email protected]> wrote:
>
>> Wouldn't oozie poll for the job status and decide that it has failed and
>> when JT comes up launch another one if retry is configured?
>>
>> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > We looked into how to support Job Recoverability (i.e. the JT is restarted
>> > and it wants to restart the jobs that were running; similarly for YARN) and
>> > have a pretty simple solution for all of the action types except for
>> > MapReduce.  If we set mapreduce.job.restart.recover=true for the launcher
>> > job and mapreduce.job.restart.recover=false for the jobs launched by the
>> > launcher, then when the JT restarts, it will recover the launcher job but
>> > not the child jobs -- the launcher job will then take care of relaunching
>> > the child jobs.
>> >
>> > For MapReduce, because of the optimization with the id swap, this won't
>> > work.  It would be very tricky, if it's even practical, to do something
>> > similar for the MR action.  Instead, we think it would be best if we simply
>> > remove the MR optimization and make it just like the other action types.  I
>> > know we normally don't want to remove optimizations, but there are many
>> > advantages in this case, and it's only saving a single Map slot for MR jobs
>> > only.
>> >
>> > I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483>
>> > with more details and should have a patch soon.
>> >
>> > Thoughts?
>> >
>> >
>> > thanks
>> > - Robert
>> >
>>
