[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981829#comment-14981829
 ] 

Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

The repeatable commit API seems useful. However, I don't understand why this
patch also changes the AM code to retry commits upon an exception in
commitJob() itself.

From an offline conversation with [~djp], my understanding is that some commit
operations can time out (e.g. deleting many nested dirs), so a retry can
prevent job failures.

This is where making delete repeatable in the committer code (by catching the
file-does-not-exist exception) will help. This will make the commit operation
actually repeatable. Perhaps we can do that in a separate JIRA that covers
making the different commit steps repeatable, e.g. also creating the success
marker file with an overwrite option so that it does not fail if the file
already exists.
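
A rough sketch of the two repeatable steps mentioned above, to make the idea
concrete. It uses plain java.nio.file instead of the Hadoop FileSystem API so
that it stays self-contained; the class and method names are hypothetical, and
"_SUCCESS" is assumed as the marker file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: names are hypothetical, and java.nio.file stands in for
// the Hadoop FileSystem API.
class RepeatableCommitSteps {

    /** Deletes a file or empty directory; a missing path counts as success. */
    public static void deleteRepeatable(Path path) throws IOException {
        try {
            Files.delete(path);
        } catch (NoSuchFileException e) {
            // Already gone, e.g. removed by an earlier commit attempt.
        }
    }

    /** Creates the success marker, overwriting one left by an earlier attempt. */
    public static Path writeSuccessMarker(Path outputDir) throws IOException {
        Path marker = outputDir.resolve("_SUCCESS"); // assumed marker name
        return Files.write(marker, new byte[0],
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Path scratch = Files.createFile(out.resolve("scratch"));
        // Running every step twice simulates a repeated commit.
        for (int attempt = 0; attempt < 2; attempt++) {
            deleteRepeatable(scratch);
            writeSuccessMarker(out);
        }
        System.out.println(Files.exists(out.resolve("_SUCCESS")));
    }
}
```

With both steps written this way, the second pass through the loop neither
fails on the missing scratch file nor on the existing marker.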

More importantly, though, the retry of the commit should probably live inside
the committer itself. Moving it all the way up to the AM is a leaky
abstraction and also requires the implementation to be repeated across AMs
(MR, Tez, Spark, etc.). And we cannot just retry in the AM on any exception,
because we don't understand the semantics of the userland commit. IMO, this
should move into the committer, so that it retries the internal commit
operations based on a retry config and on the semantics of those
operations.
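
To illustrate, here is a minimal sketch of a retry that lives inside the
committer. Everything in it is hypothetical (the RetryingCommitter class, the
CommitStep interface, the maxAttempts setting); it is not the existing
OutputCommitter API. The point is only that the committer knows which of its
steps are safe to repeat, and the AM does not.

```java
import java.io.IOException;

// Illustrative only: all names here are hypothetical, not Hadoop APIs.
class RetryingCommitter {

    /** One internal commit step, e.g. a delete or a rename. */
    public interface CommitStep {
        void run() throws IOException;
    }

    private final int maxAttempts; // would come from a retry config

    public RetryingCommitter(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * Runs a step, retrying only if the committer declares it repeatable.
     * Non-repeatable steps get exactly one attempt.
     */
    public void runStep(CommitStep step, boolean repeatable) throws IOException {
        int attempts = repeatable ? maxAttempts : 1;
        IOException last = null;
        for (int i = 1; i <= attempts; i++) {
            try {
                step.run();
                return;
            } catch (IOException e) {
                last = e; // remember the failure and try again if allowed
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        RetryingCommitter committer = new RetryingCommitter(3);
        int[] calls = {0};
        // A step that times out twice before succeeding.
        committer.runStep(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated timeout");
            }
        }, true);
        System.out.println("attempts used: " + calls[0]);
    }
}
```

The AM then calls commitJob() once and does not need to know anything about
which failures are retryable.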


> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch
>
>
> There is a chance that the MRAppMaster crashes during job commit, or that a
> NodeManager restart causes the committing AM to exit because its container
> expires. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
