[ https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981829#comment-14981829 ]
Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

The repeatable commit API seems useful. However, I don't understand why this patch also changes the AM code to retry commits upon an exception in commitJob() itself.

From an offline conversation with [~djp], my understanding is that some commit operations can time out (e.g. deleting many nested dirs), so a retry can prevent job failures. This is where making delete repeatable in the committer code (by catching the file-does-not-exist exception) will help. This will make the commit operation actually repeatable. Perhaps we can do that in a separate jira which includes making the different commit steps repeatable, e.g. also creating the success marker file with an overwrite option so that it does not fail if the file exists.

More important though, the retry of the commit should probably be inside the committer itself. Moving it all the way up to the AM leaks abstractions and also requires the implementation to be repeated across AMs (MR, Tez, Spark etc.). And we cannot just retry in the AM on any exception, because we don't understand the semantics of the user-land commit. IMO, this should move into the committer so that it retries the internal commit operations (based on a retry-able config) according to the semantics of those operations.

> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may
> cause the committing AM to exit because its container expires. In these
> cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better
> choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
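
To make the suggestion in the comment concrete, here is a minimal sketch of repeatable commit steps with the retry kept inside the committer rather than in the AM. It is not the patch and not Hadoop's OutputCommitter code: it uses java.nio.file on a local directory as a stand-in for the Hadoop FileSystem API, and the class name RepeatableCommitSketch, the CommitStep interface, the method names, and the maxAttempts parameter (standing in for the retry config) are all hypothetical. It also covers only the two steps named in the comment, temp-dir cleanup and the success marker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RepeatableCommitSketch {

    /** Hypothetical: one internal commit step that is safe to run again. */
    interface CommitStep {
        void run() throws IOException;
    }

    /** Deletes a directory tree; a path that is already gone counts as success. */
    static void deleteTreeIfPresent(Path root) throws IOException {
        if (Files.notExists(root)) {
            return; // an earlier attempt already removed it
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.deleteIfExists(p); // tolerate entries that vanished meanwhile
        }
    }

    /** Writes the success marker, overwriting one left by an earlier attempt. */
    static void writeSuccessMarker(Path outputDir) throws IOException {
        Files.write(outputDir.resolve("_SUCCESS"), new byte[0],
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    /** Retries a single step; only IOException is treated as retryable here. */
    static void runWithRetries(CommitStep step, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.run();
                return;
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last != null ? last : new IOException("maxAttempts must be at least 1");
    }

    /** The committer retries its own steps, so callers need no retry logic. */
    static void commitJob(Path outputDir, Path tempDir, int maxAttempts) throws IOException {
        runWithRetries(() -> deleteTreeIfPresent(tempDir), maxAttempts);
        runWithRetries(() -> writeSuccessMarker(outputDir), maxAttempts);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Path tmp = out.resolve("_temporary");
        Files.createDirectories(tmp.resolve("0"));
        Files.write(tmp.resolve("0").resolve("part-00000"),
                "data".getBytes(StandardCharsets.UTF_8));
        commitJob(out, tmp, 3);
        commitJob(out, tmp, 3); // running the whole commit again does not fail
    }
}
```

Because each step tolerates the state left behind by an earlier attempt, the committer can decide per step what is retryable, which is the knowledge the AM lacks for a user-land commit.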