[ https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997949#comment-14997949 ]
Bikas Saha commented on MAPREDUCE-5485: --------------------------------------- Separate jira sounds good. We may need to add a new exception that can be thrown by the committer that clearly signals that the failure can be retried (e.g. NN in safemode) vs a fatal error (e.g. missing task files). About the patch. {code}+ // cleanup previous half done commits if committer supports repeatable job + // commit. + if (isCommitJobRepeatable()) { + cleanupInterruptedCommit(conf); + }{code} Is the above check too early in the code. E.g. IIRC, at this point we have not checked whether the previous job commit was succeeded or failed - in which case we cannot recover and there is nothing to do. We should be doing this check and cleaning up files only when we detect an in-progress commit - ie. startCommit file is present and finish/failedCommit file is not present. Else with this patch, we may end up redoing commit for an already successful commit. Is this the code we need to change? {code}//The commit is still pending, commit error shutDownMessage = "Job commit from a prior MRAppMaster attempt is " + "potentially in progress. Preventing multiple commit executions"; forcedState = JobStateInternal.ERROR;{code} Also, we have already changed the startCommit operation to be repeatable via the overwrite flag. After that is there a need to delete the files upfront. Delete may be an expensive operation on some cloud stores. Mapred javadoc fixes are missing. Also there are some typos in there. E.g. {code}+ * If repeatable job commit is supported, job restart can tolerant previous <<< tolerate ? + * failures during job is committing. <<< during job commit ? {code} This part of the code change could use some tests. Tests for repeatable success marker file and FileExistsException for repeatable deletes would be good to have. > Allow repeating job commit by extending OutputCommitter API > ----------------------------------------------------------- > > Key: MAPREDUCE-5485 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Affects Versions: 2.1.0-beta > Reporter: Nemon Lou > Assignee: Junping Du > Priority: Critical > Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, > MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, > MAPREDUCE-5485-v3.patch > > > There are chances MRAppMaster crush during job committing,or NodeManager > restart cause the committing AM exit due to container expire.In these cases > ,the job will fail. > However,some jobs can redo commit so failing the job becomes unnecessary. > Let clients tell AM to allow redo commit or not is a better choice. > This idea comes from Jason Lowe's comments in MAPREDUCE-4819 -- This message was sent by Atlassian JIRA (v6.3.4#6332)