[ https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991486#comment-14991486 ]
Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

I don't think the AM should reboot to retry after a commit failure. "Commit repeatable" implies that a commit can be retried if it was stopped while in progress. If the commit actually failed, there does not seem to be any reason to assume that retrying it will succeed. IMO, if the commit reports a failure then the AM should fail. Similarly, if a commit failure file exists (from a previous AM version), then the new version of the AM should respect that and fail, since the commit has been reported as failed. So, IMO, we are looking to change the code here. Thoughts?

{code}
// The commit is still pending, commit error
shutDownMessage = "Job commit from a prior MRAppMaster attempt is "
    + "potentially in progress. Preventing multiple commit executions";
forcedState = JobStateInternal.ERROR;
{code}

The javadoc could be improved. Incorporating the inline suggestions, something like:

{code}
/**
 * Returns true if an in-progress job commit can be retried. If the MR AM is
 * re-run then it will check this value to determine if it can retry an
 * in-progress commit that was started by a previous version. Note that in
 * rare scenarios, the previous AM version might still be running at that
 * time, due to system anomalies. Hence if this method returns true then the
 * retry commit operation should be able to run concurrently with the
 * previous operation.
 *
 * If repeatable job commit is supported, a job restart can tolerate previous
 * failures that occurred while the job was committing.
 *
 * By default, it is not supported. Extended classes (like
 * FileOutputCommitter) should explicitly override it if they support it.
 *
 * @param jobContext
 *          Context of the job whose output is being written.
 * @return <code>true</code> if repeatable job commit is supported,
 *         <code>false</code> otherwise
 */
{code}

Should the config name say num-retries instead of retries?
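To make the fail-vs-retry argument above concrete, here is a minimal, self-contained sketch (not the actual MRAppMaster code) of how a restarted AM could decide among retrying, failing, and running a fresh commit. The class, method, and marker-file flags (`commitFailureMarkerExists`, `commitStartedMarkerExists`) are illustrative names, not Hadoop identifiers:

```java
// Hypothetical sketch of the recovery decision argued for in this comment:
// a recorded commit failure means fail the job; an in-progress commit is
// retried only if the committer declares commit repeatable.
public class CommitRecoveryDecision {

    public enum Action { RETRY_COMMIT, FAIL_JOB, RUN_COMMIT }

    /**
     * @param commitFailureMarkerExists a prior AM recorded a commit failure
     * @param commitStartedMarkerExists a prior AM started but did not finish the commit
     * @param commitRepeatable          what isCommitJobRepeatable() would return
     */
    public static Action decide(boolean commitFailureMarkerExists,
                                boolean commitStartedMarkerExists,
                                boolean commitRepeatable) {
        if (commitFailureMarkerExists) {
            // A prior attempt reported the commit as failed; retrying is
            // unlikely to succeed, so the new AM should fail the job.
            return Action.FAIL_JOB;
        }
        if (commitStartedMarkerExists) {
            // Commit was in progress. Retry only if the committer says the
            // commit is repeatable (the previous AM may even still be
            // running, so the retry must tolerate running concurrently).
            return commitRepeatable ? Action.RETRY_COMMIT : Action.FAIL_JOB;
        }
        // No prior commit activity: proceed with a fresh commit.
        return Action.RUN_COMMIT;
    }
}
```

The point of the sketch is that the current code's "reboot and retry" path collapses into the first branch: a reported failure is terminal regardless of repeatability.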
Also, if it's num-retries then the default should be 0; if it's num-attempts then the default should be 1.

{code}
+  public static final String FILEOUTPUTCOMMITTER_FAILURE_RETRIES =
+      "mapreduce.fileoutputcommitter.failures.retry";
{code}

Doesn't the retry-count check in the catch block subsume the retry-count check in the while condition?

{code}
+    while (jobCommitNotFinished && (retries++ < retriesOnFailure)) {
...
+      } catch (Exception e) {
+        if (retries >= retriesOnFailure) {
{code}

The previous operation could delete the path after the if check has succeeded. So we probably also need to catch FileNotFoundException here and ignore it if repeatableCommit is true.

{code}
+    // if the job allows repeatable commit and pendingJobAttemptsPath could be
+    // deleted by a previous AM, we do nothing here.
+    if (isCommitJobRepeatable(context) && !fs.exists(pendingJobAttemptsPath)) {
+      return;
+    }
{code}

Do the test cases need an @Test annotation?

{code}
+  public void testCommitterWithFailureV1() throws Exception {
+    testCommitterWithFailureInternal(1);
{code}

firstTimeFail is probably a clearer name for what it's doing: failing on the first attempt.

{code}
+  boolean firstTimeCommit = true;
{code}

It would be good to have a test that version 2 with retry = 1 also fails. Are test cases missing for the specific create/delete operation changes in FileOutputCommitter?

> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, MAPREDUCE-5485-v1.patch
>
> There are chances that the MRAppMaster crashes during job commit, or a NodeManager restart causes the committing AM to exit due to container expiry. In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job becomes unnecessary. Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
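The retry-loop concerns raised in the review above (the redundant retry-count check in the catch block, and tolerating a FileNotFoundException when a concurrent previous AM may have already deleted the pending path) can be sketched with a minimal, self-contained Java example. This is not Hadoop code; `CommitOp`, `commitWithRetries`, and the parameters are illustrative names only:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical sketch of a bounded commit-retry loop. With the attempt
// bound enforced in one place, the catch block only decides whether to
// swallow the exception (another attempt remains) or rethrow (exhausted),
// rather than duplicating the while-condition's count check. It also
// ignores FileNotFoundException when the commit is repeatable, since a
// concurrent previous AM may have finished and deleted the pending path.
public class CommitRetryLoop {

    /** Functional stand-in for the real commit operation. */
    public interface CommitOp {
        void run() throws IOException;
    }

    public static void commitWithRetries(CommitOp op, int maxAttempts,
                                         boolean commitRepeatable) throws IOException {
        int attempt = 0;
        while (true) {
            try {
                op.run();
                return; // commit succeeded
            } catch (FileNotFoundException fnfe) {
                if (commitRepeatable) {
                    // Pending path already gone: a concurrent prior AM
                    // likely completed the commit, so treat it as done.
                    return;
                }
                throw fnfe;
            } catch (IOException ioe) {
                attempt++;
                if (attempt >= maxAttempts) {
                    throw ioe; // attempts exhausted: propagate the failure
                }
                // otherwise fall through and retry
            }
        }
    }
}
```

With a retries-style config defaulting to 0, `maxAttempts` here would be `retries + 1`, matching the num-retries vs. num-attempts default question raised in the review.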