[ https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992102#comment-14992102 ]
Junping Du commented on MAPREDUCE-5485: --------------------------------------- Thanks Bikas for review and comments! bq. If the commit actually failed then there does not seem any reason to assume that retrying it will succeed. IMO if the commit reports a failure then AM should fail. Similarly, if a commit failure file exists (from a previous AM version) then the new version of the AM should respect that and fail since the commit has been reported to be failed. There are still reasons that related to AM specific, i.e. previous AM cannot connect to FS (FS or other CloudFS), committer mis-behavior because of getting loaded incorrect (due to classpath or other defect), etc. I think it make sense to do the best effort to retry the commit failure (like other reason to cause AM failure) given the commit is repeatable and all tasks are done successfully. bq. Javadoc could be improved. Inline Yes. I will. bq. num-retries instead of retries? Also, if its num-retries then default should be 0. If its num-attempts then default should be 1. Ok. I will update to attempts. default to be 1 means no retry to keep consistent with previous behavior. bq. Retry count checking code in the catch block subsumes the check retry count check in the while block? I don't think so. Can you take a look at it again? bq. The previous operation could delete the path after the if check has succeeded. So we probably also need to catch FileNotFoundException exception class here and ignore it if repeatableCommit is true. That's good point. Will fix it. bq. Do testcases need an @Test annotation? No. The test class extends TestCase, so all method start with "test" will be executed automatically. bq. firstTimeFail is probably a more clear name for what its doing - failing on the first attempt. Would be good to have a test that version 2 and retry = 1 will fail also. Testcases missing for specific changes in FileOutputCommitter for create/delete operation changes? Sounds good. Will fix/add later. > Allow repeating job commit by extending OutputCommitter API > ----------------------------------------------------------- > > Key: MAPREDUCE-5485 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Affects Versions: 2.1.0-beta > Reporter: Nemon Lou > Assignee: Junping Du > Priority: Critical > Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, > MAPREDUCE-5485-v1.patch > > > There are chances MRAppMaster crush during job committing,or NodeManager > restart cause the committing AM exit due to container expire.In these cases > ,the job will fail. > However,some jobs can redo commit so failing the job becomes unnecessary. > Let clients tell AM to allow redo commit or not is a better choice. > This idea comes from Jason Lowe's comments in MAPREDUCE-4819 -- This message was sent by Atlassian JIRA (v6.3.4#6332)