[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans updated MAPREDUCE-4819:
-------------------------------------------

    Attachment: MR-4819-bobby-trunk.txt

This patch should be fully functional. I have included the work by Bikas to put the job history file in a location that is deleted with the staging directory. I have fixed a few bugs in the original where we were not registering with the RM correctly, and where the Web App Proxy would return a 500 error if hit while recovery was happening.

I have manually tested this by having the AM exit/halt before, during, and after job commit. I tested it with the job commit both failing and succeeding. Everything appears to be working as expected.

I did not change JobImpl forcedState because adding in the transitions was more than I wanted to do right now. I am happy to file a follow-up JIRA to make those changes if we want them. I have also not added in the kill state. Again, it looked a bit tricky because of the multithreading, and I would prefer to get something working in now and add that as part of a follow-up JIRA.

I talked with Kihwal Lee about the extra HDFS load for an empty file vs. a directory, and he said about the only extra load is the extra RPC call to close it. Because it is just two files per job, I left it as is. If you feel strongly about it I can fix it in a separate JIRA.

About the only thing that is left for this is integration with MAPREDUCE-4832.
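For readers following along, here is a rough sketch of the kind of check a new AM attempt could make before deciding whether to rerun the job. It is not code from the patch. It assumes the "two files per job" are a commit-start marker plus a commit-outcome marker in the staging directory. The class name, the marker file names, and the `Action` values are all hypothetical, and `java.nio.file` stands in for the HDFS client so that the sketch runs on its own.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: names and marker files are illustrative assumptions,
// not taken from the MAPREDUCE-4819 patch.
class CommitRecoverySketch {
    enum Action { RUN_JOB, REPORT_SUCCEEDED, REPORT_FAILED, REPORT_ERROR }

    // Assumed marker file names, written by an earlier AM attempt.
    static final String COMMIT_STARTED = "COMMIT_STARTED";
    static final String COMMIT_SUCCESS = "COMMIT_SUCCESS";
    static final String COMMIT_FAIL = "COMMIT_FAIL";

    // Decide what a new AM attempt should do, based on markers left behind.
    static Action decide(Path stagingDir) {
        boolean started = Files.exists(stagingDir.resolve(COMMIT_STARTED));
        boolean succeeded = Files.exists(stagingDir.resolve(COMMIT_SUCCESS));
        boolean failed = Files.exists(stagingDir.resolve(COMMIT_FAIL));

        if (succeeded) {
            return Action.REPORT_SUCCEEDED; // commit finished; do not rerun
        }
        if (failed) {
            return Action.REPORT_FAILED;    // commit failed; do not rerun
        }
        if (started) {
            // The earlier attempt died mid-commit, so the output state is unknown.
            return Action.REPORT_ERROR;
        }
        return Action.RUN_JOB;              // commit never started; safe to run
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        System.out.println(decide(staging));   // no markers yet
        Files.createFile(staging.resolve(COMMIT_STARTED));
        System.out.println(decide(staging));   // AM died during commit
        Files.createFile(staging.resolve(COMMIT_SUCCESS));
        System.out.println(decide(staging));   // commit completed
    }
}
```

The three cases in `main` mirror the manual tests described above: the AM exiting before, during, and after job commit.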
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch,
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt,
> MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before
> unregistering with the RM then the RM can run another AM attempt. Currently
> AM re-attempts assume that the previous attempts did not reach a final job
> state, and that causes the job to rerun (from scratch, if the output format
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the
> job is bad for a number of reasons. If the job failed, it's confusing at
> best since the client was already told the job failed but the subsequent
> attempt could succeed. If the job succeeded there could be data loss, as a
> subsequent job launched by the client tries to consume the job's output as
> input just as the re-attempt starts removing output files in preparation for
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira