[ https://issues.apache.org/jira/browse/MAPREDUCE-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans updated MAPREDUCE-4611:
-------------------------------------------

    Attachment: MR-4611.txt

This patch changes the AM to clean up only when the job has finished, or when it is the last retry for the AM. I have manually tested this in addition to adding unit tests.

> MR AM dies badly when Node is decomissioned
> -------------------------------------------
>
>                 Key: MAPREDUCE-4611
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4611
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>         Attachments: MR-4611.txt
>
>
> The MR AM always thinks that it is being killed by the RM when it gets a kill
> signal and it has not finished processing yet. In reality the RM kill signal
> is only sent when the client cannot communicate directly with the AM, which
> probably means that the AM is in a bad state already. The much more common
> case is that the node is marked as unhealthy or decommissioned.
> I propose that in the short term the AM will only clean up if
> # The process has been asked by the client to exit (kill)
> # The job has finished cleanly and the process is already exiting
> # This is the last of the AM retries.
> The downside here is that the .staging directory will be leaked and the job
> will not show up in the history server on a kill from the RM in some cases.
> At least until the full set of AM cleanup issues can be addressed, probably
> as part of MAPREDUCE-4428

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
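
The three cleanup conditions proposed in the description can be read as a small decision rule. The sketch below is illustrative only and is not taken from MR-4611.txt: the class StagingCleanupPolicy, the enum ShutdownReason, and the attempt/maxAttempts parameters are hypothetical names chosen for this example, and the actual patch may structure the check differently.

```java
// Hypothetical sketch of the proposed rule; none of these names come from the patch.
final class StagingCleanupPolicy {

    /** Why the AM process is shutting down (illustrative categories only). */
    enum ShutdownReason {
        CLIENT_KILL,     // the client asked the process to exit (kill)
        JOB_FINISHED,    // the job finished cleanly and the process is already exiting
        EXTERNAL_SIGNAL  // kill signal while unfinished: RM kill, unhealthy or decommissioned node
    }

    private StagingCleanupPolicy() {}

    /**
     * Returns true when the AM should clean up (for example, remove the .staging directory).
     *
     * @param reason      why the process is exiting
     * @param attempt     1-based number of the current AM attempt (assumed numbering)
     * @param maxAttempts maximum number of AM attempts allowed for the job
     */
    static boolean shouldCleanUp(ShutdownReason reason, int attempt, int maxAttempts) {
        if (reason == ShutdownReason.CLIENT_KILL) {
            return true;   // condition 1: explicit kill from the client
        }
        if (reason == ShutdownReason.JOB_FINISHED) {
            return true;   // condition 2: clean finish
        }
        // condition 3: any other signal only triggers cleanup on the last retry,
        // so that a later attempt can still find the job's staging files
        return attempt >= maxAttempts;
    }

    public static void main(String[] args) {
        // Node decommissioned during the first of two attempts: leave state for the retry.
        System.out.println(shouldCleanUp(ShutdownReason.EXTERNAL_SIGNAL, 1, 2));
        // Same signal on the final attempt: nothing will retry, so clean up.
        System.out.println(shouldCleanUp(ShutdownReason.EXTERNAL_SIGNAL, 2, 2));
    }
}
```

Under this rule, a kill from the RM on the last attempt still cleans up, while the same signal on an earlier attempt leaves the .staging directory in place for the retry.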