[ https://issues.apache.org/jira/browse/MAPREDUCE-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446711#comment-13446711 ]
Hudson commented on MAPREDUCE-4611:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1183 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1183/])
MAPREDUCE-4611. MR AM dies badly when Node is decommissioned (Robert Evans via tgraves) (Revision 1379599)

Result = SUCCESS
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379599
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryEventHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java

> MR AM dies badly when Node is decommissioned
> --------------------------------------------
>
>                 Key: MAPREDUCE-4611
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4611
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 0.23.3, 3.0.0, 2.2.0-alpha
>
>         Attachments: MR-4611.txt
>
>
> The MR AM always thinks that it is being killed by the RM when it gets a kill
> signal and it has not finished processing yet.
> In reality the RM kill signal is only sent when the client cannot
> communicate directly with the AM, which probably means that the AM is in a
> bad state already. The much more common case is that the node is marked as
> unhealthy or decommissioned.
> I propose that in the short term the AM will only clean up if:
> # The process has been asked by the client to exit (kill)
> # The job has finished cleanly and the process is exiting already
> # This is the last of the AM retries.
> The downside here is that the .staging directory will be leaked and the job
> will not show up in the history server on a kill from the RM in some cases,
> at least until the full set of AM cleanup issues can be addressed, probably
> as part of MAPREDUCE-4428.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
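
For readers who want the three cleanup conditions from the issue description in concrete form, here is a minimal sketch of the decision as a single predicate. It is an illustration only, not the code committed in revision 1379599: the class name AmCleanupPolicy, the method shouldCleanUp, and all of its parameters are hypothetical and do not come from MRAppMaster or any Hadoop API.

```java
/**
 * Hypothetical sketch of the short-term cleanup rule proposed in the issue.
 * None of these names exist in Hadoop; they are assumptions for illustration.
 */
public final class AmCleanupPolicy {

    private AmCleanupPolicy() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Decides whether the AM should clean up (for example, remove the
     * .staging directory) when it is shutting down.
     *
     * @param clientRequestedKill the client explicitly asked the job to exit
     * @param jobFinishedCleanly  the job completed and the AM is already exiting
     * @param attempt             1-based number of the current AM attempt (assumed)
     * @param maxAttempts         maximum number of AM attempts allowed (assumed)
     * @return true if any one of the three proposed conditions holds
     */
    public static boolean shouldCleanUp(boolean clientRequestedKill,
                                        boolean jobFinishedCleanly,
                                        int attempt,
                                        int maxAttempts) {
        // Condition 3: no further AM retry will follow this attempt.
        boolean lastRetry = attempt >= maxAttempts;
        return clientRequestedKill || jobFinishedCleanly || lastRetry;
    }

    public static void main(String[] args) {
        // Kill signal on a decommissioned node, first of two attempts:
        // none of the conditions hold, so state is kept for the retry.
        System.out.println(shouldCleanUp(false, false, 1, 2));
        // Same signal on the final attempt: clean up, nothing will retry.
        System.out.println(shouldCleanUp(false, false, 2, 2));
    }
}
```

Under this rule a bare kill signal no longer triggers cleanup by itself, which matches the trade-off noted in the description: a genuine kill from the RM before the last attempt can leave the .staging directory behind.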