[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4611:
-------------------------------------------

    Attachment: MR-4611.txt

This patch changes the AM so that it only cleans up when the job has finished, 
or when it is the last retry for the AM.

I have manually tested this in addition to adding unit tests.
                
> MR AM dies badly when Node is decomissioned
> -------------------------------------------
>
>                 Key: MAPREDUCE-4611
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4611
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>         Attachments: MR-4611.txt
>
>
> The MR AM always thinks that it is being killed by the RM when it gets a kill 
> signal and it has not finished processing yet.  In reality the RM kill signal 
> is only sent when the client cannot communicate directly with the AM, which 
> probably means that the AM is in a bad state already.  The much more common 
> case is that the node is marked as unhealthy or decommissioned.
> I propose that in the short term the AM will only clean up if one of the 
> following holds:
>  # The process has been asked by the client to exit (kill)
>  # The job has finished cleanly and the process is exiting already
>  # This is the last of the AM retries.
> The downside here is that the .staging directory will be leaked and the job 
> will not show up in the history server on a kill from the RM in some cases, 
> at least until the full set of AM cleanup issues can be addressed, probably 
> as part of MAPREDUCE-4428.
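
The three conditions in the description above can be read as a single boolean check. What follows is only a minimal sketch of that rule, not the code in MR-4611.txt: the class name AmCleanupPolicy, the method shouldCleanUp, and all four parameters are hypothetical, and the attempt counter is assumed to be 1-based.

```java
/**
 * Hypothetical sketch of the proposed short-term cleanup rule.
 * None of these names come from the Hadoop code base or from the patch.
 */
final class AmCleanupPolicy {
    private AmCleanupPolicy() {}

    /**
     * @param killedByClient the client asked the AM to exit (kill)
     * @param jobFinished    the job finished cleanly and the AM is already exiting
     * @param attempt        current AM attempt, assumed to be 1-based
     * @param maxAttempts    maximum number of AM attempts allowed (assumed input)
     * @return true if the AM should clean up on this shutdown
     */
    static boolean shouldCleanUp(boolean killedByClient, boolean jobFinished,
                                 int attempt, int maxAttempts) {
        // Condition 3: no further AM attempt will follow this one.
        boolean lastRetry = attempt >= maxAttempts;
        // Any one of the three conditions is enough to allow cleanup.
        return killedByClient || jobFinished || lastRetry;
    }
}
```

Under these assumptions, a kill signal that arrives on attempt 1 of 4 with neither flag set makes shouldCleanUp return false, so cleanup is skipped for that attempt.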

