Jason Lowe created MAPREDUCE-4955:
-------------------------------------

             Summary: NM container diagnostics for excess resource usage can be lost if task fails while being killed
                 Key: MAPREDUCE-4955
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4955
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mr-am
    Affects Versions: 0.23.5, 2.0.3-alpha
            Reporter: Jason Lowe


When a nodemanager kills a container for exceeding its resource budget, it 
provides a diagnostics message in the container status explaining why the 
container was killed.  However, this message can be lost if the task fails 
during the shutdown triggered by the SIGTERM (e.g., it loses DFS leases because 
the filesystem was closed) and notifies the AM via the task umbilical *before* 
the AM receives the NM's container status message via the RM heartbeat.

In that case the task attempt fails with the task's own failure diagnostic, and 
the user is left wondering why the task really failed: the NM's diagnostics 
arrive too late, are not written to the history file, and are lost.  If the AM 
instead receives the container status via the RM heartbeat before the task fails 
during shutdown, the diagnostics are written properly to the history file and 
the user can see why the task failed.
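
The ordering dependence described above can be illustrated with a small, 
self-contained Java model.  This is only a sketch: the names DiagnosticsRace, 
Attempt, onContainerStatus and onTaskFailure are hypothetical and are not the 
MR AM's actual classes or methods, the diagnostic strings are made up, and the 
rule that the history record is finalized when the task reports its failure is 
an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race; NOT Hadoop code. All names are hypothetical. */
class DiagnosticsRace {

    // Illustrative messages only; not real NM or task log output.
    static final String NM_MSG = "NM: container killed, over resource budget";
    static final String TASK_MSG = "task: failed during shutdown";

    /** Hypothetical stand-in for the AM's view of one task attempt. */
    static class Attempt {
        private final List<String> diagnostics = new ArrayList<>();
        private String historyRecord = null; // null until the attempt is finalized

        /** Container status relayed from the NM via the RM heartbeat. */
        void onContainerStatus(String nmDiagnostics) {
            if (historyRecord != null) {
                return; // assumption: record already written, so the message is dropped
            }
            diagnostics.add(nmDiagnostics);
        }

        /** Failure reported by the task itself over the umbilical. */
        void onTaskFailure(String taskDiagnostics) {
            if (historyRecord != null) {
                return;
            }
            diagnostics.add(taskDiagnostics);
            // Assumption: the history record is finalized at this point.
            historyRecord = String.join("; ", diagnostics);
        }

        String history() {
            return historyRecord;
        }
    }

    /** Delivers the two events in the chosen order and returns the history record. */
    static String run(boolean nmStatusFirst) {
        Attempt attempt = new Attempt();
        if (nmStatusFirst) {
            attempt.onContainerStatus(NM_MSG);
            attempt.onTaskFailure(TASK_MSG);
        } else {
            attempt.onTaskFailure(TASK_MSG);
            attempt.onContainerStatus(NM_MSG); // arrives too late
        }
        return attempt.history();
    }

    public static void main(String[] args) {
        System.out.println("NM status first:    " + run(true));
        System.out.println("Task failure first: " + run(false));
    }
}
```

In this model the NM's message appears in the record only when the container 
status is delivered first.  When the task's failure report wins the race, the 
record contains only the task's diagnostic, which matches the symptom 
described in this issue.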

