Jason Lowe created MAPREDUCE-4955:
-------------------------------------
Summary: NM container diagnostics for excess resource usage can be
lost if task fails while being killed
Key: MAPREDUCE-4955
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4955
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 0.23.5, 2.0.3-alpha
Reporter: Jason Lowe
When a nodemanager kills a container for exceeding its resource limits, it
attaches a diagnostics message to the container status explaining why the
container was killed. However, this message can be lost if the task fails while
shutting down after the SIGTERM (e.g., it loses DFS leases because the
filesystem was closed) and notifies the AM via the task umbilical *before* the
AM receives the NM's container status message via the RM heartbeat.

In that case the task attempt fails with the task's own failure diagnostic. The
NM's diagnostics arrive too late, are not written to the history file, and are
lost, leaving the user to wonder why the task really failed. If the AM instead
receives the container status via the RM heartbeat before the task fails during
shutdown, the diagnostics are written to the history file and the user can see
why the task failed.
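
The ordering dependence described above can be illustrated with a small,
self-contained model. This is only a sketch under stated assumptions: the class
AttemptModel, its method names, and the diagnostic strings are hypothetical and
do not come from the Hadoop code base. It assumes that whichever terminal event
reaches the AM first finalizes the attempt's history record, and that anything
arriving afterwards is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of one task attempt inside the AM.
// None of these names are taken from the actual MapReduce AM classes.
final class AttemptModel {
    private final List<String> diagnostics = new ArrayList<>();
    private String historyRecord = null; // non-null once the attempt is finalized

    // Failure reported by the task itself over the task umbilical.
    void onTaskFailedViaUmbilical(String taskDiagnostic) {
        if (historyRecord != null) {
            return; // attempt already finalized, nothing more is recorded
        }
        diagnostics.add(taskDiagnostic);
        finalizeAttempt();
    }

    // Container status from the NM, relayed to the AM by the RM heartbeat.
    void onContainerStatusViaHeartbeat(String nmDiagnostic) {
        if (historyRecord != null) {
            return; // too late: the NM's explanation is silently dropped
        }
        diagnostics.add(nmDiagnostic);
        finalizeAttempt();
    }

    // Assumption: the history record is written exactly once.
    private void finalizeAttempt() {
        historyRecord = String.join("; ", diagnostics);
    }

    String historyRecord() {
        return historyRecord;
    }

    public static void main(String[] args) {
        // Illustrative messages only, not real NM or task output.
        String nmMsg = "Container killed by NM: over resource limits";
        String taskMsg = "java.io.IOException: Filesystem closed";

        // Problem ordering: the umbilical report beats the heartbeat.
        AttemptModel lost = new AttemptModel();
        lost.onTaskFailedViaUmbilical(taskMsg);
        lost.onContainerStatusViaHeartbeat(nmMsg);
        System.out.println("umbilical first: " + lost.historyRecord());

        // Benign ordering: the heartbeat beats the umbilical report.
        AttemptModel kept = new AttemptModel();
        kept.onContainerStatusViaHeartbeat(nmMsg);
        kept.onTaskFailedViaUmbilical(taskMsg);
        System.out.println("heartbeat first: " + kept.historyRecord());
    }
}
```

In this model only the second ordering leaves the NM's explanation in the
history record, which mirrors the behavior reported here. The model does not
suggest how the real AM should be changed.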