Jason Lowe created TEZ-3462:
-------------------------------

             Summary: Task attempt failure during container shutdown loses useful container diagnostics
                 Key: TEZ-3462
                 URL: https://issues.apache.org/jira/browse/TEZ-3462
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.1
            Reporter: Jason Lowe


When a NodeManager (NM) kills a task attempt due to excessive memory usage, it
sends a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic
message with the container completion event to the RM, and that message
eventually reaches the AM on a subsequent heartbeat.

However, if the JVM shutdown processing causes an error in the task (e.g., the
filesystem being closed by a shutdown hook), then the task attempt can report a
failure before the useful NM diagnostic reaches the AM.  The AM then records
some other error as the task failure reason.  By the time the container
completion status arrives, the AM no longer associates its diagnostic with the
task attempt, and the useful information is lost.
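
The ordering problem above can be illustrated with a minimal Java sketch.  This
is not Tez code: AttemptRecord, onTaskFailureReported, and onContainerCompleted
are hypothetical names, and the diagnostic strings are made up for
illustration.  It only shows one possible direction, assuming the AM could
still append late container diagnostics to an attempt that has already been
marked failed.

```java
import java.util.ArrayList;
import java.util.List;

public class DiagnosticsRaceSketch {

    /** Hypothetical stand-in for the AM's record of one task attempt. */
    static class AttemptRecord {
        private final List<String> diagnostics = new ArrayList<>();
        private boolean failed = false;

        // The task-side failure report wins the race and arrives first.
        void onTaskFailureReported(String taskError) {
            failed = true;
            diagnostics.add("Task error: " + taskError);
        }

        // The container completion status arrives on a later heartbeat.
        // Rather than ignoring it because the attempt is already failed,
        // this sketch appends the NM diagnostic to the attempt's record.
        void onContainerCompleted(String nmDiagnostics) {
            if (nmDiagnostics != null && !nmDiagnostics.isEmpty()) {
                diagnostics.add("Container diagnostics: " + nmDiagnostics);
            }
        }

        boolean isFailed() {
            return failed;
        }

        String getDiagnostics() {
            return String.join("\n", diagnostics);
        }
    }

    public static void main(String[] args) {
        AttemptRecord attempt = new AttemptRecord();
        // Illustrative messages only; not copied from real logs.
        attempt.onTaskFailureReported("error raised during JVM shutdown");
        attempt.onContainerCompleted("container killed for exceeding memory limits");
        System.out.println(attempt.getDiagnostics());
    }
}
```

With this ordering, both the shutdown-induced error and the later NM message
end up in the attempt's diagnostics, so the memory-limit kill stays visible
even though it was not the first failure reported.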



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
