Siddharth Seth created TEZ-3028:
-----------------------------------

             Summary: Improvements to error handling
                 Key: TEZ-3028
                 URL: https://issues.apache.org/jira/browse/TEZ-3028
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Siddharth Seth


There are several places where exceptions can reach the Dispatcher, which can 
cause a restart. Some of these may be valid and need to be evaluated.
For example, TaskCommunicatorManager tracks known containers etc. In case of an 
error, it throws an unchecked exception, which I believe will reach the 
dispatcher directly. (Something like this happening would indicate a bug in the 
framework.) Should this trigger a restart of the AM, or a shutdown with an 
internal error?
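
To make the second option concrete, here is a minimal sketch of a guard that catches an unchecked exception from a handler and reports it as an internal error instead of letting it propagate to the dispatcher thread. The names GuardedHandlerSketch, InternalError, and guard are hypothetical and are not Tez classes; the real wiring inside the AM would differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; none of these types are Tez classes.
final class GuardedHandlerSketch {

    /** Stand-in for an internal-error notification sent to the AM. */
    static final class InternalError {
        final String source;
        final Throwable cause;

        InternalError(String source, Throwable cause) {
            this.source = source;
            this.cause = cause;
        }
    }

    /**
     * Wraps a handler so that an unchecked exception is handed to errorSink
     * instead of propagating to the caller (the dispatcher, in this scenario).
     */
    static <E> Consumer<E> guard(String source,
                                 Consumer<E> handler,
                                 Consumer<InternalError> errorSink) {
        return event -> {
            try {
                handler.accept(event);
            } catch (RuntimeException e) {
                errorSink.accept(new InternalError(source, e));
            }
        };
    }

    public static void main(String[] args) {
        List<InternalError> reported = new ArrayList<>();

        // Assumed behavior: the handler rejects containers it does not know.
        Consumer<String> handler = containerId -> {
            if (!containerId.startsWith("container_")) {
                throw new IllegalStateException("Unknown container: " + containerId);
            }
        };

        Consumer<String> guarded = guard("TaskCommunicatorManager", handler, reported::add);
        guarded.accept("container_1"); // handled normally
        guarded.accept("bogus");       // exception is captured, not rethrown
        System.out.println(reported.size() + " internal error(s) reported");
    }
}
```

Whether the sink should then restart the AM or shut it down is exactly the open question; the guard only ensures the decision is made deliberately rather than by whatever the dispatcher does with an uncaught exception.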

The TaskSchedulerManager handles exceptions while processing events and 
dispatches a generic INTERNAL_ERROR to the DAGAppMaster. This can be augmented 
with the reason for the error so that diagnostics are displayed correctly (or 
at least posted to the history service).
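
As a rough illustration of what "augmented with the reason" could look like, the sketch below builds an error event that carries a diagnostics string derived from the caught exception. DiagnosticErrorSketch, InternalErrorEvent, and fromException are hypothetical names used only for illustration; they do not reflect the actual Tez event classes.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch; these are not the Tez event classes.
final class DiagnosticErrorSketch {

    /** An internal-error event that also records why the error happened. */
    static final class InternalErrorEvent {
        private final String source;
        private final String diagnostics;

        InternalErrorEvent(String source, String diagnostics) {
            this.source = source;
            this.diagnostics = diagnostics;
        }

        String getSource() {
            return source;
        }

        String getDiagnostics() {
            return diagnostics;
        }
    }

    /** Builds an event whose diagnostics include the component and stack trace. */
    static InternalErrorEvent fromException(String source, Throwable t) {
        StringWriter trace = new StringWriter();
        PrintWriter writer = new PrintWriter(trace);
        t.printStackTrace(writer);
        writer.flush();
        String diagnostics = "Error in " + source + ": " + trace;
        return new InternalErrorEvent(source, diagnostics);
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("scheduler failed to allocate");
        } catch (RuntimeException e) {
            InternalErrorEvent event = fromException("TaskSchedulerManager", e);
            // The receiver could show this text or post it to a history service.
            System.out.println(event.getDiagnostics());
        }
    }
}
```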



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
