Siddharth Seth created TEZ-3028:
-----------------------------------
Summary: Improvements to error handling
Key: TEZ-3028
URL: https://issues.apache.org/jira/browse/TEZ-3028
Project: Apache Tez
Issue Type: Bug
Reporter: Siddharth Seth
There are several places where exceptions can reach the Dispatcher, which can
cause a restart of the AM. Some of these may be valid and need to be evaluated.
For example, TaskCommunicatorManager tracks known containers, among other
state. On an error, it throws an unchecked exception, which I believe will
reach the dispatcher directly. (Something like this happening would indicate a
bug in the framework.) Should this trigger a restart of the AM, or a shutdown
with an internal error?
The TaskSchedulerManager handles exceptions while processing events and
dispatches a generic INTERNAL_ERROR to the DAGAppMaster. This can be augmented
with the reason for the error so that diagnostics are displayed correctly (or
at least posted to the history service).
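
A rough sketch of the idea above, not the actual Tez code: the handler catches the unchecked exception and forwards an error event that carries the reason, instead of a bare generic error. InternalErrorEvent, handleSafely and the errorSink callback are hypothetical names made up for illustration; they stand in for whatever event type and dispatch path the AM would really use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ErrorReportingSketch {

    /** Hypothetical event; stands in for an internal-error event sent to the AM. */
    static final class InternalErrorEvent {
        final String source;       // component that failed
        final Throwable cause;     // original exception, kept for logging
        final String diagnostics;  // human-readable reason for display / history

        InternalErrorEvent(String source, Throwable cause) {
            this.source = source;
            this.cause = cause;
            this.diagnostics = source + " failed: "
                + cause.getClass().getSimpleName() + ": " + cause.getMessage();
        }
    }

    /**
     * Runs an event handler. Any unchecked exception is converted into an
     * error event with diagnostics rather than propagating to the caller
     * (here, the caller plays the role of the dispatcher).
     */
    static void handleSafely(String source, Runnable handler,
                             Consumer<InternalErrorEvent> errorSink) {
        try {
            handler.run();
        } catch (RuntimeException e) {
            errorSink.accept(new InternalErrorEvent(source, e));
        }
    }

    public static void main(String[] args) {
        // The list stands in for the component that receives error events.
        List<InternalErrorEvent> received = new ArrayList<>();

        handleSafely("TaskSchedulerManager",
            () -> { throw new IllegalStateException("unknown container"); },
            received::add);

        // The receiver now has the reason available for diagnostics.
        System.out.println(received.get(0).diagnostics);
    }
}
```

Whether the receiver should then restart the AM or shut down with an internal error is the open question in this issue; the sketch only shows how the reason could travel with the event.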
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)