[jira] [Comment Edited] (TEZ-2669) Propagation of errors from plugins to the AM for error reporting

Jeff Zhang (JIRA) Thu, 07 Jan 2016 01:38:59 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087106#comment-15087106
 ]


Jeff Zhang edited comment on TEZ-2669 at 1/7/16 9:38 AM:
---------------------------------------------------------

* Code duplication for handling InternalError and plugin error. Is the 
InternalError still necessary ? It only happens in TaskSchedulerManager, we may 
replace it as TASK_SCHEDULER_SERVICE_FATAL_ERROR
{code}
    case TASK_COMMUNICATOR_SERVICE_FATAL_ERROR:
    case CONTAINER_LAUNCHER_SERVICE_FATAL_ERROR:
    case TASK_SCHEDULER_SERVICE_FATAL_ERROR:
      // A fatal error from the pluggable services. The AM cannot continue 
operation, and should
      // be shutdown. The AM should not be restarted for recovery.
      DAGAppMasterEventUserServiceFatalError usfe = 
(DAGAppMasterEventUserServiceFatalError) event;
      Throwable error = usfe.getError();
      errDiagnostics = "Service Error: " + usfe.getDiagnosticInfo()
          + ", eventType=" + event.getType()
          + ", exception=" + ExceptionUtils.getStackTrace(usfe.getError());
      LOG.error(errDiagnostics, error);
      addDiagnostic(errDiagnostics);
      state = DAGAppMasterState.ERROR;
      if (currentDAG != null) {
        _updateLoggers(currentDAG, "_post");
        LOG.info("Service error: " + event.getType() + ". Aborting dag" + 
currentDAG.getID());
        // Inform the current DAG about the error
        sendEvent(new DAGEventInternalError(currentDAG.getID(), 
errDiagnostics));
      } else {
        LOG.info("Service error: " + event.getType() + ". AppMaster will exit 
as no dag is active");
        // This could be problematic if the scheduler generated the error,
        // since un-registration may not be possible.
        // For now - try setting this flag, but call the shutdownHandler 
irrespective of
        // how the flag is handled by user code.
        try {
          this.taskSchedulerManager.setShouldUnregisterFlag();
        } catch (Exception e) {
          // Ignore exception for now
          LOG.error("Error when trying to set unregister flag for 
TaskScheduler", e);
        } finally {
          shutdownHandler.shutdown();
        }
      }
      break;
    case INTERNAL_ERROR:
      state = DAGAppMasterState.ERROR;
      addDiagnostic("DAGAppMaster Internal Error occurred");
      if(currentDAG != null) {
        _updateLoggers(currentDAG, "_post");
        // notify dag to finish which will send the DAG_FINISHED event
        LOG.info("Internal Error. Notifying dags to finish.");
        sendEvent(new DAGEventInternalError(currentDAG.getID(), "Internal 
Error"));
      } else {
        LOG.info("Internal Error. Finishing directly as no dag is active.");
        this.taskSchedulerManager.setShouldUnregisterFlag();
        shutdownHandler.shutdown();
      }
{code}

* What's the purpose of 
TaskCommunicatorWrapper/ContainerLauncherWrapper/TaskSchedulerWrapper ? Add 
exception on methods ? We can do it on 
TaskCommunicator/ContainerLauncher/TaskScheduler directly since the plugin api 
is not stabilized I guess. Or maybe you don't want to make impact on downstream 
project (hive) ?




was (Author: zjffdu):
* Code duplicated for handling InternalError and plugin error. Is the 
InternalError still necessary ? It only happens in TaskSchedulerManager, we may 
replace it as TASK_SCHEDULER_SERVICE_FATAL_ERROR
{code}
    case TASK_COMMUNICATOR_SERVICE_FATAL_ERROR:
    case CONTAINER_LAUNCHER_SERVICE_FATAL_ERROR:
    case TASK_SCHEDULER_SERVICE_FATAL_ERROR:
      // A fatal error from the pluggable services. The AM cannot continue 
operation, and should
      // be shutdown. The AM should not be restarted for recovery.
      DAGAppMasterEventUserServiceFatalError usfe = 
(DAGAppMasterEventUserServiceFatalError) event;
      Throwable error = usfe.getError();
      errDiagnostics = "Service Error: " + usfe.getDiagnosticInfo()
          + ", eventType=" + event.getType()
          + ", exception=" + ExceptionUtils.getStackTrace(usfe.getError());
      LOG.error(errDiagnostics, error);
      addDiagnostic(errDiagnostics);
      state = DAGAppMasterState.ERROR;
      if (currentDAG != null) {
        _updateLoggers(currentDAG, "_post");
        LOG.info("Service error: " + event.getType() + ". Aborting dag" + 
currentDAG.getID());
        // Inform the current DAG about the error
        sendEvent(new DAGEventInternalError(currentDAG.getID(), 
errDiagnostics));
      } else {
        LOG.info("Service error: " + event.getType() + ". AppMaster will exit 
as no dag is active");
        // This could be problematic if the scheduler generated the error,
        // since un-registration may not be possible.
        // For now - try setting this flag, but call the shutdownHandler 
irrespective of
        // how the flag is handled by user code.
        try {
          this.taskSchedulerManager.setShouldUnregisterFlag();
        } catch (Exception e) {
          // Ignore exception for now
          LOG.error("Error when trying to set unregister flag for 
TaskScheduler", e);
        } finally {
          shutdownHandler.shutdown();
        }
      }
      break;
    case INTERNAL_ERROR:
      state = DAGAppMasterState.ERROR;
      addDiagnostic("DAGAppMaster Internal Error occurred");
      if(currentDAG != null) {
        _updateLoggers(currentDAG, "_post");
        // notify dag to finish which will send the DAG_FINISHED event
        LOG.info("Internal Error. Notifying dags to finish.");
        sendEvent(new DAGEventInternalError(currentDAG.getID(), "Internal 
Error"));
      } else {
        LOG.info("Internal Error. Finishing directly as no dag is active.");
        this.taskSchedulerManager.setShouldUnregisterFlag();
        shutdownHandler.shutdown();
      }
{code}

* What's the purpose of 
TaskCommunicatorWrapper/ContainerLauncherWrapper/TaskSchedulerWrapper ? Add 
exception on methods ? We can do it on 
TaskCommunicator/ContainerLauncher/TaskScheduler directly since the plugin api 
is not stabilized I guess. Or maybe you don't want to make impact on downstream 
project (hive) ?



> Propagation of errors from plugins to the AM for error reporting
> ----------------------------------------------------------------
>
>                 Key: TEZ-2669
>                 URL: https://issues.apache.org/jira/browse/TEZ-2669
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.8.0-alpha
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: TEZ-2669.1.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TEZ-2669) Propagation of errors from plugins to the AM for error reporting

Reply via email to