[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510319#comment-14510319
 ] 

Hitesh Shah commented on TEZ-2303:
----------------------------------

The DAG is being sent the recover event before all services are started. The 
DAG will then start generating events (both to the dispatcher and to 
history/recovery, etc.). If an error occurs, the shutdownHandler is invoked, 
and it will hit issues because the services have not been started yet.
This should unregister with the RM under normal circumstances. Maybe file a 
separate jira to handle the diagnostics in the following section: 

{code}
    } catch (IOException e) {
      LOG.error("Error occurred when trying to recover data from previous attempt."
          + " Shutting down AM", e);
      this.state = DAGAppMasterState.ERROR;
      this.taskSchedulerEventHandler.setShouldUnregisterFlag();
      shutdownHandler.shutdown();
      return;
    }
{code}

Is there a way to simply not accept connections from clients until the DAG 
is recovered? Not starting that one service also has problems, as I believe 
the YarnSchedulerService depends on it for the host:port info. 
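
One possible shape for this, purely as a sketch and not taken from the Tez code: leave the client-facing service started, so its host:port is still available, but have each client call go through a gate that rejects requests until recovery has finished. RecoveryGate and its method names are hypothetical and are not existing Tez classes or APIs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical gate (not a Tez class): the client service stays started,
 * but calls are rejected until recovery has been marked complete.
 */
final class RecoveryGate {
    // false while the previous attempt's data is still being recovered
    private final AtomicBoolean recovered = new AtomicBoolean(false);

    /** Called once, after the DAG recovery has finished. */
    void markRecovered() {
        recovered.set(true);
    }

    /** Returns true once client calls are allowed through. */
    boolean isOpen() {
        return recovered.get();
    }

    /** Invoked at the start of each client call; rejects it during recovery. */
    void checkOpen() {
        if (!recovered.get()) {
            throw new IllegalStateException("AM is still recovering; retry later");
        }
    }
}
```

With something like this, the recovery code would call markRecovered() once the DAG is back, and clients that connect earlier get an explicit "retry later" error rather than racing with recovery.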


> ConcurrentModificationException while processing recovery
> ---------------------------------------------------------
>
>                 Key: TEZ-2303
>                 URL: https://issues.apache.org/jira/browse/TEZ-2303
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
