[ https://issues.apache.org/jira/browse/YARN-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809565#comment-13809565 ]
Bikas Saha commented on YARN-1343:
----------------------------------

It looks like in the reconnect with different capability case we will end up sending 2 NODE_USABLE events for the same node.
{code}
        }
        rmNode.context.getRMNodes().put(newNode.getNodeID(), newNode);
        rmNode.context.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(newNode.getNodeID(), RMNodeEventType.STARTED)); // <=== First instance when this triggers the ADD_NODE_Transition
      }
      rmNode.context.getDispatcher().getEventHandler().handle(
          new NodesListManagerEvent(
              NodesListManagerEventType.NODE_USABLE, rmNode)); // <=== Second instance
{code}

So we could probably move the second instance to the first if-stmt where it also sends the NodeAddedSchedulerEvent. That would handle the case of the same node coming back, while the STARTED event in the else stmt will cover the case of a different node with the same node name coming back (same as a new node being added).
{code}
      if (rmNode.getTotalCapability().equals(newNode.getTotalCapability())
          && rmNode.getHttpPort() == newNode.getHttpPort()) {
        // Reset heartbeat ID since node just restarted.
        rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
        if (rmNode.getState() != NodeState.UNHEALTHY) {
          // Only add new node if old state is not UNHEALTHY
          rmNode.context.getDispatcher().getEventHandler().handle(
              new NodeAddedSchedulerEvent(rmNode));
        }
      }
{code}

I modified the patch testcase to try out reconnect with different capability and the above issue showed up.
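
To make the double event easier to see, here is a minimal, self-contained sketch of the two control flows. It is a toy model, not the real YARN code: `ReconnectSketch`, `RecordingDispatcher`, and the string event names are hypothetical stand-ins, events are handled synchronously, and the assumption that STARTED leads to its own NODE_USABLE is taken from the comment above. It also assumes the moved NODE_USABLE goes into the outer same-capability branch rather than inside the UNHEALTHY check.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reconnect transition; none of these types are real YARN classes.
class ReconnectSketch {

    // Hypothetical stand-in for the dispatcher: it only records events in order.
    static final class RecordingDispatcher {
        final List<String> events = new ArrayList<>();

        void handle(String event) {
            events.add(event);
            // Assumption from the comment: STARTED triggers the add-node
            // transition, which reports NODE_USABLE by itself.
            if (event.equals("STARTED")) {
                events.add("NODE_USABLE");
            }
        }
    }

    // Flow as described for the patch: NODE_USABLE is sent after the if/else.
    static List<String> reconnectAsPatched(boolean sameNode, boolean unhealthy) {
        RecordingDispatcher d = new RecordingDispatcher();
        if (sameNode) {
            if (!unhealthy) {
                d.handle("NODE_ADDED_TO_SCHEDULER");
            }
        } else {
            d.handle("STARTED");
        }
        d.handle("NODE_USABLE"); // second instance, sent on both paths
        return d.events;
    }

    // Suggested flow: NODE_USABLE is sent only in the same-node branch.
    static List<String> reconnectAsSuggested(boolean sameNode, boolean unhealthy) {
        RecordingDispatcher d = new RecordingDispatcher();
        if (sameNode) {
            if (!unhealthy) {
                d.handle("NODE_ADDED_TO_SCHEDULER");
            }
            d.handle("NODE_USABLE"); // placement inside the outer if is an assumption
        } else {
            d.handle("STARTED"); // NODE_USABLE follows from the add-node transition
        }
        return d.events;
    }

    // Counts how many NODE_USABLE events were recorded.
    static long countUsable(List<String> events) {
        return events.stream().filter("NODE_USABLE"::equals).count();
    }

    public static void main(String[] args) {
        System.out.println("patched, different capability:   " + reconnectAsPatched(false, false));
        System.out.println("suggested, different capability: " + reconnectAsSuggested(false, false));
        System.out.println("suggested, same node:            " + reconnectAsSuggested(true, false));
    }
}
```

In this model the patched flow records NODE_USABLE twice when the capability differs, while the suggested flow records it once on either path.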

> NodeManagers additions/restarts are not reported as node updates in
> AllocateResponse responses to AMs
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1343
>                 URL: https://issues.apache.org/jira/browse/YARN-1343
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: YARN-1343.patch, YARN-1343.patch
>
>
> If a NodeManager joins the cluster or gets restarted, running AMs never
> receive the node update indicating the Node is running.

--
This message was sent by Atlassian JIRA
(v6.1#6144)