[ https://issues.apache.org/jira/browse/YARN-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809565#comment-13809565 ]

Bikas Saha commented on YARN-1343:
----------------------------------

It looks like, in the reconnect-with-different-capability case, we will end up 
sending two NODE_USABLE events for the same node.
{code}
        }
        rmNode.context.getRMNodes().put(newNode.getNodeID(), newNode);
        // <=== First instance, when this triggers the ADD_NODE_Transition
        rmNode.context.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(newNode.getNodeID(), RMNodeEventType.STARTED));
      }
      // <=== Second instance
      rmNode.context.getDispatcher().getEventHandler().handle(
          new NodesListManagerEvent(
              NodesListManagerEventType.NODE_USABLE, rmNode));
{code}

So we could probably move the second instance into the first if statement, 
where the NodeAddedSchedulerEvent is also sent. That would handle the case of 
the same node coming back, while the STARTED event in the else branch will 
cover the case of a different node with the same node name coming back (the 
same as a new node being added).
{code}
if (rmNode.getTotalCapability().equals(newNode.getTotalCapability())
    && rmNode.getHttpPort() == newNode.getHttpPort()) {
  // Reset heartbeat ID since node just restarted.
  rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
  if (rmNode.getState() != NodeState.UNHEALTHY) {
    // Only add new node if old state is not UNHEALTHY
    rmNode.context.getDispatcher().getEventHandler().handle(
        new NodeAddedSchedulerEvent(rmNode));
  }
}
{code}

I modified the patch testcase to try out reconnect with different capability 
and the above issue showed up.
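
To make the double-send concrete, here is a rough, self-contained sketch of the 
two control flows. It does not use the real RMNodeImpl, dispatcher, or event 
classes: ReconnectSketch, its Event enum, and all method names are hypothetical 
stand-ins. It assumes, as noted above, that handling STARTED runs the add-node 
transition and that this transition itself emits NODE_USABLE. Putting 
NODE_USABLE inside the UNHEALTHY check (next to the scheduler event) is also an 
assumption; it could equally sit just inside the outer if.

```java
import java.util.ArrayList;
import java.util.List;

public class ReconnectSketch {
  /** Hypothetical stand-ins for the events discussed above. */
  public enum Event { NODE_ADDED_TO_SCHEDULER, STARTED, NODE_USABLE }

  /**
   * Models the dispatcher. Assumption: handling STARTED runs the add-node
   * transition, which emits its own NODE_USABLE.
   */
  public static List<Event> dispatch(List<Event> sent) {
    List<Event> seen = new ArrayList<>();
    for (Event e : sent) {
      seen.add(e);
      if (e == Event.STARTED) {
        seen.add(Event.NODE_USABLE);
      }
    }
    return seen;
  }

  /** Current flow: NODE_USABLE is sent unconditionally after the if/else. */
  public static List<Event> reconnectCurrent(boolean sameNode, boolean unhealthy) {
    List<Event> sent = new ArrayList<>();
    if (sameNode) {
      if (!unhealthy) {
        sent.add(Event.NODE_ADDED_TO_SCHEDULER);
      }
    } else {
      sent.add(Event.STARTED); // first NODE_USABLE comes from here
    }
    sent.add(Event.NODE_USABLE); // second NODE_USABLE in the else case
    return dispatch(sent);
  }

  /** Proposed flow: NODE_USABLE moves next to the scheduler event. */
  public static List<Event> reconnectProposed(boolean sameNode, boolean unhealthy) {
    List<Event> sent = new ArrayList<>();
    if (sameNode) {
      if (!unhealthy) {
        sent.add(Event.NODE_ADDED_TO_SCHEDULER);
        sent.add(Event.NODE_USABLE);
      }
    } else {
      sent.add(Event.STARTED); // the add-node transition reports NODE_USABLE
    }
    return dispatch(sent);
  }

  /** Counts how many NODE_USABLE events were observed. */
  public static long usableCount(List<Event> events) {
    return events.stream().filter(e -> e == Event.NODE_USABLE).count();
  }

  public static void main(String[] args) {
    System.out.println("current, different capability:  "
        + usableCount(reconnectCurrent(false, false)));
    System.out.println("proposed, different capability: "
        + usableCount(reconnectProposed(false, false)));
    System.out.println("proposed, same node:            "
        + usableCount(reconnectProposed(true, false)));
  }
}
```

Under these assumptions the current flow reports NODE_USABLE twice for a 
reconnect with different capability, while the proposed flow reports it once in 
both the different-capability and same-node cases. Note that with this 
placement a reconnecting UNHEALTHY node would no longer get a NODE_USABLE 
event, which may or may not be what we want.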

> NodeManagers additions/restarts are not reported as node updates in 
> AllocateResponse responses to AMs
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1343
>                 URL: https://issues.apache.org/jira/browse/YARN-1343
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: YARN-1343.patch, YARN-1343.patch
>
>
> If a NodeManager joins the cluster or gets restarted, running AMs never 
> receive the node update indicating the Node is running.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
