[ 
https://issues.apache.org/jira/browse/YARN-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784669#comment-13784669
 ] 

Sandy Ryza commented on YARN-1265:
----------------------------------

The attached patch removes the guard in CapacityScheduler.removeNode against 
nodes that are missing from the nodes map.  With the guard removed and none of 
the other changes applied, TestResourceTrackerService.testReconnect fails.  It 
also fails without the changes when the Fair Scheduler is set as the default 
scheduler.  With the changes, it passes.

> Fair Scheduler chokes on unhealthy node reconnect
> -------------------------------------------------
>
>                 Key: YARN-1265
>                 URL: https://issues.apache.org/jira/browse/YARN-1265
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.1.1-beta
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>         Attachments: YARN-1265.patch
>
>
> Only nodes in the RUNNING state are tracked by schedulers.  When a node 
> reconnects, RMNodeImpl.ReconnectNodeTransition tries to remove it from the 
> scheduler, even if it's not in the RUNNING state.  The FairScheduler doesn't 
> guard against removal of a node it isn't tracking.
> I think the best way to fix this is to check whether a node is RUNNING 
> before telling the scheduler to remove it.
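
To illustrate the check proposed in the description, here is a minimal, 
self-contained sketch.  It is not the attached patch and does not use the real 
YARN classes: ReconnectSketch, ToyScheduler, handleReconnect and the 
three-value NodeState enum are hypothetical stand-ins, and skipping the re-add 
for nodes that are not RUNNING is an assumption made to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT the actual YARN classes.
class ReconnectSketch {
    enum NodeState { NEW, RUNNING, UNHEALTHY }

    // Toy scheduler that tracks only RUNNING nodes and, like a scheduler with
    // no guard, fails when asked to remove a node it does not know about.
    static class ToyScheduler {
        private final Map<String, Integer> nodes = new HashMap<>();

        void addNode(String nodeId, int memoryMb) {
            nodes.put(nodeId, memoryMb);
        }

        void removeNode(String nodeId) {
            if (nodes.remove(nodeId) == null) {
                throw new IllegalStateException("Node not tracked: " + nodeId);
            }
        }

        boolean isTracked(String nodeId) {
            return nodes.containsKey(nodeId);
        }
    }

    // Reconnect handling with the proposed check: only a RUNNING node is known
    // to the scheduler, so only a RUNNING node is removed and re-added.
    static void handleReconnect(ToyScheduler scheduler, String nodeId,
                                NodeState state, int memoryMb) {
        if (state == NodeState.RUNNING) {
            scheduler.removeNode(nodeId);
            scheduler.addNode(nodeId, memoryMb);
        }
        // Otherwise the scheduler never tracked this node; nothing to remove.
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addNode("node1", 4096);

        // A RUNNING node reconnects with new capacity: removed, then re-added.
        handleReconnect(scheduler, "node1", NodeState.RUNNING, 8192);

        // An UNHEALTHY node reconnects: the scheduler is left untouched.
        handleReconnect(scheduler, "node2", NodeState.UNHEALTHY, 4096);

        System.out.println("node1 tracked: " + scheduler.isTracked("node1"));
        System.out.println("node2 tracked: " + scheduler.isTracked("node2"));
    }
}
```

Without the state check in handleReconnect, the second call would reach 
removeNode for a node the scheduler never tracked and throw, which is the kind 
of failure the description attributes to the unguarded scheduler.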



--
This message was sent by Atlassian JIRA
(v6.1#6144)
