Sandy Ryza created YARN-1265:
--------------------------------

             Summary: Fair Scheduler chokes on unhealthy node reconnect
                 Key: YARN-1265
                 URL: https://issues.apache.org/jira/browse/YARN-1265
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager, scheduler
    Affects Versions: 2.1.1-beta
            Reporter: Sandy Ryza
            Assignee: Sandy Ryza

Schedulers track only nodes in the RUNNING state. When a node reconnects, RMNodeImpl.ReconnectNodeTransition tells the scheduler to remove it even if it is not in the RUNNING state, for example when it is unhealthy. The FairScheduler does not guard against being asked to remove a node it is not tracking. I think the best fix is to check whether the node is RUNNING before telling the scheduler to remove it.
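
A rough sketch of the proposed check, in simplified form. This is not the actual YARN code: `ReconnectGuardSketch`, `StrictScheduler`, `NodeState`, and `removeOnReconnect` are hypothetical stand-ins, and the real fix would live in the reconnect transition in RMNodeImpl. The sketch assumes a scheduler that throws when asked to remove a node it never tracked.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model for illustration; not the actual YARN classes.
class ReconnectGuardSketch {

    // Reduced set of node states; the real RM node state machine has more.
    enum NodeState { NEW, RUNNING, UNHEALTHY }

    // Stand-in for a scheduler that tracks only RUNNING nodes and does not
    // tolerate removal of a node it is not tracking.
    static class StrictScheduler {
        private final Set<String> trackedNodes = new HashSet<>();

        void addNode(String nodeId) {
            trackedNodes.add(nodeId);
        }

        void removeNode(String nodeId) {
            if (!trackedNodes.remove(nodeId)) {
                throw new IllegalStateException("Unknown node: " + nodeId);
            }
        }

        boolean isTracking(String nodeId) {
            return trackedNodes.contains(nodeId);
        }
    }

    // Proposed guard: tell the scheduler to remove the node only if the node
    // was RUNNING before the reconnect. Returns true if a removal was issued.
    static boolean removeOnReconnect(StrictScheduler scheduler,
                                     String nodeId,
                                     NodeState stateBeforeReconnect) {
        if (stateBeforeReconnect != NodeState.RUNNING) {
            return false; // the scheduler never tracked this node
        }
        scheduler.removeNode(nodeId);
        return true;
    }

    public static void main(String[] args) {
        StrictScheduler scheduler = new StrictScheduler();
        scheduler.addNode("node-1"); // node-1 is RUNNING and tracked

        // A RUNNING node that reconnects is removed from the scheduler.
        System.out.println(removeOnReconnect(scheduler, "node-1", NodeState.RUNNING));

        // An UNHEALTHY node that reconnects is skipped, so the scheduler
        // is never asked to remove a node it does not know about.
        System.out.println(removeOnReconnect(scheduler, "node-2", NodeState.UNHEALTHY));
    }
}
```

Without the state check, the second call would reach `removeNode` and throw, which is the kind of failure the summary describes.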