Lost NMs fail to rejoin
-----------------------

                 Key: MAPREDUCE-3272
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3272
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.0
            Reporter: Ramya Sunil
             Fix For: 0.23.0

Lost NodeManagers (NMs) fail to rejoin the cluster.

When the NM is lost, the RM log reads:
{noformat}
INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:<host:port> Timed out after 600 secs
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing <host:port> of type EXPIRE
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Removed Node <host:port>
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: <host:port> Node Transitioned from RUNNING to LOST
{noformat}

When the NM tries to rejoin, the RM log reads:
{noformat}
INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found rebooting <host:port>
{noformat}
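
For readers unfamiliar with the sequence in the logs, here is a toy Python model of it. It is only a sketch under assumptions, not Hadoop code: ToyResourceTracker, NodeState, and every method name are hypothetical. It assumes the RM keeps a table of known nodes, drops a node after 600 seconds without a heartbeat, and answers a heartbeat from an unknown node with a reboot instruction. The re-registration step at the end shows what a successful rejoin might look like; the report says this is the part that fails, and the model makes no claim about why.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "RUNNING"
    LOST = "LOST"


class ToyResourceTracker:
    """Hypothetical stand-in for the RM's node bookkeeping; not a Hadoop API."""

    def __init__(self, expiry_secs=600):
        self.expiry_secs = expiry_secs
        self.last_heartbeat = {}  # node id -> time of last heartbeat (known nodes only)
        self.history = {}         # node id -> last recorded state, kept after removal

    def register(self, node, now):
        """Add the node to the known-node table and mark it RUNNING."""
        self.last_heartbeat[node] = now
        self.history[node] = NodeState.RUNNING

    def expire_stale(self, now):
        """Remove nodes silent for longer than expiry_secs and mark them LOST."""
        expired = [n for n, t in self.last_heartbeat.items()
                   if now - t > self.expiry_secs]
        for node in expired:
            del self.last_heartbeat[node]
            self.history[node] = NodeState.LOST
        return expired

    def heartbeat(self, node, now):
        """Return 'NORMAL' for a known node, 'REBOOT' for an unknown one."""
        if node not in self.last_heartbeat:
            return "REBOOT"  # mirrors the "Node not found rebooting" log line
        self.last_heartbeat[node] = now
        return "NORMAL"


tracker = ToyResourceTracker(expiry_secs=600)
tracker.register("host:1234", now=0)

lost = tracker.expire_stale(now=601)             # RUNNING -> LOST, node removed
reply = tracker.heartbeat("host:1234", now=700)  # node is no longer known

if reply == "REBOOT":
    # Assumed recovery path: the node registers again from scratch.
    tracker.register("host:1234", now=700)

print(lost, reply, tracker.history["host:1234"].value)
```

In this model the node returns to RUNNING only because it registers again after the reboot reply; a heartbeat alone never restores a removed node.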