[ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215306#comment-13215306 ]
Bikas Saha commented on MAPREDUCE-3353: --------------------------------------- A potential solution would be the following 1) have the scheduler interface return the set of bad nodes on which it has stopped scheduling. This keeps the decision of which node is bad in the scheduler. The scheduler is the ultimate authority on what runs on a node and should tell its clients whether about the nodes that it is not considering for scheduling. 2) 1) above could be done as another interface API or piggybacked on the scheduler.allocate() API. 3) The response could contain all the known bad nodes or deltas to the previous response. Deltas are cheaper to send but are susceptible to message loss and retransmission. Also, deltas would have to be divided into new bad nodes and new good nodes. 4) The AM might want to know the type of bad node. Say lost or unhealthy etc. The bad nodes information could be enhanced via querying the RMNode object for the actual reason/health. As an enhancement, we could add a new RMNodeMananger entity that manages all the RMNodes. The above functionality could move from the scheduler into RMNodeManager (though it would need to be in sync with the scheduler). After that, getting detailed information may not need direct access to RMNode object. Potentially, other interactions with RMNode could be forwarded through the RMNodeManager. But this would be a fairly significant refactoring thats best left to a separate future work item. > Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes > --------------------------------------------------------------------- > > Key: MAPREDUCE-3353 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster, mrv2, resourcemanager > Affects Versions: 0.23.0 > Reporter: Vinod Kumar Vavilapalli > Assignee: Bikas Saha > Priority: Critical > Fix For: 0.23.2 > > > When a node gets lost or turns faulty, AM needs to know about that event so > that it can take some action like for e.g. re-executing map tasks whose > intermediate output live on that faulty node. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira