[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215306#comment-13215306
 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

A potential solution would be the following:
1) Have the scheduler interface return the set of bad nodes on which it has 
stopped scheduling. This keeps the decision of which node is bad in the 
scheduler. The scheduler is the ultimate authority on what runs on a node and 
should tell its clients about the nodes that it is not considering for 
scheduling.
2) Item 1) could be exposed as a separate interface API or piggybacked on the 
scheduler.allocate() API.
3) The response could contain all the known bad nodes or deltas to the previous 
response. Deltas are cheaper to send but are susceptible to message loss and 
retransmission. Also, deltas would have to be divided into new bad nodes and 
new good nodes.
4) The AM might want to know the type of bad node, e.g. lost or unhealthy. 
The bad-node information could be enriched by querying the RMNode object for 
the actual reason/health.
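
To make the delta option in 3) and the reason in 4) concrete, here is a 
minimal sketch of how the per-AM delta could be computed. All names 
(BadNodeReason, BadNodeDelta, BadNodeTracker, deltaSince) are hypothetical 
and are not existing YARN classes. It assumes one tracker per AM and that 
nodes are identified by a plain string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical reason codes; the real set would come from RMNode state.
enum BadNodeReason { LOST, UNHEALTHY }

// Hypothetical delta payload: nodes that turned bad (with reason) and
// nodes that were bad in the previous response but are usable again.
final class BadNodeDelta {
    final Map<String, BadNodeReason> newlyBad;
    final Set<String> newlyGood;

    BadNodeDelta(Map<String, BadNodeReason> newlyBad, Set<String> newlyGood) {
        this.newlyBad = newlyBad;
        this.newlyGood = newlyGood;
    }
}

// Hypothetical per-AM tracker that remembers what was last reported.
final class BadNodeTracker {
    private Map<String, BadNodeReason> lastReported = new HashMap<>();

    // currentBad is the scheduler's full view of bad nodes right now.
    BadNodeDelta deltaSince(Map<String, BadNodeReason> currentBad) {
        Map<String, BadNodeReason> newlyBad = new TreeMap<>();
        for (Map.Entry<String, BadNodeReason> e : currentBad.entrySet()) {
            // Report nodes that were not bad before, or whose reason changed.
            if (e.getValue() != lastReported.get(e.getKey())) {
                newlyBad.put(e.getKey(), e.getValue());
            }
        }
        // Nodes reported bad last time that are no longer in the bad set.
        Set<String> newlyGood = new TreeSet<>(lastReported.keySet());
        newlyGood.removeAll(currentBad.keySet());

        lastReported = new HashMap<>(currentBad);
        return new BadNodeDelta(newlyBad, newlyGood);
    }
}
```

Note that the tracker advances its state on every call, so a lost response 
leaves the AM's view out of sync. That is the weakness of deltas mentioned 
in 3); sending the full set avoids it at the cost of larger responses.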

As an enhancement, we could add a new RMNodeManager entity that manages all 
the RMNodes. The above functionality could move from the scheduler into 
RMNodeManager (though it would need to be in sync with the scheduler). After 
that, getting detailed information may not need direct access to the RMNode 
object. Potentially, other interactions with RMNode could be forwarded through 
the RMNodeManager. But this would be a fairly significant refactoring that's 
best left to a separate future work item.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, the AM needs to know about that event 
> so that it can take some action, e.g. re-executing map tasks whose 
> intermediate output lives on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
