[
https://issues.apache.org/jira/browse/MAPREDUCE-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174947#comment-13174947
]
Bh V S Kamesh commented on MAPREDUCE-3360:
--
Hi Jason,
Thanks for the comments. I will incorporate them in my next patch. But
before submitting the patch, I would like to clarify one thing.
When the RM does not receive a heartbeat from an NM for the *node expiry*
interval, the RM removes the NM from its RM Nodes Map under the node *EXPIRE*
event. Before removing the NM, the corresponding cluster metrics are updated
(in this case, the *lost* node count is incremented).
If the same NM sends a heartbeat after the above operation, the RM checks
whether there is any node corresponding to this NodeId. If the RM does not
find one, it simply returns *reboot* as its heartbeat response.
Before sending its heartbeat response, the RM again updates the cluster
metrics (this time incrementing the *reboot* node count).
Is it necessary to update two different metrics for the same node's
unavailability? IMO, this shows incorrect information. I *think* we need to
update either the *lost* node count or the *reboot* node count, but not both,
in such a circumstance.
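
To make the scenario concrete, here is a minimal sketch of the sequence as I understand it. The class, field, and method names (LostNodeMetricsSketch, expireNodes, heartbeat, and so on) are hypothetical stand-ins chosen for illustration and are not the actual RM code; the sketch only models the bookkeeping described above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical model of the RM's node bookkeeping; NOT the real YARN classes.
public class LostNodeMetricsSketch {
    enum HeartbeatResponse { NORMAL, REBOOT }

    // nodeId -> timestamp (ms) of the last heartbeat seen from that NM
    private final Map<String, Long> rmNodes = new HashMap<>();
    private final long nodeExpiryIntervalMs;

    // Simplified stand-ins for the cluster metrics counters
    int lostNodes = 0;
    int rebootedNodes = 0;

    LostNodeMetricsSketch(long nodeExpiryIntervalMs) {
        this.nodeExpiryIntervalMs = nodeExpiryIntervalMs;
    }

    void register(String nodeId, long now) {
        rmNodes.put(nodeId, now);
    }

    // EXPIRE handling: drop every NM that has been silent for longer than the
    // expiry interval, and count each one as lost before removing it.
    void expireNodes(long now) {
        Iterator<Map.Entry<String, Long>> it = rmNodes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() > nodeExpiryIntervalMs) {
                lostNodes++;
                it.remove();
            }
        }
    }

    // Heartbeat handling: an unknown NodeId is told to reboot, and the
    // reboot counter is incremented as well.
    HeartbeatResponse heartbeat(String nodeId, long now) {
        if (!rmNodes.containsKey(nodeId)) {
            rebootedNodes++;
            return HeartbeatResponse.REBOOT;
        }
        rmNodes.put(nodeId, now);
        return HeartbeatResponse.NORMAL;
    }

    public static void main(String[] args) {
        LostNodeMetricsSketch rm = new LostNodeMetricsSketch(1000);
        rm.register("nm1:1234", 0);
        rm.expireNodes(5000);                                   // NM expires
        HeartbeatResponse r = rm.heartbeat("nm1:1234", 6000);   // late heartbeat
        // prints "REBOOT lost=1 rebooted=1"
        System.out.println(r + " lost=" + rm.lostNodes + " rebooted=" + rm.rebootedNodes);
    }
}
```

In this model a single unavailable NM ends up counted once under *lost* and once under *reboot*, which is the double counting I am asking about.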
Any comments?
Provide information about lost nodes in the UI.
---
Key: MAPREDUCE-3360
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3360
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv2
Affects Versions: 0.23.0
Environment: NA
Reporter: Bh V S Kamesh
Attachments: LostNodes.png, MAPREDUCE-3360-1.patch,
MAPREDUCE-3360.patch, lostNodes.png
Currently there is no information provided about *lost nodes*. Provide
this information in the UI.