[jira] [Commented] (MAPREDUCE-3360) Provide information about lost nodes in the UI.

2011-12-22 Thread Bh V S Kamesh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174947#comment-13174947
 ] 

Bh V S Kamesh commented on MAPREDUCE-3360:
--

Hi Jason,
  
Thanks for the comments. I will incorporate them in my next patch. Before 
submitting it, though, I would like to clarify one point.

When the RM does not receive a heartbeat from an NM within the *node expiry* 
interval, it removes the NM from its RM Nodes Map on the node *EXPIRE* event. 
Before removing the NM, it updates the corresponding Cluster metrics (in this 
case, incrementing the *lost* node count).

If the same NM sends a heartbeat after that, the RM checks whether it has any 
node corresponding to this NodeId. If it finds none, it simply returns 
*reboot* as its heartbeat response.
Before sending the heartbeat response, the RM updates the Cluster metrics 
again (this time, incrementing the *reboot* node count).
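
To make the sequence above concrete, here is a minimal sketch of how one NM can end up counted twice. It is not the actual ResourceManager code: the class, the method names and the string responses are all hypothetical, and the NodeId is assumed to be a plain "host:port" string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the RM's node bookkeeping; not the real YARN classes.
class LostNodeMetricsSketch {
    // Assumed shape of the "RM Nodes Map": NodeId string -> last heartbeat time.
    private final Map<String, Long> activeNodes = new HashMap<>();
    int lostCount = 0;    // stands in for the *lost* node metric
    int rebootCount = 0;  // stands in for the *reboot* node metric

    void register(String nodeId, long now) {
        activeNodes.put(nodeId, now);
    }

    // EXPIRE event: no heartbeat arrived within the node expiry interval.
    void expire(String nodeId) {
        if (activeNodes.remove(nodeId) != null) {
            lostCount++;
        }
    }

    // Heartbeat from an NM; an unknown NodeId gets a "reboot" response.
    String heartbeat(String nodeId, long now) {
        if (!activeNodes.containsKey(nodeId)) {
            rebootCount++;
            return "reboot";
        }
        activeNodes.put(nodeId, now);
        return "ok";
    }

    public static void main(String[] args) {
        LostNodeMetricsSketch rm = new LostNodeMetricsSketch();
        rm.register("host1:45454", 0L);
        rm.expire("host1:45454");                         // lost count goes up
        String response = rm.heartbeat("host1:45454", 10L); // reboot count goes up
        System.out.println(response + " lost=" + rm.lostCount
                + " reboot=" + rm.rebootCount);
    }
}
```

In this sketch a single expired NM that heartbeats again leaves both counters at 1, which is the double counting in question.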

Is it necessary to update two different metrics for the same node's 
unavailability? IMO, this shows incorrect information. I *think* we should 
update either the *lost* node count or the *reboot* node count in this 
situation, but not both.

Any comments?

 Provide information about lost nodes in the UI.
 ---

 Key: MAPREDUCE-3360
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3360
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv2
Affects Versions: 0.23.0
 Environment: NA
Reporter: Bh V S Kamesh
 Attachments: LostNodes.png, MAPREDUCE-3360-1.patch, 
 MAPREDUCE-3360.patch, lostNodes.png


 Currently there is no information provided about *lost nodes*. Provide 
 information in the UI. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-3585) RM unable to detect NMs restart

2011-12-20 Thread Bh V S Kamesh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173400#comment-13173400
 ] 

Bh V S Kamesh commented on MAPREDUCE-3585:
--

If the NMs are configured to use static ports, we can store information about 
lost NMs keyed by their NodeId, and the RM can then detect an NM's comeback.
I *think* there should also be a mechanism for the case where the NMs are 
configured to use ephemeral ports.
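
A minimal sketch of the static-port idea, and of why it breaks down with ephemeral ports, follows. The class and method names are hypothetical, and the NodeId is assumed to be the NM's host and port joined as "host:port".

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker for lost NMs; not part of the actual RM code.
class LostNodeTrackerSketch {
    // NodeIds (assumed "host:port") of NMs that the RM has marked as lost.
    private final Set<String> lostNodeIds = new HashSet<>();

    void markLost(String host, int port) {
        lostNodeIds.add(host + ":" + port);
    }

    // True if a registering NM matches a previously lost NodeId.
    // The matching entry is removed so the comeback is reported only once.
    boolean isComeback(String host, int port) {
        return lostNodeIds.remove(host + ":" + port);
    }

    public static void main(String[] args) {
        LostNodeTrackerSketch tracker = new LostNodeTrackerSketch();

        // Static port: the restarted NM reuses its port, so the NodeId matches.
        tracker.markLost("host1", 45454);
        System.out.println(tracker.isComeback("host1", 45454));

        // Ephemeral port: the restarted NM binds a new port, so the lookup misses.
        tracker.markLost("host1", 50123);
        System.out.println(tracker.isComeback("host1", 50987));
    }
}
```

With ephemeral ports the host alone cannot serve as the key either, since several NMs may run on one host, so some other stable identifier would be needed.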

 RM unable to detect NMs restart
 ---

 Key: MAPREDUCE-3585
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3585
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Reporter: Bh V S Kamesh

 Suppose multiple NMs are configured on a single host. In this case, there 
 should be a mechanism to detect an NM's comeback.
