[jira] [Commented] (MAPREDUCE-3363) The totalnodes and memorytotal fields show wrong information if the nodes are going down and coming up early(before 10min)

2012-01-11 Thread Jason Lowe (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184442#comment-13184442 ]

Jason Lowe commented on MAPREDUCE-3363:
---

It looks like the problem occurs because ephemeral ports are configured for the 
NodeManagers.  NMs are identified by host:port pairs, and when ephemeral ports 
are used we lose the ability to differentiate between a new node joining the 
cluster and a lost node rejoining the cluster.
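
To illustrate the identity problem, here is a minimal Java sketch. It is not
the actual ResourceManager code: the class NodeRegistrySketch, its methods, the
port numbers, and the memory size are all hypothetical. It only assumes, as
described above, that nodes are tracked by their host:port pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the actual ResourceManager code. */
final class NodeRegistrySketch {
    /** Registered nodes keyed by "host:port", mapped to their memory in MB. */
    private final Map<String, Integer> nodes = new LinkedHashMap<>();

    /** Registers a node; returns true if this host:port was not known before. */
    boolean register(String host, int port, int memoryMb) {
        return nodes.put(host + ":" + port, memoryMb) == null;
    }

    int totalNodes() {
        return nodes.size();
    }

    int totalMemoryMb() {
        int sum = 0;
        for (int mb : nodes.values()) {
            sum += mb;
        }
        return sum;
    }

    public static void main(String[] args) {
        NodeRegistrySketch rm = new NodeRegistrySketch();
        rm.register("host1", 45001, 8192); // first start on an ephemeral port
        rm.register("host1", 52344, 8192); // same machine restarted, new port
        // The single machine is now counted twice.
        System.out.println(rm.totalNodes() + " nodes, " + rm.totalMemoryMb() + " MB");
        // prints "2 nodes, 16384 MB"
    }
}
```

Because the key changes with every restart, nothing in the registry ties the
second registration back to the first.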

In the screenshot's scenario, the ResourceManager believes that 4 nodes are in 
the cluster, and only after the NM timeout interval (default 10min) will it 
realize 3 of the 4 nodes aren't there.  This is not much different from a 
cluster that has 4 separate NM machines where three of the NMs go down at the 
same time.  The reported cluster capacity will be overstated during the timeout 
interval because the RM has not yet noticed the lost capacity.
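
The timing can be sketched as follows. This is a simplified model under stated
assumptions, not the RM's real liveness monitor: LivenessSketch and its methods
are hypothetical names, time is passed in explicitly as milliseconds, and the
only value taken from the discussion is the 10 minute default timeout.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical liveness bookkeeping; names are illustrative, not Hadoop APIs. */
final class LivenessSketch {
    /** Expiry interval: 10 minutes, the default mentioned above. */
    static final long EXPIRY_MS = 10L * 60L * 1000L;

    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
    private final Map<String, Integer> memoryMb = new HashMap<>();

    /** Records a heartbeat (or first registration) from a node at time nowMs. */
    void heartbeat(String nodeId, int mb, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
        memoryMb.put(nodeId, mb);
    }

    /** Removes nodes silent for longer than EXPIRY_MS; returns how many were lost. */
    int expire(long nowMs) {
        int lost = 0;
        Iterator<Map.Entry<String, Long>> it = lastHeartbeatMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > EXPIRY_MS) {
                memoryMb.remove(entry.getKey());
                it.remove();
                lost++;
            }
        }
        return lost;
    }

    int totalNodes() {
        return lastHeartbeatMs.size();
    }

    int totalMemoryMb() {
        int sum = 0;
        for (int mb : memoryMb.values()) {
            sum += mb;
        }
        return sum;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        LivenessSketch rm = new LivenessSketch();
        // One machine restarted three times, each time on a new ephemeral port.
        rm.heartbeat("host1:41001", 8192, 0);
        rm.heartbeat("host1:41002", 8192, 1 * minute);
        rm.heartbeat("host1:41003", 8192, 2 * minute);
        rm.heartbeat("host1:41004", 8192, 3 * minute);
        rm.expire(5 * minute);
        System.out.println(rm.totalNodes() + " nodes at 5 min");  // 4 nodes
        rm.heartbeat("host1:41004", 8192, 13 * minute);           // only the live NM reports
        rm.expire(14 * minute);
        System.out.println(rm.totalNodes() + " nodes at 14 min"); // 1 node
    }
}
```

Until the stale entries pass the expiry check, the totals include capacity that
no longer exists.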

If ephemeral ports are not used then this problem cannot occur today because 
MAPREDUCE-3070 did not really fix the quick NM reboot scenario.  The NM reboot 
scenario only works with ephemeral ports because the RM sees it as a new NM 
joining the cluster (and a subsequent loss of an NM after the NM timeout) 
rather than a reboot of an existing NM.  If a cluster is configured without 
ephemeral ports then a restarting NM cannot rejoin the cluster until after the 
NM timeout interval has passed on the RM, and by then the node's resources will 
have been removed from the cluster before being added back in when it rejoins.

Ideally we should put in a real fix for MAPREDUCE-3070 so the RM can realize an 
existing NM trying to join the cluster is a reboot scenario instead of 
rejecting the new NM instance.  Of course, the RM would have to kill off all 
the existing containers for the NM when it rejoins.
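
A rough sketch of the difference between the two behaviours, assuming a fixed
(non-ephemeral) port. Everything here is hypothetical and for illustration
only: the class RebootAwareRegistrySketch, its methods, the Result enum, the
port 45454, and the container name are not taken from the RM code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch contrasting rejection with reboot detection. */
final class RebootAwareRegistrySketch {
    enum Result { ADDED, REJECTED, REBOOTED }

    /** Containers the RM believes are running, keyed by "host:port". */
    private final Map<String, List<String>> containersByNode = new HashMap<>();
    private final List<String> killed = new ArrayList<>();

    /** Behaviour described above: a known host:port is refused until it expires. */
    Result registerRejectingDuplicates(String nodeId) {
        if (containersByNode.containsKey(nodeId)) {
            return Result.REJECTED;
        }
        containersByNode.put(nodeId, new ArrayList<>());
        return Result.ADDED;
    }

    /** Suggested behaviour: a known host:port is treated as a reboot. */
    Result registerDetectingReboot(String nodeId) {
        List<String> old = containersByNode.put(nodeId, new ArrayList<>());
        if (old == null) {
            return Result.ADDED;
        }
        killed.addAll(old); // containers of the previous NM instance are killed off
        return Result.REBOOTED;
    }

    /** Records a container as running on an already registered node. */
    void launch(String nodeId, String containerId) {
        List<String> containers = containersByNode.get(nodeId);
        if (containers == null) {
            throw new IllegalArgumentException("unknown node " + nodeId);
        }
        containers.add(containerId);
    }

    List<String> killedContainers() {
        return killed;
    }

    int totalNodes() {
        return containersByNode.size();
    }

    public static void main(String[] args) {
        RebootAwareRegistrySketch rm = new RebootAwareRegistrySketch();
        rm.registerRejectingDuplicates("host1:45454");
        rm.launch("host1:45454", "container_1");
        System.out.println(rm.registerRejectingDuplicates("host1:45454")); // REJECTED
        System.out.println(rm.registerDetectingReboot("host1:45454"));     // REBOOTED
        System.out.println(rm.killedContainers());                         // [container_1]
    }
}
```

With reboot detection the node count stays at one and the stale containers are
cleaned up immediately instead of after the timeout.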

The issue of detecting the difference between a new NM joining and an existing 
NM rejoining when ephemeral ports are configured is being tracked in 
MAPREDUCE-3585.

 The totalnodes  and memorytotal fields show wrong information if the 
 nodes are going down and coming up early(before 10min) 
 

             Key: MAPREDUCE-3363
             URL: https://issues.apache.org/jira/browse/MAPREDUCE-3363
         Project: Hadoop Map/Reduce
      Issue Type: Bug
      Components: mrv2
Affects Versions: 0.24.0
        Reporter: Ramgopal N
        Priority: Critical
     Attachments: Applications.htm, screenshot-1.jpg


 Node details are not moved from Totalnodes to lostnodes until the expiry 
 interval (10 min by default) has passed. So if a node goes down and comes 
 back up before the expiry interval, the cluster status shows wrong values 
 for the total nodes and total cluster memory. 
 At the least, if the same node comes up again, it should not be considered 
 a new node. At no point in time should duplicate nodes be displayed in the 
 Totalnodes list.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-3363) The totalnodes and memorytotal fields show wrong information if the nodes are going down and coming up early(before 10min)

2011-12-04 Thread Devaraj K (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162607#comment-13162607 ]

Devaraj K commented on MAPREDUCE-3363:
--

This issue can cause a huge loss if multiple nodes restart at the same time: 
because the RM assumes a false (inflated) cluster capacity, many jobs will fail 
in the cluster, which is a performance hit.

 The totalnodes  and memorytotal fields show wrong information if the 
 nodes are going down and coming up early(before 10min) 
 

             Key: MAPREDUCE-3363
             URL: https://issues.apache.org/jira/browse/MAPREDUCE-3363
         Project: Hadoop Map/Reduce
      Issue Type: Bug
      Components: mrv2
Affects Versions: 0.24.0
        Reporter: Ramgopal N
        Assignee: Devaraj K
     Attachments: Applications.htm, screenshot-1.jpg


