[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908179#comment-13908179
 ] 

Hudson commented on YARN-1071:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #488 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/488/])
YARN-1071. Enabled ResourceManager to recover cluster metrics 
numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java


 ResourceManager's decommissioned and lost node count is 0 after restart
 ---

 Key: YARN-1071
 URL: https://issues.apache.org/jira/browse/YARN-1071
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Srimanth Gunturi
Assignee: Jian He
 Fix For: 2.4.0

 Attachments: YARN-1071.1.patch, YARN-1071.2.patch, YARN-1071.3.patch, 
 YARN-1071.4.patch, YARN-1071.5.patch, YARN-1071.6.patch


 I had 6 nodes in a cluster with 2 NMs stopped. Then I put a host into YARN's 
 {{yarn.resourcemanager.nodes.exclude-path}}. After running {{yarn rmadmin 
 -refreshNodes}}, RM's JMX correctly showed decommissioned node count:
 {noformat}
 NumActiveNMs : 3,
 NumDecommissionedNMs : 1,
 NumLostNMs : 2,
 NumUnhealthyNMs : 0,
 NumRebootedNMs : 0
 {noformat}
 After restarting the RM, the counts were shown as below in JMX:
 {noformat}
 NumActiveNMs : 3,
 NumDecommissionedNMs : 0,
 NumLostNMs : 0,
 NumUnhealthyNMs : 0,
 NumRebootedNMs : 0
 {noformat}
 Notice that the lost and decommissioned NM counts are both 0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908318#comment-13908318
 ] 

Hudson commented on YARN-1071:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1680 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1680/])
YARN-1071. Enabled ResourceManager to recover cluster metrics 
numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java




[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908396#comment-13908396
 ] 

Hudson commented on YARN-1071:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1705 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1705/])
YARN-1071. Enabled ResourceManager to recover cluster metrics 
numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java




[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907437#comment-13907437
 ] 

Zhijie Shen commented on YARN-1071:
---

The approach should fix the problem here. Some minor comments:

1. It's better to use System.getProperty("line.separator") to replace "\n".
{code}
+fStream.write("\n".getBytes());
{code}

2. Put the setter in HostsFileReader#refresh(2params) instead?
{code}
+ClusterMetrics.getMetrics().setDecommisionedNMs(excludeList.size());
{code}

3. Check the IP as well, as we do in NodesListManager#isValidNode? (See the sketch 
at the end of this comment.)
{code}
+  if (!context.getNodesListManager().getHostsReader().getExcludedHosts()
+.contains(hostName)) {
{code}

4. In testDecomissionedNMsMetricsOnRMRestart, would it be good to involve an NM 
that was decommissioned before the restart, and to verify that it does not corrupt 
the count after restart?

In addition to the whitelist scenario, there's another one that the approach 
may not handle:
a. host1 is in the blacklist
b. refresh nodes, count = 1
c. rm stops
d. *blacklist changes: host2 replaces host1 in the blacklist*
e. rm starts
f. count = 1; however, both host1 and host2 are actually decommissioned

Not sure whether changing the blacklist between rm stop and start will be a common 
case. Probably we don't need to deal with it now.
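
For illustration, here is a minimal, self-contained sketch of the hostname-or-IP 
membership check suggested in point 3 above. The helper name isHostDecommissioned 
and the use of java.net.InetAddress are assumptions made for this sketch; it is not 
code from the actual patch.
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Set;

public class ExcludeListCheck {

  // True if either the host name or its resolved IP address appears in the
  // excluded-hosts set, mirroring the hostname/IP handling of isValidNode.
  static boolean isHostDecommissioned(String hostName, Set<String> excludedHosts) {
    if (excludedHosts.contains(hostName)) {
      return true;
    }
    try {
      // The exclude file may list IP addresses rather than host names.
      String ip = InetAddress.getByName(hostName).getHostAddress();
      return excludedHosts.contains(ip);
    } catch (UnknownHostException e) {
      // Unresolvable host: only the name-based check above applies.
      return false;
    }
  }
}
{code}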



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907654#comment-13907654
 ] 

Jian He commented on YARN-1071:
---

Thanks Zhijie for the review!
bq. HostsFileReader#refresh(2params)
That's hadoop-common code, we should probably not touch it.
bq. Check the ip as well as we do in NodesListManager#isValidNode?
Good catch!
Fixed the other comments as well.

The patch doesn't fix the include-list scenario or the case where the exclude list 
changes between RM restarts. For that, the RM may need to persistently save the 
decommissioned-NM state.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907673#comment-13907673
 ] 

Zhijie Shen commented on YARN-1071:
---

bq. That's hadoop-common code, we should probably not touch it.

Reasonable. Then, how about close to where refresh is called? For example, in 
NodesListManager#createHostsFileReader. Or is it intentional not to set the 
counter in NodesListManager#disableHostsFileReader?



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907707#comment-13907707
 ] 

Jian He commented on YARN-1071:
---

bq. NodesListManager#disableHostsFileReader?
Right, updated the metrics inside this method as well.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907776#comment-13907776
 ] 

Hadoop QA commented on YARN-1071:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12630193/YARN-1071.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3139//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3139//console

This message is automatically generated.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907857#comment-13907857
 ] 

Zhijie Shen commented on YARN-1071:
---

I thought about another scenario:

a. host1 and host2 are in the exclude list
b. refresh nodes, count = 2
c. host1 starts again, count = 1
d. rm stops
e. rm starts
f. count = 2 after NodesListManager inits
g. count = 1 after host1 reconnects

Here, the decrease in the decommissioned count will eventually be reflected after 
the rm restarts, so this scenario should still be covered by this approach. Correct 
me if I'm wrong about the process.

Other than that, I'm generally fine with the patch, except that the temp dir 
created for the test should be deleted after test completion.
{code}
+  private final static File TEMP_DIR = new File(System.getProperty(
+      "test.build.data", "/tmp"), "decommision");
{code}
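
For example, a cleanup hook along the following lines would address this. This is 
only a sketch: the class name is illustrative, and FileUtil.fullyDelete is just one 
common way Hadoop tests remove such directories; the actual patch may do it 
differently.
{code}
import java.io.File;

import org.apache.hadoop.fs.FileUtil;
import org.junit.After;

public class TestDecommissionCleanup {

  private final static File TEMP_DIR = new File(System.getProperty(
      "test.build.data", "/tmp"), "decommision");

  @After
  public void tearDown() {
    // Remove the temp dir (and any exclude files written into it) after each test.
    FileUtil.fullyDelete(TEMP_DIR);
  }
}
{code}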





[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907871#comment-13907871
 ] 

Jian He commented on YARN-1071:
---

bq. So this scenario should still be covered with this approach
Correct.

The new patch deletes the test dir on test completion.




[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907918#comment-13907918
 ] 

Hadoop QA commented on YARN-1071:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12630228/YARN-1071.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3142//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3142//console

This message is automatically generated.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907956#comment-13907956
 ] 

Zhijie Shen commented on YARN-1071:
---

+1. The patch looks good to me. Vinod, do you want to have a look as well?



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908029#comment-13908029
 ] 

Zhijie Shen commented on YARN-1071:
---

Will commit the patch



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908055#comment-13908055
 ] 

Hudson commented on YARN-1071:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5203 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5203/])
YARN-1071. Enabled ResourceManager to recover cluster metrics 
numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java




[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905321#comment-13905321
 ] 

Hadoop QA commented on YARN-1071:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12629714/YARN-1071.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestRMNodeTransitions

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3118//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3118//console

This message is automatically generated.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905785#comment-13905785
 ] 

Hadoop QA commented on YARN-1071:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12629826/YARN-1071.2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3122//console

This message is automatically generated.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905888#comment-13905888
 ] 

Hadoop QA commented on YARN-1071:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12629834/YARN-1071.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3123//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3123//console

This message is automatically generated.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2014-02-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902075#comment-13902075
 ] 

Jian He commented on YARN-1071:
---

I found that the decommissioned nodes in the current implementation come from two 
sources: nodes missing from the include list (if the include list is not empty) and 
nodes listed in the exclude list.
Upon RM restart we can recover the decommissioned nodes derived from the exclude 
list by simply counting the hosts in that file, but we cannot know the nodes 
decommissioned via the include list unless those nodes come back to connect.
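
A minimal sketch of the exclude-list half of this idea is below. It assumes a 
HostsFileReader is (re)built from the configured include/exclude files and uses the 
setDecommisionedNMs setter discussed in the patch; treat it as an outline of the 
approach rather than the actual implementation.
{code}
import java.io.IOException;

import org.apache.hadoop.util.HostsFileReader;
import org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics;

public class DecommissionedCountRecovery {

  // On RM (re)start or refreshNodes, derive the decommissioned-NM count from
  // the exclude file alone; nodes excluded only by omission from a non-empty
  // include list cannot be counted this way until they try to reconnect.
  static void recoverDecommissionedCount(String includesFile, String excludesFile)
      throws IOException {
    HostsFileReader hostsReader = new HostsFileReader(includesFile, excludesFile);
    ClusterMetrics.getMetrics().setDecommisionedNMs(
        hostsReader.getExcludedHosts().size());
  }
}
{code}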





[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2013-08-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742469#comment-13742469
 ] 

Jason Lowe commented on YARN-1071:
--

The NM counts only cover NMs that have connected to the RM since it started, and 
restarting the RM resets them all to zero.  Since the 3 NMs that were previously 
active retry and reconnect to the RM after it restarts, the ActiveNM count is 3.  
However, the other three nodes will not contact the RM because they're not running, 
which explains why those counts are zero after the restart.



[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart

2013-08-16 Thread Srimanth Gunturi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742481#comment-13742481
 ] 

Srimanth Gunturi commented on YARN-1071:


That makes sense as to why they are 0.

I can understand YARN not knowing about lost nodes, as it doesn't have a list of 
all NM hosts.

However, I think at least the decommissioned count should be set based on the 
exclude file information. YARN already knows about the excluded hosts, since it 
knows to ignore their communication.
