[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908179#comment-13908179 ] Hudson commented on YARN-1071: -- FAILURE: Integrated in Hadoop-Yarn-trunk #488 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/488/]) YARN-1071. Enabled ResourceManager to recover cluster metrics numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClusterMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java ResourceManager's decommissioned and lost node count is 0 after restart --- Key: YARN-1071 URL: https://issues.apache.org/jira/browse/YARN-1071 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: 
Jian He Fix For: 2.4.0 Attachments: YARN-1071.1.patch, YARN-1071.2.patch, YARN-1071.3.patch, YARN-1071.4.patch, YARN-1071.5.patch, YARN-1071.6.patch I had 6 nodes in a cluster with 2 NMs stopped. Then I put a host into YARN's {{yarn.resourcemanager.nodes.exclude-path}}. After running {{yarn rmadmin -refreshNodes}}, RM's JMX correctly showed decommissioned node count: {noformat} NumActiveNMs : 3, NumDecommissionedNMs : 1, NumLostNMs : 2, NumUnhealthyNMs : 0, NumRebootedNMs : 0 {noformat} After restarting RM, the counts were shown as below in JMX. {noformat} NumActiveNMs : 3, NumDecommissionedNMs : 0, NumLostNMs : 0, NumUnhealthyNMs : 0, NumRebootedNMs : 0 {noformat} Notice that the lost and decommissioned NM counts are both 0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908318#comment-13908318 ] Hudson commented on YARN-1071: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1680 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1680/]) YARN-1071. Enabled ResourceManager to recover cluster metrics numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908396#comment-13908396 ] Hudson commented on YARN-1071: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1705 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1705/]) YARN-1071. Enabled ResourceManager to recover cluster metrics numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907437#comment-13907437 ] Zhijie Shen commented on YARN-1071: --- The approach should fix the problem here. Some minor comments:
1. It's better to use System.getProperty("line.separator") to replace "\n":
{code}
+fStream.write("\n".getBytes());
{code}
2. Put the setter in HostsFileReader#refresh(2 params) instead?
{code}
+ClusterMetrics.getMetrics().setDecommisionedNMs(excludeList.size());
{code}
3. Check the IP as well, as we do in NodesListManager#isValidNode?
{code}
+ if (!context.getNodesListManager().getHostsReader().getExcludedHosts()
+.contains(hostName)) {
{code}
4. In testDecomissionedNMsMetricsOnRMRestart, is it good to involve an NM which has been decommissioned before restart, and verify that it does not corrupt the count after restart?
In addition to the whitelist scenario, there's another one that the approach may not handle:
a. host1 in blacklist
b. refresh nodes, count = 1
c. rm stops
d. *blacklist changes, host2 replaces host1 in blacklist*
e. rm starts
f. count = 1; however, actually both host1 and host2 are decommissioned
Not sure whether changing the blacklist between rm stop and start will be a common case. Probably we don't want to deal with it now.
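The suggestion above can be sketched as follows. This is an illustrative stand-in, not the actual YARN classes: `RefreshNodesSketch`, `refreshNodes`, and the plain `numDecommissionedNMs` field are hypothetical substitutes for `NodesListManager` and `ClusterMetrics#setDecommisionedNMs`. It also shows why the blacklist-change scenario is not covered: the gauge is derived only from the exclude list as it exists at refresh time.

```java
import java.util.HashSet;
import java.util.Set;

public class RefreshNodesSketch {
    // Stand-in for the ClusterMetrics decommissioned-NM gauge.
    static int numDecommissionedNMs = 0;

    // Stand-in for the refresh path: after re-reading the exclude file,
    // seed the gauge from the exclude list size. Edits made to the file
    // while the RM is down are invisible until the next refresh, which is
    // the limitation described in the scenario above.
    static void refreshNodes(Set<String> excludeList) {
        numDecommissionedNMs = excludeList.size();
    }

    public static void main(String[] args) {
        Set<String> exclude = new HashSet<>();
        exclude.add("host1");
        refreshNodes(exclude);
        System.out.println("NumDecommissionedNMs = " + numDecommissionedNMs);
    }
}
```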
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907654#comment-13907654 ] Jian He commented on YARN-1071: --- Thanks Zhijie for the review! bq. HostsFileReader#refresh(2 params) That's hadoop-common code; we should probably not touch it. bq. Check the ip as well as we do in NodesListManager#isValidNode? Good catch! Fixed the other comments as well. The patch doesn't fix the include-list scenario or changing the exclude list between RM restarts. For that, the RM may need to persistently save the decommissioned-NM state.
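A minimal sketch of the "check the IP as well" fix discussed above: a node counts as excluded if either its hostname or its resolved IP address appears in the exclude list, mirroring the hostname/IP handling in NodesListManager#isValidNode. The class and method names here (`ExcludeCheckSketch`, `isExcluded`) are illustrative, not the actual patch.

```java
import java.util.HashSet;
import java.util.Set;

public class ExcludeCheckSketch {
    // Matching on the hostname alone would miss exclude entries that
    // operators wrote as raw IP addresses, so both forms are checked.
    static boolean isExcluded(String hostName, String ip, Set<String> excludedHosts) {
        return excludedHosts.contains(hostName) || excludedHosts.contains(ip);
    }

    public static void main(String[] args) {
        Set<String> exclude = new HashSet<>();
        exclude.add("10.0.0.5"); // exclude entry written as an IP
        System.out.println(isExcluded("host1", "10.0.0.5", exclude)); // excluded via IP
        System.out.println(isExcluded("host2", "10.0.0.6", exclude)); // not excluded
    }
}
```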
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907673#comment-13907673 ] Zhijie Shen commented on YARN-1071: --- bq. That's hadoop-common code, we should probably not touch it. Reasonable. Then, close to where refresh is called? For example, in NodesListManager#createHostsFileReader. Or is it intentional not to set the counter in NodesListManager#disableHostsFileReader?
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907707#comment-13907707 ] Jian He commented on YARN-1071: --- bq. NodesListManager#disableHostsFileReader? Right, updated the metrics inside this method as well.
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907776#comment-13907776 ] Hadoop QA commented on YARN-1071: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630193/YARN-1071.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3139//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3139//console This message is automatically generated.
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907857#comment-13907857 ] Zhijie Shen commented on YARN-1071: --- I thought about another scenario:
a. host1 and host2 are in the exclude list
b. refresh nodes, count = 2
c. host1 starts again, count = 1
d. rm stops
e. rm starts
f. count = 2 after NodesListManager inits
g. count = 1 after host1 reconnects
Here, the decommission count decrease will eventually be reflected after rm restarts, so this scenario should still be covered by this approach. Correct me if I'm wrong about the process. Other than that, I'm generally fine with the patch, except that the temp dir created for the test should be deleted after test completion.
{code}
+ private final static File TEMP_DIR = new File(System.getProperty(
+"test.build.data", "/tmp"), "decommision");
{code}
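The requested cleanup could look like the sketch below: the temp directory under test.build.data (as in the quoted snippet) is removed recursively when the test finishes. `recursiveDelete` is a hypothetical helper, not code from the patch; in a JUnit test it would typically be called from an @After or @AfterClass method.

```java
import java.io.File;

public class TempDirCleanupSketch {
    // Same construction as the quoted test snippet.
    static final File TEMP_DIR = new File(
        System.getProperty("test.build.data", "/tmp"), "decommision");

    // Delete a directory tree bottom-up: children first, then the dir itself.
    static void recursiveDelete(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                recursiveDelete(child);
            }
        }
        f.delete();
    }

    public static void main(String[] args) {
        // Simulate a test creating the dir, then clean it up afterwards.
        TEMP_DIR.mkdirs();
        recursiveDelete(TEMP_DIR);
        System.out.println("exists after cleanup: " + TEMP_DIR.exists());
    }
}
```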
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907871#comment-13907871 ] Jian He commented on YARN-1071: --- bq. So this scenario should still be covered with this approach Correct. The new patch deletes the test dir on test completion.
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907918#comment-13907918 ] Hadoop QA commented on YARN-1071: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630228/YARN-1071.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3142//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3142//console This message is automatically generated.
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907956#comment-13907956 ] Zhijie Shen commented on YARN-1071: --- +1. The patch looks good to me. Vinod, do you want to have a look as well?
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908029#comment-13908029 ] Zhijie Shen commented on YARN-1071: --- Will commit the patch.
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908055#comment-13908055 ] Hudson commented on YARN-1071: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5203 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5203/]) YARN-1071. Enabled ResourceManager to recover cluster metrics numDecommissionedNMs after restarting. Contributed by Jian He. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1570469)
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905321#comment-13905321 ] Hadoop QA commented on YARN-1071: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12629714/YARN-1071.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMNodeTransitions The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3118//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3118//console This message is automatically generated. 
[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905785#comment-13905785 ]
Hadoop QA commented on YARN-1071:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12629826/YARN-1071.2.patch against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3122//console
This message is automatically generated.
[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905888#comment-13905888 ]
Hadoop QA commented on YARN-1071:
-
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12629834/YARN-1071.3.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3123//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3123//console
This message is automatically generated.
[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13902075#comment-13902075 ]
Jian He commented on YARN-1071:
---
I found that the decommissioned nodes in the current implementation come from two sources: nodes missing from the include list (if the include list is not empty), and nodes listed in the exclude list. Upon RM restart we can recover the count of nodes decommissioned via the exclude list by simply counting the hosts in that file, but we cannot know the nodes decommissioned via the include list unless those nodes come back to connect.
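The recoverable half of the count described above can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are not YARN's actual ClusterMetrics/NodesListManager API): on restart, the exclude-list contribution to NumDecommissionedNMs is just the number of host entries in the exclude file, assuming the usual hosts-file format of one host per line with blank lines and `#` comments ignored.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not YARN's real API: recover the exclude-list part of
// the decommissioned-NM count after an RM restart by counting exclude-file
// entries, since that part does not depend on nodes ever reconnecting.
public class DecommissionedCountRecovery {

    // Count non-blank, non-comment entries, mirroring a typical hosts file
    // (one host per line, '#' starts a comment line).
    static int countExcludedHosts(List<String> excludeFileLines) {
        int count = 0;
        for (String line : excludeFileLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Example exclude file matching the scenario in this issue:
        // one decommissioned host, so NumDecommissionedNMs restores to 1.
        List<String> excludeFile = Arrays.asList(
            "# decommissioned hosts",
            "node4.example.com",
            ""
        );
        System.out.println(countExcludedHosts(excludeFile)); // prints 1
    }
}
```

The include-list half has no such shortcut: as noted above, a node that is simply absent from a non-empty include list only becomes known to the RM when it attempts to register.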
[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742469#comment-13742469 ]
Jason Lowe commented on YARN-1071:
--
The NM counts only cover NMs that have connected to the RM since it started, so restarting the RM resets them all to zero. The 3 NMs that were previously active retry and reconnect to the RM after it restarts, which explains why the ActiveNM count is 3. The other three nodes, however, never contact the RM because they are not running, which explains why their counts are zero after the restart.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1071) ResourceManager's decommissioned and lost node count is 0 after restart
[ https://issues.apache.org/jira/browse/YARN-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742481#comment-13742481 ]
Srimanth Gunturi commented on YARN-1071:
That makes sense as to why they are 0. I can understand YARN not knowing about lost nodes, since it doesn't have a list of all NM hosts. However, I think at least the decommissioned count should be set based on the exclude-file information: YARN already knows about the excluded hosts, as it knows to ignore their communication.