[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071055#comment-14071055 ]

Karthik Kambatla commented on YARN-2273:
----------------------------------------

+1. Checking this in.

> NPE in ContinuousScheduling Thread crippled RM after DN flap
> ------------------------------------------------------------
>
>                 Key: YARN-2273
>                 URL: https://issues.apache.org/jira/browse/YARN-2273
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 2.3.0, 2.4.1
>         Environment: cdh5.0.2 wheezy
>            Reporter: Andy Skelton
>         Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch
>
> One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
> 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
> 	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
> 	at java.util.TimSort.sort(TimSort.java:203)
> 	at java.util.TimSort.sort(TimSort.java:173)
> 	at java.util.Arrays.sort(Arrays.java:659)
> 	at java.util.Collections.sort(Collections.java:217)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
> 	at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be submitted, but containers were not assigned and no progress was made. Restarting the RM resolved it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
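The trace above comes from NodeAvailableResourceComparator dereferencing a node that was removed between capturing the node list and sorting it. A minimal, self-contained sketch of that failure mode (the class, map, and field names here are illustrative stand-ins, not the real FairScheduler types):

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reduction of the race: the scheduling thread sorts a snapshot
// of node IDs with a comparator that looks each node up in a shared map,
// while another thread removes nodes from that map.
public class ComparatorRaceSketch {
  static final Map<String, Integer> availableMemory = new ConcurrentHashMap<>();

  static final Comparator<String> byAvailable = (a, b) ->
      // If a node was removed after the ID list was snapshotted, get()
      // returns null and the unboxing here throws NullPointerException,
      // surfacing inside TimSort exactly as in the reported trace.
      Integer.compare(availableMemory.get(b), availableMemory.get(a));

  static boolean sortThrowsNpeAfterRemoval() {
    availableMemory.clear();
    availableMemory.put("node-1", 4096);
    availableMemory.put("node-2", 8192);
    List<String> snapshot = new ArrayList<>(availableMemory.keySet());

    availableMemory.remove("node-2");   // the node leaves the cluster mid-cycle

    try {
      snapshot.sort(byAvailable);       // same shape as Collections.sort(...) in the trace
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(sortThrowsNpeAfterRemoval()
        ? "NPE during sort, as in the reported trace"
        : "sorted without error");
  }
}
```

The single-threaded removal between snapshot and sort makes the race deterministic for illustration; in the RM it depends on thread timing.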
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070965#comment-14070965 ]

Hadoop QA commented on YARN-2273:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12657187/YARN-2273-5.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4397//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4397//console

This message is automatically generated.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069085#comment-14069085 ]

Karthik Kambatla commented on YARN-2273:
----------------------------------------

Filed YARN-2328 for the latter comment.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069077#comment-14069077 ]

Karthik Kambatla commented on YARN-2273:
----------------------------------------

Still missing a return in this block:
{code}
    } catch (InterruptedException e) {
      LOG.error("Continuous scheduling thread interrupted. Exiting. ", e);
    }
{code}

Unrelated: I think ContinuousSchedulingThread should be a separate class like UpdateThread. Both should extend Thread and be singleton classes. We can address this in another JIRA. In that JIRA, we should also add a test to make sure FairScheduler#stop stops both the threads.
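The effect of the missing return can be seen in a runnable sketch: without it, the InterruptedException is caught and logged, but the while(true) loop keeps spinning and the thread never exits. This is a hypothetical stand-in for the loop under review, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the scheduling-loop shape being discussed.
// The key point: the `return` in the catch block is what actually makes
// an interrupt (e.g. from FairScheduler#stop) terminate the thread.
public class InterruptExitSketch {
  static final AtomicBoolean exited = new AtomicBoolean(false);

  static void continuousSchedulingLoop() {
    while (true) {
      try {
        Thread.sleep(50);  // stands in for one scheduling interval
        // a continuousSchedulingAttempt() call would run here
      } catch (InterruptedException e) {
        System.err.println("Continuous scheduling thread interrupted. Exiting.");
        exited.set(true);
        return;            // without this return, the loop would keep spinning
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread t = new Thread(InterruptExitSketch::continuousSchedulingLoop,
                          "ContinuousScheduling");
    t.start();
    t.interrupt();         // simulate the scheduler shutting the thread down
    t.join(2000);
    System.out.println("thread alive after interrupt: " + t.isAlive());
  }
}
```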
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069040#comment-14069040 ]

Hadoop QA commented on YARN-2273:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12656909/YARN-2273.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4384//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4384//console

This message is automatically generated.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068910#comment-14068910 ]

Hadoop QA commented on YARN-2273:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12656888/YARN-2273.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4383//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4383//console

This message is automatically generated.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068834#comment-14068834 ]

Karthik Kambatla commented on YARN-2273:
----------------------------------------

Thanks Wei. A few comments on the latest patch, some not specific to the changes in this patch.
# Can continuousSchedulingAttempt be package-private?
# We should log the following at ERROR level:
{code}
+      } catch (Throwable ex) {
+        LOG.warn("Error while attempting scheduling for node " + node +
+            ": " + ex.toString(), ex);
       }
{code}
# When the scheduling thread is interrupted, shouldn't we actually stop the thread? What are the cases where we want to ignore an interruption?
# Update the log message in the catch-block of InterruptedException to "Continuous scheduling thread interrupted." Maybe add "Exiting." if we do decide to shut the thread down.
# In the test, do we need to call FS#reinitialize()?
# In the test, should we catch all exceptions instead of just NPE?
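The per-node guard in the quoted diff can be illustrated with a runnable toy: each node's scheduling attempt gets its own try/catch(Throwable), so a failure on one node cannot take down the whole pass (or the thread). All names here are illustrative, not the FairScheduler code:

```java
import java.util.*;

// Hypothetical sketch of a guarded scheduling pass: one bad node logs an
// error (the review asks for ERROR level rather than WARN) and the pass
// continues with the remaining nodes.
public class PerNodeGuardSketch {
  static int attempted = 0;

  // stands in for attemptScheduling(node); null simulates a vanished node
  static void attemptScheduling(String node) {
    attempted++;
    if (node == null) {
      throw new NullPointerException("node vanished mid-pass");
    }
  }

  // returns the number of per-node failures survived during the pass
  static int schedulingPass(List<String> nodes) {
    int failures = 0;
    for (String node : nodes) {
      try {
        attemptScheduling(node);
      } catch (Throwable ex) {
        System.err.println("Error while attempting scheduling for node "
            + node + ": " + ex);
        failures++;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // the middle node has been removed; null stands in for the stale entry
    int failures = schedulingPass(Arrays.asList("node-A", null, "node-C"));
    System.out.println("pass completed, failures=" + failures
        + ", attempts=" + attempted);
  }
}
```

Catching Throwable this broadly is a deliberate trade-off for a daemon thread: it trades fail-fast behavior for keeping the scheduler alive.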
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068684#comment-14068684 ]

Wei Yan commented on YARN-2273:
-------------------------------

[~kasha], [~ozawa], will update a patch soon.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068640#comment-14068640 ]

Tsuyoshi OZAWA commented on YARN-2273:
--------------------------------------

Sounds reasonable. [~ywskycn], could you update the patch to address Karthik's comment?
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068508#comment-14068508 ]

Karthik Kambatla commented on YARN-2273:
----------------------------------------

Actually, let me retract that +1 temporarily. Can we add a test case here? We can move the while(true) into the run method and rename continuousScheduling to continuousSchedulingAttempt. The test from the replayException patch can be reused here, if we move the fail() into the catch-block.
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) > at java.util.TimSort.sort(TimSort.java:203) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) > at java.lang.Thread.run(Thread.java:744) > {code} > A few cycles later YARN was crippled. The RM was running and jobs could be > submitted but containers were not assigned and no progress was made. > Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
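The refactoring suggested in the comment above can be sketched as follows. This is a hypothetical, heavily trimmed illustration, not the actual FairScheduler code: String stands in for NodeId, Integer for Resource, and the scheduling work itself is elided. It only shows the shape of the change — the while(true) loop lives in run(), while one pass (continuousSchedulingAttempt) is separately callable from a unit test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ContinuousSchedulerSketch {
  // Stand-in for the scheduler's node map (NodeId -> available resource).
  final Map<String, Integer> nodes = new ConcurrentHashMap<>();

  // One scheduling pass, factored out so a test can invoke it directly.
  // Returns the node ids in scheduling order for easy assertions.
  List<String> continuousSchedulingAttempt() {
    List<String> nodeIdList = new ArrayList<>(nodes.keySet());
    // Sort by available resource, descending. A node removed after the
    // copy above would make nodes.get(id) return null, hence the guard.
    Collections.sort(nodeIdList, new Comparator<String>() {
      public int compare(String n1, String n2) {
        Integer r1 = nodes.get(n1), r2 = nodes.get(n2);
        if (r1 == null || r2 == null) {
          return 0; // node vanished mid-sort; treat as equal
        }
        return r2.compareTo(r1);
      }
    });
    // ... attemptScheduling(node) for each surviving node would go here ...
    return nodeIdList;
  }

  // The retry loop stays in the thread body, as the comment proposes.
  public void run() {
    while (true) {
      continuousSchedulingAttempt();
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}
```

A test can then call continuousSchedulingAttempt() once instead of spinning up the scheduling thread.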
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068500#comment-14068500 ] Karthik Kambatla commented on YARN-2273: +1. Checking this in.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068474#comment-14068474 ] Karthik Kambatla commented on YARN-2273: Yep. Thanks for pointing it out, Tsuyoshi. That makes sense.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067736#comment-14067736 ] Tsuyoshi OZAWA commented on YARN-2273: +1, confirmed that we can avoid the NPE with the patch.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067733#comment-14067733 ] Tsuyoshi OZAWA commented on YARN-2273: [~kasha], please note that the reproducing patch includes the following code snippet:
{code}
+    // Invoke the continuous scheduling once
+    try {
+      fs.oneTimeContinuousScheduling(nodeIdList);
+      fail("Exception is expected because one node is removed.");
+    } catch (NullPointerException e) {
+      // Exception is expected.
+    }
{code}
If this test passes with the reproducing patch applied, it means the NPE was thrown: fail() only fires when the call completes without the exception.
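The inverted assertion logic above can be confusing, so here is a self-contained toy illustration of the two test shapes. All names here are hypothetical stand-ins, not FairScheduler code: buggyPass models the unguarded map lookup, fixedPass models the guarded one, and the reproducing test expects the NPE while the regression test for the fix expects its absence.

```java
import java.util.Map;

public class AssertionFlipSketch {
  // Stand-in for the buggy scheduling pass: dereferences the lookup
  // result directly, so a removed node id triggers an NPE.
  static void buggyPass(Map<String, Integer> nodes, String id) {
    nodes.get(id).intValue(); // NPE if id was removed from the map
  }

  // Stand-in for the fixed pass: skips nodes that have gone away.
  static void fixedPass(Map<String, Integer> nodes, String id) {
    Integer available = nodes.get(id);
    if (available == null) {
      return; // node flapped away; nothing to schedule
    }
  }

  // Helper mirroring the try/fail/catch pattern in the test snippet.
  static boolean throwsNpe(Runnable r) {
    try {
      r.run();
      return false; // the reproducing test would fail() here
    } catch (NullPointerException e) {
      return true; // the regression test for the fix would fail() here
    }
  }
}
```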
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067593#comment-14067593 ] Karthik Kambatla commented on YARN-2273: The test passes with the replay exception patch.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067347#comment-14067347 ] Hadoop QA commented on YARN-2273:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12656686/YARN-2273-replayException.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4370//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4370//console
This message is automatically generated.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067290#comment-14067290 ] Karthik Kambatla commented on YARN-2273: [~wei.yan] - you mentioned writing a unit test to reproduce the issue. Can we include that in the patch?
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059307#comment-14059307 ] Hadoop QA commented on YARN-2273:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12655283/YARN-2273.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4278//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4278//console
This message is automatically generated.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059252#comment-14059252 ] Wei Yan commented on YARN-2273: Thanks, [~ozawa]. Uploaded a new patch.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059186#comment-14059186 ] Tsuyoshi OZAWA commented on YARN-2273: Makes sense. One additional point: should we add a null check at the following point in {{continuousScheduling}} to avoid the NPE? IIUC, {{getFSSchedulerNode(nodeId)}} can return null in this case.
{code}
-        if (Resources.fitsIn(minimumAllocation,
+        if (node != null && Resources.fitsIn(minimumAllocation,
             node.getAvailableResource())) {
           attemptScheduling(node);
         }
{code}
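The guard in the diff above can be illustrated with a self-contained toy version of the scheduling loop. This is a sketch, not FairScheduler code: String stands in for NodeId, Integer for Resource, MINIMUM_ALLOCATION for minimumAllocation, and the attemptScheduling call is replaced by a counter so the behavior is observable.

```java
import java.util.List;
import java.util.Map;

public class NullGuardSketch {
  static final int MINIMUM_ALLOCATION = 1; // stand-in for minimumAllocation

  // One pass over a previously copied id list; returns how many nodes
  // were actually offered to the (elided) attemptScheduling step.
  static int schedulePass(Map<String, Integer> nodes, List<String> nodeIdList) {
    int attempted = 0;
    for (String nodeId : nodeIdList) {
      // The lookup may return null: the node was removed after the id
      // list was copied. The null check mirrors the proposed diff.
      Integer available = nodes.get(nodeId);
      if (available != null && available >= MINIMUM_ALLOCATION) {
        attempted++; // attemptScheduling(node) would run here
      }
    }
    return attempted;
  }
}
```

A pass over an id list containing a removed node then simply skips it instead of throwing.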
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059134#comment-14059134 ] Tsuyoshi OZAWA commented on YARN-2273:

Thanks, [~ywskycn]. As you mentioned, {{nodes}} itself is thread-safe. IIUC, though, the {{nodes}} map read by {{NodeAvailableResourceComparator#compare}} can diverge from the {{nodeIdList}} snapshot taken in {{continuousScheduling}}. If the copy is made inside the synchronized block in {{continuousScheduling}}, {{nodeIdList}} stays consistent with {{nodes}} while it is being sorted. And if {{nodeIdList}} always contained exactly the keys of {{nodes}}, we wouldn't need a null check inside {{NodeAvailableResourceComparator#compare}}. Please correct me if I'm wrong.

{code}
    @Override
    public int compare(NodeId n1, NodeId n2) {
      return RESOURCE_CALCULATOR.compare(clusterResource,
          nodes.get(n2).getAvailableResource(),
          nodes.get(n1).getAvailableResource());
    }
{code}
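The divergence described here can be reproduced deterministically in a standalone sketch (editor's illustration with hypothetical {{String}}/{{Integer}} stand-ins, not the actual Hadoop types): take a snapshot of the keys, remove an entry from the live map, then sort the snapshot with a comparator that dereferences the live map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleSnapshotSort {
    // Hypothetical stand-in for the scheduler's thread-safe nodes map.
    static final Map<String, Integer> nodes = new ConcurrentHashMap<>();

    static String sortStaleSnapshot() {
        nodes.clear();
        nodes.put("node-1", 4096);
        nodes.put("node-2", 2048);
        // Snapshot taken from the live map, as continuousScheduling() does.
        List<String> nodeIdList = new ArrayList<>(nodes.keySet());
        // A node is removed between the copy and the sort (the "DN flap").
        nodes.remove("node-2");
        try {
            // The comparator dereferences the live map, like
            // NodeAvailableResourceComparator#compare does.
            Collections.sort(nodeIdList,
                (a, b) -> nodes.get(b).compareTo(nodes.get(a)));
            return "sorted";
        } catch (NullPointerException e) {
            return "NPE";   // nodes.get("node-2") returned null
        }
    }

    public static void main(String[] args) {
        System.out.println(sortStaleSnapshot());   // prints "NPE"
    }
}
```

The thread safety of the map doesn't help: the snapshot still names a key the map no longer holds, so the comparator's {{get}} returns null mid-sort.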
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059145#comment-14059145 ] Wei Yan commented on YARN-2273:

Oh, I see your point. Yes, moving the copy operation into the synchronized block may help, but it hurts performance, since other services also operate on {{nodes}}. The uploaded patch instead adds a check before we do the comparison, and after the sort, when scheduling starts, we still check whether each node is alive.
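A null-tolerant comparator in the spirit of this patch could look like the following sketch (editor's illustration with hypothetical stand-in types; the exact treatment of removed nodes here is an assumption, not the patch's code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafeSort {
    static final Map<String, Integer> nodes = new ConcurrentHashMap<>();

    // Hypothetical null-tolerant comparator: a node that vanished after the
    // snapshot is treated as having zero available resource, so it sorts
    // last instead of throwing an NPE.
    static final Comparator<String> byAvailableDesc =
        (n1, n2) -> Integer.compare(nodes.getOrDefault(n2, 0),
                                    nodes.getOrDefault(n1, 0));

    static List<String> demo() {
        nodes.clear();
        nodes.put("node-1", 1024);
        nodes.put("node-2", 4096);
        nodes.put("node-3", 2048);
        List<String> snapshot = new ArrayList<>(nodes.keySet());
        nodes.remove("node-3");            // removed after the snapshot
        snapshot.sort(byAvailableDesc);    // no NPE; the dead node sinks
        return snapshot;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // [node-2, node-1, node-3]
    }
}
```

Note this only prevents the NPE; if {{nodes}} keeps mutating during the sort, the comparator can still observe inconsistent values, which is why the patch also re-checks that a node is alive before scheduling on it.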
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059116#comment-14059116 ] Hadoop QA commented on YARN-2273:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12655241/YARN-2273.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4276//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4276//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059096#comment-14059096 ] Wei Yan commented on YARN-2273:

Thanks, [~ozawa]. IMO, since {{nodeIdList}} is a copy of the keys of {{nodes}}, and {{nodes}} itself is thread-safe, there is no race condition between the copy operation and {{Collections.sort}}. Because {{nodeIdList}} is a copy, any later change to {{nodes}} is not reflected in {{nodeIdList}}.
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059084#comment-14059084 ] Tsuyoshi OZAWA commented on YARN-2273:

Hi [~ywskycn], thank you for taking this JIRA. It looks like a race condition between {{new ArrayList(nodes.keySet())}} and {{Collections.sort}}. One straightforward fix is to move {{new ArrayList(nodes.keySet())}} into the synchronized block. I think that is the simpler way, but one concern is that the lock could degrade performance. Wei, [~sandyr], what do you think?

{code}
  private void continuousScheduling() {
    while (true) {
      List<NodeId> nodeIdList = new ArrayList<NodeId>(nodes.keySet());
      // Sort the nodes by space available on them, so that we offer
      // containers on emptier nodes first, facilitating an even spread. This
      // requires holding the scheduler lock, so that the space available on a
      // node doesn't change during the sort.
      synchronized (this) {
        Collections.sort(nodeIdList, nodeAvailableResourceComparator);
      }
      ..
  }
{code}
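The locking fix described here, copying and sorting under the same lock, can be sketched standalone as follows (editor's illustration; the lock object, method names, and {{String}}/{{Integer}} stand-ins are all hypothetical, not the FairScheduler's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LockedSnapshotSort {
    static final Map<String, Integer> nodes = new ConcurrentHashMap<>();
    static final Object schedulerLock = new Object();

    // Copy AND sort under the same lock that node removal takes, so no node
    // can disappear between the snapshot and the sort, and the comparator's
    // map lookups can never return null.
    static List<String> sortedNodeIds() {
        synchronized (schedulerLock) {
            List<String> nodeIdList = new ArrayList<>(nodes.keySet());
            nodeIdList.sort((a, b) -> nodes.get(b).compareTo(nodes.get(a)));
            return nodeIdList;
        }
    }

    static void removeNode(String id) {
        synchronized (schedulerLock) {
            nodes.remove(id);
        }
    }

    public static void main(String[] args) {
        nodes.put("node-1", 2048);
        nodes.put("node-2", 4096);
        System.out.println(sortedNodeIds());   // [node-2, node-1]
        removeNode("node-2");
        System.out.println(sortedNodeIds());   // [node-1]
    }
}
```

The trade-off raised in the comment is visible here: every removal now contends with the scheduling thread for the same lock.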
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058989#comment-14058989 ] Wei Yan commented on YARN-2273:

I ran a test case locally: disable continuous scheduling, add a {{oneTimeContinuousScheduling()}} function to the FairScheduler, and remove one node before calling {{oneTimeContinuousScheduling()}}. The exception happened.

{code}
// Add two nodes
RMNode node1 = MockNodes.newNodeInfo(1, Resources.createResource(8 * 1024, 8), 1, "127.0.0.1");
NodeAddedSchedulerEvent nodeEvent1 = new NodeAddedSchedulerEvent(node1);
fs.handle(nodeEvent1);
RMNode node2 = MockNodes.newNodeInfo(1, Resources.createResource(8 * 1024, 8), 2, "127.0.0.2");
NodeAddedSchedulerEvent nodeEvent2 = new NodeAddedSchedulerEvent(node2);
fs.handle(nodeEvent2);
Assert.assertEquals("We should have two alive nodes.", 2, fs.nodes.size());

List<NodeId> nodeIdList = new ArrayList<NodeId>(fs.nodes.keySet());
Assert.assertEquals("We should have two nodes to be sorted.", 2, nodeIdList.size());

// Remove one node
NodeRemovedSchedulerEvent removeNode1 = new NodeRemovedSchedulerEvent(node1);
fs.handle(removeNode1);
fs.update();
Assert.assertEquals("We should only have one alive node.", 1, fs.nodes.size());

// Invoke the continuous scheduling once
fs.oneTimeContinuousScheduling(nodeIdList);
{code}

Will upload a patch shortly.
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057930#comment-14057930 ] Wei Yan commented on YARN-2273:

Thanks for the catch, [~skeltoac]. A quick guess is that {{NodeAvailableResourceComparator}} doesn't check whether the node is still alive when doing the comparison; a node may be removed during the sorting process. I'll re-check it.