[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-22 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071055#comment-14071055 ]

Karthik Kambatla commented on YARN-2273:


+1. Checking this in.

> NPE in ContinuousScheduling Thread crippled RM after DN flap
> 
>
> Key: YARN-2273
> URL: https://issues.apache.org/jira/browse/YARN-2273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.3.0, 2.4.1
> Environment: cdh5.0.2 wheezy
>Reporter: Andy Skelton
> Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, 
> YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch
>
>
> One DN experienced memory errors and entered a cycle of rebooting and 
> rejoining the cluster. After the second time the node went away, the RM 
> produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Application attempt appattempt_1404858438119_4352_01 released container 
> container_1404858438119_4352_01_04 on node: host: 
> node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 
> available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: 
> 
> 2014-07-09 21:47:36,571 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
>   at java.util.TimSort.sort(TimSort.java:203)
>   at java.util.TimSort.sort(TimSort.java:173)
>   at java.util.Arrays.sort(Arrays.java:659)
>   at java.util.Collections.sort(Collections.java:217)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be 
> submitted but containers were not assigned and no progress was made. 
> Restarting the RM resolved it.
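The stack trace points at a comparator that dereferences per-node state while `Collections.sort` is running: when the flapping DN is removed mid-sort, the lookup returns null and the sort throws. A minimal sketch of the defensive pattern (class, field, and method names here are hypothetical stand-ins, not the actual FairScheduler code) — sort a snapshot rather than the live collection, and treat a concurrently removed node as having no available resources instead of throwing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NodeSortSketch {
    // Hypothetical stand-in for the scheduler's node-to-available-resource map.
    static final Map<String, Integer> availableByNode = new ConcurrentHashMap<>();

    // Defensive comparator: a node removed mid-sort yields a null lookup,
    // which is treated as zero availability instead of throwing an NPE.
    static int compareNodes(String a, String b) {
        Integer availA = availableByNode.get(a);
        Integer availB = availableByNode.get(b);
        int x = (availA == null) ? 0 : availA;
        int y = (availB == null) ? 0 : availB;
        return Integer.compare(y, x); // most-available node first
    }

    static List<String> sortedSnapshot() {
        // Sorting a snapshot means removals during the sort cannot make the
        // comparator observe inconsistent state for the same element.
        List<String> snapshot = new ArrayList<>(availableByNode.keySet());
        snapshot.sort(NodeSortSketch::compareNodes);
        return snapshot;
    }

    public static void main(String[] args) {
        availableByNode.put("node-a", 4);
        availableByNode.put("node-b", 8);
        System.out.println(sortedSnapshot());
    }
}
```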



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-22 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070965#comment-14070965 ]

Hadoop QA commented on YARN-2273:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657187/YARN-2273-5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4397//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4397//console

This message is automatically generated.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069085#comment-14069085 ]

Karthik Kambatla commented on YARN-2273:


Filed YARN-2328 for the latter comment. 



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069077#comment-14069077 ]

Karthik Kambatla commented on YARN-2273:


Still missing a return in this block:
{code}
} catch (InterruptedException e) {
  LOG.error("Continuous scheduling thread interrupted. Exiting.", e);
}
{code}

Unrelated, I think ContinuousSchedulingThread should be a separate class like 
UpdateThread. Both should extend Thread and be singleton classes. We can 
address this in another JIRA. In that JIRA, we should also add a test to make 
sure FairScheduler#stop stops both the threads. 
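The suggestion above — a dedicated thread class with proper interrupt handling, stoppable from FairScheduler#stop — can be sketched as follows. This is an illustrative shape only (the class name and interval are assumptions, not the committed patch); the `return` in the catch block is the missing piece the review calls out:

```java
public class SchedulingThreadSketch extends Thread {
    private volatile long iterations = 0;

    public SchedulingThreadSketch() {
        super("ContinuousScheduling");
        setDaemon(true);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            iterations++;
            try {
                Thread.sleep(5); // stand-in for the continuous-scheduling interval
            } catch (InterruptedException e) {
                // The missing "return" from the review: without it the loop
                // keeps running on a cleared interrupt flag.
                return;
            }
        }
    }

    public long getIterations() {
        return iterations;
    }

    // What a FairScheduler#stop analogue would call for each of its threads.
    public void shutdown() throws InterruptedException {
        interrupt();
        join(1000);
    }
}
```

A test along the lines the comment asks for would start the thread, call `shutdown()`, and assert the thread is no longer alive.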



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069040#comment-14069040 ]

Hadoop QA commented on YARN-2273:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12656909/YARN-2273.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4384//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4384//console

This message is automatically generated.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068910#comment-14068910 ]

Hadoop QA commented on YARN-2273:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12656888/YARN-2273.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4383//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4383//console

This message is automatically generated.


[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068834#comment-14068834 ]

Karthik Kambatla commented on YARN-2273:


Thanks, Wei. A few comments on the latest patch, some not specific to the 
changes in this patch.

# Can continuousSchedulingAttempt be package-private? 
# We should log the following at ERROR level
{code}
+  } catch (Throwable ex) {
+    LOG.warn("Error while attempting scheduling for node " + node +
+        ": " + ex.toString(), ex);
   }
{code}
# When the scheduling thread is interrupted, shouldn't we actually stop the 
thread? What are the cases where we want to ignore an interruption?
# Update the log message in the catch block of InterruptedException to 
"Continuous scheduling thread interrupted." Maybe add "Exiting." if we do 
decide to shut the thread down. 
# In the test, do we need to call FS#reinitialize()? 
# In the test, should we catch all exceptions instead of just NPE?
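Comment #2 above (catch Throwable around each node's scheduling attempt and log it at ERROR) can be sketched like this. The names are hypothetical stand-ins for FairScheduler's per-node loop, and the sketch uses java.util.logging rather than the project's logger purely to stay self-contained:

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PerNodeCatchSketch {
    private static final Logger LOG = Logger.getLogger("FairSchedulerSketch");

    // Hypothetical stand-in for continuousSchedulingAttempt's per-node loop:
    // a failure on one node is logged and the loop moves on, so a single
    // flapping node cannot kill the scheduling thread.
    public static int attemptAll(List<Runnable> perNodeAttempts) {
        int attempted = 0;
        for (Runnable attempt : perNodeAttempts) {
            try {
                attempt.run();
            } catch (Throwable ex) {
                // Per the review: log at ERROR (SEVERE here), not WARN, so the
                // unexpected failure is visible without unwinding the thread.
                LOG.log(Level.SEVERE, "Error while attempting scheduling", ex);
            }
            attempted++;
        }
        return attempted;
    }
}
```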



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Wei Yan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068684#comment-14068684 ]

Wei Yan commented on YARN-2273:
---

[~kasha], [~ozawa], will update a patch soon.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Tsuyoshi OZAWA (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068640#comment-14068640 ]

Tsuyoshi OZAWA commented on YARN-2273:
--

Sounds reasonable. [~ywskycn], could you update to address Karthik's comment?



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068508#comment-14068508
 ] 

Karthik Kambatla commented on YARN-2273:


Actually, let me retract that +1 temporarily. Can we add a test case here? 

We can move the while(true) loop into the run method and rename continuousScheduling to 
continuousSchedulingAttempt. The test from the replayException patch can then be reused 
here, if we move the fail() into the catch-block. 
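A rough sketch of that refactor (class and method names are illustrative, following the naming suggested above, not the actual FairScheduler source): the loop lives in run(), and a single scheduling pass is a separately callable method a unit test can invoke once.

```java
// Hypothetical sketch of the suggested refactor, not FairScheduler code:
// the while(true) loop moves into run(), and the loop body becomes
// continuousSchedulingAttempt(), which a test can call directly.
public class ContinuousSchedulingSketch extends Thread {
  private volatile boolean running = true;
  int attempts = 0;  // counts passes; only here so the sketch is observable

  @Override
  public void run() {
    while (running) {
      try {
        continuousSchedulingAttempt();  // one scheduling pass
        Thread.sleep(100);              // pass interval is an assumption
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // A test can invoke a single pass here (and put fail() in a catch-block)
  // without spinning the loop or starting the thread.
  void continuousSchedulingAttempt() {
    attempts++;  // real code would sort nodes and attempt scheduling
  }

  void stopScheduling() { running = false; }
}
```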




[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068500#comment-14068500
 ] 

Karthik Kambatla commented on YARN-2273:


+1. Checking this in. 



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-21 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068474#comment-14068474
 ] 

Karthik Kambatla commented on YARN-2273:


Yep. Thanks for pointing it out, Tsuyoshi. That makes sense. 



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067736#comment-14067736
 ] 

Tsuyoshi OZAWA commented on YARN-2273:
--

+1, confirmed that we can avoid NPE with the patch.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067733#comment-14067733
 ] 

Tsuyoshi OZAWA commented on YARN-2273:
--

[~kasha], please note that the reproducing patch includes the following code 
snippet:

{code}
+// Invoke the continuous scheduling once
+try {
+  fs.oneTimeContinuousScheduling(nodeIdList);
+  fail("Exception is expected because one node is removed.");
+} catch (NullPointerException e) {
+  // Exception is expected.
+}
{code}

If the test passes with the reproducing patch, it means we hit the NPE: the expected 
NullPointerException is thrown and caught, so fail() is never reached.
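Outside of YARN, the failure mode that snippet exercises can be reduced to a self-contained illustration (not FairScheduler code): sort a stale snapshot of node ids with a comparator that looks each id up in a live map, after one node has been removed.

```java
import java.util.*;

// Minimal illustration of the NPE in NodeAvailableResourceComparator:
// a node removed after the id list was snapshotted makes the lookup
// return null, and the comparator dereferences it during the sort.
public class NpeDemo {
  // Returns true if sorting the snapshot hit an NPE because a node
  // was removed from the map after the snapshot was taken.
  static boolean sortTriggersNpe(Map<String, Integer> available, List<String> snapshot) {
    try {
      snapshot.sort(Comparator.comparing((String id) -> available.get(id)));
      return false;
    } catch (NullPointerException e) {
      return true;  // null lookup for the removed node, as in the RM log
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> available = new HashMap<>();
    available.put("node1", 4);
    available.put("node2", 8);
    available.put("node3", 2);
    List<String> snapshot = new ArrayList<>(available.keySet());
    available.remove("node2");  // the node flaps away after the snapshot
    System.out.println(sortTriggersNpe(available, snapshot));  // true
  }
}
```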



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-19 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067593#comment-14067593
 ] 

Karthik Kambatla commented on YARN-2273:


The test passes with the replay exception patch. 



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067347#comment-14067347
 ] 

Hadoop QA commented on YARN-2273:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12656686/YARN-2273-replayException.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4370//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4370//console

This message is automatically generated.


[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-18 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067290#comment-14067290
 ] 

Karthik Kambatla commented on YARN-2273:


[~wei.yan] - you mentioned writing a unit test to reproduce the issue. Can we 
include that in the patch? 



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059307#comment-14059307
 ] 

Hadoop QA commented on YARN-2273:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12655283/YARN-2273.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4278//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4278//console

This message is automatically generated.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059252#comment-14059252
 ] 

Wei Yan commented on YARN-2273:
---

Thanks, [~ozawa]. Uploaded a new patch.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059186#comment-14059186
 ] 

Tsuyoshi OZAWA commented on YARN-2273:
--

Makes sense.

One additional point: should we add a null check at the following point in 
{{continuousScheduling}} to avoid an NPE? IIUC, {{getFSSchedulerNode(nodeId)}} can 
return null in this case.
{code}
-      if (Resources.fitsIn(minimumAllocation,
+      if (node != null && Resources.fitsIn(minimumAllocation,
           node.getAvailableResource())) {
         attemptScheduling(node);
       }
{code}



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059134#comment-14059134
 ] 

Tsuyoshi OZAWA commented on YARN-2273:
--

Thanks, [~ywskycn].
As you mentioned, {{nodes}} itself is thread-safe. IIUC, though, {{nodes}} in 
{{NodeAvailableResourceComparator#compare}} can diverge from {{nodeIdList}} in 
{{continuousScheduling}}. If the copy is done inside the synchronized block in 
{{continuousScheduling}}, the copied {{nodeIdList}} stays consistent with 
{{nodes}} while it is being sorted. If {{nodeIdList}} contains exactly the same 
keys as {{nodes}}, I think we don't need the check inside 
{{NodeAvailableResourceComparator#compare}}. Please correct me if I'm wrong.

{code} 
@Override
public int compare(NodeId n1, NodeId n2) {
  return RESOURCE_CALCULATOR.compare(clusterResource,
      nodes.get(n2).getAvailableResource(),
      nodes.get(n1).getAvailableResource());
}
{code}



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059145#comment-14059145
 ] 

Wei Yan commented on YARN-2273:
---

Oh, I see your point. Yes, moving the copy operation into the synchronized block 
may help, but it hurts performance, since other services also operate on the nodes. 
The uploaded patch adds a check before we do the comparison. And after the sort, 
when scheduling starts, we still check whether the node is still alive.
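For illustration, the patch's idea can be sketched like this: treat a node id whose map entry has been removed as having no available resource, so the sort stays total and never dereferences null. This is only a sketch with made-up names ({{NodeInfo}}, {{available}}), not the actual FairScheduler API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafeCompareSketch {
  public static class NodeInfo {
    public final int available;
    public NodeInfo(int available) { this.available = available; }
  }

  // Live map of cluster nodes; entries can vanish at any time (DN flap).
  public static final Map<String, NodeInfo> nodes = new ConcurrentHashMap<>();

  // A removed (null) node sorts as if it had zero available resource,
  // so the comparator never throws NPE mid-sort.
  public static final Comparator<String> BY_AVAILABLE_DESC = (n1, n2) -> {
    NodeInfo a = nodes.get(n1);
    NodeInfo b = nodes.get(n2);
    int av1 = (a == null) ? 0 : a.available;
    int av2 = (b == null) ? 0 : b.available;
    return Integer.compare(av2, av1); // most-available first
  };

  public static List<String> sortedSnapshot() {
    List<String> ids = new ArrayList<>(nodes.keySet()); // snapshot of the keys
    ids.sort(BY_AVAILABLE_DESC);
    return ids;
  }
}
```

A node removed between the snapshot and the sort then simply sinks to the end of the list instead of crashing the ContinuousScheduling thread.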



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059116#comment-14059116
 ] 

Hadoop QA commented on YARN-2273:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12655241/YARN-2273.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4276//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4276//console

This message is automatically generated.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059096#comment-14059096
 ] 

Wei Yan commented on YARN-2273:
---

Thanks, [~ozawa].
IMO, since {{nodeIdList}} makes a copy of the keys of {{nodes}}, and {{nodes}} 
itself is thread-safe, there is no race condition between the copy operation and 
the {{Collections.sort}}. {{nodeIdList}} is a snapshot of {{nodes}}, so any later 
change to {{nodes}} is not reflected in {{nodeIdList}}.
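The snapshot semantics being described can be demonstrated in isolation (generic names, not the FairScheduler code): copying a {{ConcurrentHashMap}} key set into an {{ArrayList}} freezes the id list, so a node removed afterwards is still present in the list while {{nodes.get()}} no longer knows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotCopyDemo {
  // Returns the size of the key snapshot taken *before* a removal.
  public static int snapshotSizeAfterRemoval() {
    Map<String, Integer> nodes = new ConcurrentHashMap<>();
    nodes.put("node-1", 1);
    nodes.put("node-2", 2);

    List<String> nodeIdList = new ArrayList<>(nodes.keySet()); // snapshot copy

    nodes.remove("node-1"); // node flaps away after the copy

    // The snapshot still holds both ids even though the map lost one.
    return nodeIdList.size();
  }

  public static void main(String[] args) {
    System.out.println(snapshotSizeAfterRemoval()); // prints 2
  }
}
```

This is exactly why an id in the sorted list can point at a node the live map no longer contains.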




[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059084#comment-14059084
 ] 

Tsuyoshi OZAWA commented on YARN-2273:
--

Hi [~ywskycn], thank you for taking this JIRA. It looks like a race condition between 
{{new ArrayList<NodeId>(nodes.keySet())}} and {{Collections.sort}}. One 
straightforward way to fix it is moving {{new ArrayList<NodeId>(nodes.keySet())}} into 
the synchronized block. I think it's the simpler way, but one concern is that the 
lock could degrade performance. Wei, [~sandyr], what do you think?
{code}
  private void continuousScheduling() {
    while (true) {
      List<NodeId> nodeIdList = new ArrayList<NodeId>(nodes.keySet());
      // Sort the nodes by space available on them, so that we offer
      // containers on emptier nodes first, facilitating an even spread. This
      // requires holding the scheduler lock, so that the space available on a
      // node doesn't change during the sort.
      synchronized (this) {
        Collections.sort(nodeIdList, nodeAvailableResourceComparator);
      }
      ..
    }
{code}
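The alternative proposed above, with both the copy and the sort under the same lock, can be sketched as follows. Names ({{CopyAndSortUnderLock}}, an integer "available" value) are illustrative, not the FairScheduler code; the point is that a removal taking the same lock can never race the comparator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CopyAndSortUnderLock {
  private final Map<String, Integer> nodes = new ConcurrentHashMap<>();
  private final Object schedulerLock = new Object();

  public void addNode(String id, int available) {
    nodes.put(id, available);
  }

  public void removeNode(String id) {
    synchronized (schedulerLock) { // removal also takes the lock
      nodes.remove(id);
    }
  }

  public List<String> snapshotSortedByAvailable() {
    synchronized (schedulerLock) {
      List<String> ids = new ArrayList<>(nodes.keySet()); // copy inside the lock
      // Sort before any removal can race: every id in the snapshot still
      // has a live entry in the map, so get() cannot return null here.
      ids.sort((a, b) -> nodes.get(b) - nodes.get(a));
      return ids;
    }
  }
}
```

The trade-off is the one raised in the thread: node removal and the scheduling loop now serialize on the same lock.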



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-11 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058989#comment-14058989
 ] 

Wei Yan commented on YARN-2273:
---

I ran a test case locally: disable continuousScheduling, add a 
oneTimeContinuousScheduling() function in the FairScheduler, and remove one 
node before calling oneTimeContinuousScheduling(). The exception 
happened.
{code}
// Add two nodes
RMNode node1 =
    MockNodes.newNodeInfo(1, Resources.createResource(8 * 1024, 8), 1,
        "127.0.0.1");
NodeAddedSchedulerEvent nodeEvent1 = new NodeAddedSchedulerEvent(node1);
fs.handle(nodeEvent1);
RMNode node2 =
    MockNodes.newNodeInfo(1, Resources.createResource(8 * 1024, 8), 2,
        "127.0.0.2");
NodeAddedSchedulerEvent nodeEvent2 = new NodeAddedSchedulerEvent(node2);
fs.handle(nodeEvent2);
Assert.assertEquals("We should have two alive nodes.", 2, fs.nodes.size());

List<NodeId> nodeIdList = new ArrayList<NodeId>(fs.nodes.keySet());
Assert.assertEquals("We should have two nodes to be sorted.", 2,
    nodeIdList.size());

// Remove the node
NodeRemovedSchedulerEvent removeNode1 =
    new NodeRemovedSchedulerEvent(node1);
fs.handle(removeNode1);
fs.update();
Assert.assertEquals("We should only have one alive node.", 1,
    fs.nodes.size());

// Invoke the continuous scheduling once
fs.oneTimeContinuousScheduling(nodeIdList);
{code}

Will upload a patch shortly.



[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap

2014-07-10 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057930#comment-14057930
 ] 

Wei Yan commented on YARN-2273:
---

Thanks for the catch, [~skeltoac].

A quick guess is that the NodeAvailableResourceComparator doesn't check whether 
the node is alive when doing the comparison. A node may be removed during the 
sorting process. I'll re-check it.
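That guess can be reproduced in miniature outside YARN (illustrative names only): copy the key set, remove a key from the live map, then sort with a comparator that dereferences the map. The comparator unboxes a null {{Integer}} and the NPE surfaces inside the sort, just as in the RM stack trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SortAfterRemovalRepro {
  public static boolean sortThrowsNpe() {
    Map<String, Integer> nodes = new ConcurrentHashMap<>();
    nodes.put("n1", 4);
    nodes.put("n2", 8);

    List<String> ids = new ArrayList<>(nodes.keySet()); // snapshot of the keys

    nodes.remove("n2"); // DN flap between the copy and the sort

    try {
      // Comparator dereferences the live map; nodes.get("n2") is now null
      // and unboxing it throws NullPointerException inside the sort.
      ids.sort((a, b) -> nodes.get(b) - nodes.get(a));
      return false;
    } catch (NullPointerException expected) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(sortThrowsNpe());
  }
}
```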
