[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2021-01-08 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261274#comment-17261274
 ] 

Szilard Nemeth commented on YARN-10295:
---

Hi @wangda,
There's a dedicated jira to deal with test failures on branch-3.2: YARN-10249.

There are other commits that went in under similar circumstances: 
https://issues.apache.org/jira/browse/YARN-10194?focusedCommentId=17093663=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17093663

Nowadays test results are looking better for branch-3.2; see this for example: 
https://issues.apache.org/jira/browse/YARN-10528?focusedCommentId=17252715=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#



> CapacityScheduler NPE can cause apps to get stuck without resources
> ---
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.2.2, 3.1.5
>
> Attachments: YARN-10295.001.branch-3.1.patch, 
> YARN-10295.001.branch-3.2.patch, YARN-10295.002.branch-3.1.patch, 
> YARN-10295.002.branch-3.2.patch
>
>
> When the CapacityScheduler asynchronous scheduling is enabled and the log level 
> is set to DEBUG, there is an edge case where a NullPointerException can cause 
> the scheduler thread to exit and the apps to get stuck without allocated 
> resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:apply(681)) - Reserved 
> container=container_e10_1590502305306_0660_01_000115, on node=host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used= with 
> resource=
> 2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:internalUnreserve(743)) - Application 
> application_1590502305306_0660 unreserved  on node host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used=, currently 
> has 0 at priority 11; currentReservation  on node-label=
> 2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
> Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough 
> memory, so after a short while it gets unreserved. However, because the 
> scheduler thread runs asynchronously, it may already have entered the 
> following if block located at 
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
> because at that time _node.getReservedContainer()_ wasn't null. Calling it a 
> second time to get the ApplicationAttemptId then throws an NPE, as the 
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping scheduling since node " + node.getNodeID()
>         + " is reserved by application " + node.getReservedContainer()
>         .getContainerId().getApplicationAttemptId());
>   }
>   return null;
> }
> {code}
> A fix would be to store the container object in a local variable before the 
> if block, so the null check and the later dereference use the same value. 
> Only branch-3.1/3.2 are affected, because the newer branches have YARN-9664, 
> which indirectly fixed this.
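
A minimal sketch of the approach described above, for illustration only (not the committed patch); it assumes the {{RMContainer}} return type of {{SchedulerNode.getReservedContainer()}} and the surrounding names used in the quoted snippet. The reserved container is read once into a local variable, so the null check and the debug-log dereference cannot observe different values when an unreserve races with the scheduler thread:

{code:java}
// Read the reserved container exactly once; a concurrent unreserve could
// otherwise make a second getReservedContainer() call return null.
RMContainer reservedContainer = node.getReservedContainer();

// Do not schedule if there are any reservations to fulfill on the node
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}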



[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-12-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249870#comment-17249870
 ] 

Wangda Tan commented on YARN-10295:
---

[~snemeth], just came across this issue. I saw a bunch of failed unit tests in 
the above Jenkins output; are they related to the patch?







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-10 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130896#comment-17130896
 ] 

Szilard Nemeth commented on YARN-10295:
---

Hi [~bteke],
Latest patches LGTM, committed them to their respective branches.
Resolving jira.







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-05 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126981#comment-17126981
 ] 

Hadoop QA commented on YARN-10295:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
41s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
30s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
29s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  0s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}393m 53s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
38s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}465m 26s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector 
|
|   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestParentQueue |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.policy.TestFairOrderingPolicy |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestSchedulingRequestContainerAllocation
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
 |
|   | 

[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-05 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126613#comment-17126613
 ] 

Benjamin Teke commented on YARN-10295:
--

Hi [~snemeth],

Thanks. Good point; I added an explanatory comment to the branch-3.1 and 
branch-3.2 patches.







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-05 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126577#comment-17126577
 ] 

Szilard Nemeth commented on YARN-10295:
---

Hi [~bteke],
Looked at patch [^YARN-10295.001.branch-3.2.patch].
Can you please add an explanatory comment and a jira reference (YARN-10295) 
above the if block as well, so that the variable is kept as is?
Just by seeing a variable (reservedContainer) used in only two places right 
below its declaration, someone could easily be tempted to refactor and inline 
it, and we would be back to square one; nobody would notice it was a bad move, 
because it looks like an innocent change.
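
For illustration, one possible shape of such a guarding comment above the declaration (hypothetical wording, not necessarily the text used in the committed patch):

{code:java}
// YARN-10295: do NOT inline this variable into the if/log below.
// getReservedContainer() must be read only once here; the async scheduler
// thread can race with an unreserve, so a second call may return null and
// the chained getContainerId() call would throw an NPE.
RMContainer reservedContainer = node.getReservedContainer();
{code}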







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-05 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126545#comment-17126545
 ] 

Adam Antal commented on YARN-10295:
---

Thanks for the explanation [~bteke]. LGTM (non-binding).







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-04 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126050#comment-17126050
 ] 

Benjamin Teke commented on YARN-10295:
--

[~adam.antal],

Yes, it is indeed guarded by that isDebugEnabled check; I forgot to mention 
that in the description, so I updated it. Some lines are purposely omitted 
from the log extract above, and it was copied from our internal test system 
(the error is quite tricky to reproduce), which is why it doesn't show any 
debug-level logs and why the line numbers in the exception don't match.







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-04 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125968#comment-17125968
 ] 

Adam Antal commented on YARN-10295:
---

Thanks for the investigation [~bteke]. 

I have a question here: the {{getReservedContainer()}} call is guarded by a 
{{LOG.isDebugEnabled()}} block, so this issue should only happen when DEBUG 
logging is turned on. The log extract attached in the description does not 
show any debug-level lines, which makes me a little suspicious: are you sure 
the NPE happened at that exact line? If I knew you had omitted the DEBUG 
lines, I'd be reassured.
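
For reference, a condensed view of why the debug guard matters, simplified from the snippet quoted in the issue description: the second, unsafe {{getReservedContainer()}} call is only reached when debug logging is enabled, so with INFO-level logging the race cannot surface as an NPE at that line.

{code:java}
if (node.getReservedContainer() != null) {   // first read: non-null
  if (LOG.isDebugEnabled()) {                // only entered at DEBUG level
    // second read: may be null by now if the container was just unreserved,
    // so the chained getContainerId() call throws the NPE
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}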







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-06-02 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123844#comment-17123844
 ] 

Benjamin Teke commented on YARN-10295:
--

Test issues are related to YARN-10249, not this patch.







[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-05-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119986#comment-17119986
 ] 

Hadoop QA commented on YARN-10295:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
11s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 52s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
38s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
36s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  3s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}397m 44s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}463m 52s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSchedulingRequestUpdate
 |
|   | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector |
|   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestParentQueue |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.policy.TestFairOrderingPolicy |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestSchedulingRequestContainerAllocation
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing |
|   | 

[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-05-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119698#comment-17119698
 ] 

Hadoop QA commented on YARN-10295:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 16m  
1s{color} | {color:red} Docker failed to build yetus/hadoop:a6371bfdb8c. 
{color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10295 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13004349/YARN-10295.001.branch-3.1.patch
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/26084/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.





