[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261274#comment-17261274 ]

Szilard Nemeth commented on YARN-10295:
---------------------------------------

Hi [~wangda],
There's a dedicated jira to deal with test failures on branch-3.2: YARN-10249.
There are other commits that went in under similar circumstances: https://issues.apache.org/jira/browse/YARN-10194?focusedCommentId=17093663&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17093663
Nowadays the test results look better for branch-3.2; see this for example: https://issues.apache.org/jira/browse/YARN-10528?focusedCommentId=17252715&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#

> CapacityScheduler NPE can cause apps to get stuck without resources
> --------------------------------------------------------------------
>
>                 Key: YARN-10295
>                 URL: https://issues.apache.org/jira/browse/YARN-10295
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>             Fix For: 3.2.2, 3.1.5
>
>         Attachments: YARN-10295.001.branch-3.1.patch, YARN-10295.001.branch-3.2.patch, YARN-10295.002.branch-3.1.patch, YARN-10295.002.branch-3.2.patch
>
>
> When the CapacityScheduler asynchronous scheduling is enabled and the log level is set to DEBUG, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource=
> 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label=
> 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough memory, so after a short while it gets unreserved.
> However, because the scheduler thread runs asynchronously, it might have entered the following if block, located at [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], because at the time _node.getReservedContainer()_ wasn't null. Calling it a second time to get the ApplicationAttemptId would throw an NPE, as the container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping scheduling since node " + node.getNodeID()
>         + " is reserved by application " + node.getReservedContainer()
>         .getContainerId().getApplicationAttemptId());
>   }
>   return null;
> }
> {code}
> A fix would be to store the container object in a local variable before the if block.
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664, which indirectly fixed this.
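For reference, a minimal sketch of the fix described above: read the reserved container into a local variable once, so the debug logging cannot race with a concurrent unreserve. This sketches the idea and is not necessarily the exact committed patch.

{code:java}
// Read the reserved container exactly once; the async scheduler may
// unreserve it between the null check and a later re-read (YARN-10295).
RMContainer reservedContainer = node.getReservedContainer();
// Do not schedule if there are any reservations to fulfill on the node
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}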
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249870#comment-17249870 ]

Wangda Tan commented on YARN-10295:
-----------------------------------

[~snemeth], I just came across the issue. I saw a bunch of failed unit tests in the above Jenkins output; are they patch related?
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130896#comment-17130896 ]

Szilard Nemeth commented on YARN-10295:
---------------------------------------

Hi [~bteke],
The latest patches LGTM; committed them to their respective branches. Resolving the jira.
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126981#comment-17126981 ]

Hadoop QA commented on YARN-10295:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 41s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 30s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 29s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 0s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}393m 53s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}465m 26s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestParentQueue |
| | hadoop.yarn.server.resourcemanager.scheduler.policy.TestFairOrderingPolicy |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestSchedulingRequestContainerAllocation |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer |
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126613#comment-17126613 ]

Benjamin Teke commented on YARN-10295:
--------------------------------------

Hi [~snemeth],
Thanks, good point. I added an explanatory comment to the branch-3.1 and branch-3.2 patches.
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126577#comment-17126577 ]

Szilard Nemeth commented on YARN-10295:
---------------------------------------

Hi [~bteke],
Looked at patch [^YARN-10295.001.branch-3.2.patch]. Can you please add an explanatory comment above the if block, with a jira reference as well (YARN-10295), asking to keep this variable as is? Just by seeing a variable (reservedContainer) used in two places right below its declaration, someone could be tempted to refactor by inlining the variable, and we would be back to square one; nobody would notice it was a bad move because it seems like an innocent touch.
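Something along these lines above the declaration would do (hypothetical wording, not the actual patch text):

{code:java}
// YARN-10295: do NOT inline this variable. node.getReservedContainer() must
// be read exactly once, because the reservation can be removed concurrently
// and a second call may return null, crashing the async scheduler thread.
RMContainer reservedContainer = node.getReservedContainer();
{code}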
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126545#comment-17126545 ]

Adam Antal commented on YARN-10295:
-----------------------------------

Thanks for the explanation [~bteke]. LGTM (non-binding).
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126050#comment-17126050 ]

Benjamin Teke commented on YARN-10295:
--------------------------------------

[~adam.antal], yes, it is indeed guarded by that isDebugEnabled check; I missed mentioning that in the description, so I updated it. Some lines are purposely omitted from the log extract above, which was copied from our internal test system (the error is quite tricky to reproduce). That's why it doesn't show any debug-level logs and why the line numbers in the exception look incorrect.
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125968#comment-17125968 ]

Adam Antal commented on YARN-10295:
-----------------------------------

Thanks for the investigation [~bteke]. I have a question here: the {{getReservedContainer()}} call is guarded by a {{LOG.isDebugEnabled()}} block, so this issue should only happen when DEBUG logging is turned on. The extract of the log attached in the description does not show debug-level logs. That makes me a little bit suspicious: are you sure the NPE happened at that exact line? If I knew you had omitted the DEBUG lines, I'd be reassured.
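The failure mode under discussion is a classic check-then-act race. A self-contained toy demo of the mechanism, for illustration only ({{ReservationRaceDemo}} and its field are hypothetical stand-ins, not scheduler code):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

public class ReservationRaceDemo {
  // Stand-in for the node's mutable reserved-container reference.
  private static final AtomicReference<String> reserved =
      new AtomicReference<>("container_e10_1590502305306_0660_01_000115");

  public static void main(String[] args) throws InterruptedException {
    if (reserved.get() != null) {            // check: reservation exists
      Thread unreserver = new Thread(() -> reserved.set(null));
      unreserver.start();                    // concurrent unreserve
      unreserver.join();                     // force the bad interleaving
      // act: re-read the reference instead of reusing the first result
      String second = reserved.get();
      System.out.println("second read: " + second);  // prints: null
      // Calling second.length() here would throw a NullPointerException,
      // just like getContainerId() inside the LOG.debug(...) statement.
    }
  }
}
{code}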
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123844#comment-17123844 ]

Benjamin Teke commented on YARN-10295:
--------------------------------------

The test issues are related to YARN-10249, not this patch.
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119986#comment-17119986 ]

Hadoop QA commented on YARN-10295:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 11s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 52s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 38s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 36s{color} | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 3s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}397m 44s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}463m 52s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSchedulingRequestUpdate |
| | hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestParentQueue |
| | hadoop.yarn.server.resourcemanager.scheduler.policy.TestFairOrderingPolicy |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestSchedulingRequestContainerAllocation |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing |
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119698#comment-17119698 ]

Hadoop QA commented on YARN-10295:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 16m 1s{color} | {color:red} Docker failed to build yetus/hadoop:a6371bfdb8c. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10295 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004349/YARN-10295.001.branch-3.1.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/26084/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.