[jira] [Comment Edited] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls
[ https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173358#comment-17173358 ]

Muhammad Samir Khan edited comment on YARN-10390 at 8/15/20, 1:30 AM:
--

Worth mentioning that I had to repeat some of the tests because of java.lang.IllegalArgumentException: Comparison method violates its general contract! Searched and found that there are existing jiras with same problem, e.g. YARN-8764, YARN-10178. Have not investigated further. Adding stack trace for one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf$CapacitySchedulerPerf.allocateContainersToNode(TestCapacitySchedulerPerf.java:90)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1522)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:571)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:604)
{quote}

was (Author: samkhan):
Worth mentioning that I ran some of the tests again because of java.lang.IllegalArgumentException: Comparison method violates its general contract! Searched and found that there are [existing jiras|https://issues.apache.org/jira/browse/YARN-10178?jql=project%20%3D%20YARN%20AND%20text%20~%20%22comparison%20method%22] with same problem, e.g. YARN-8764, YARN-10178. Have not investigated further. Adding stack trace for one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
    at org.apache.hadoop.yarn.server.resourcemanager.sche
[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls
[ https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173358#comment-17173358 ]

Muhammad Samir Khan commented on YARN-10390:

Worth mentioning that I ran some of the tests again because of java.lang.IllegalArgumentException: Comparison method violates its general contract! Searched and found that there are [existing jiras|https://issues.apache.org/jira/browse/YARN-10178?jql=project%20%3D%20YARN%20AND%20text%20~%20%22comparison%20method%22] with same problem, e.g. YARN-8764, YARN-10178. Have not investigated further. Adding stack trace for one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf$CapacitySchedulerPerf.allocateContainersToNode(TestCapacitySchedulerPerf.java:90)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1522)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:571)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:604)
{quote}

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, capacityscheduler
> Reporter: Muhammad Samir Khan
> Assignee: Muhammad Samir Khan
> Priority: Major
> Attachments: user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in leafQueue.assignContainers call
> to avoid repeating some steps. This cache can be retained across the calls.
> Will put up a PR soon. Profiling was done using the proposed changes in
> TestCapacitySchedulerPerf.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
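TimSort throws the exception in the trace above when a comparator gives inconsistent answers within a single sort. The comment does not identify the cause, so the following is only a sketch of one generic way to rule out a common source of such inconsistency, under the assumption that the sort keys can change while the sort is running: read each key once into an immutable snapshot and sort the snapshots. `QueueSnapshotSort`, `Queue`, `Snapshot`, and `usedCapacity` are hypothetical names for illustration and are not taken from the YARN code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class QueueSnapshotSort {

  /** Hypothetical stand-in for a queue whose usage may change concurrently. */
  static final class Queue {
    final String name;
    volatile double usedCapacity;

    Queue(String name, double usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }

  /** Immutable pairing of a queue with the usage value, read exactly once. */
  static final class Snapshot {
    final Queue queue;
    final double usedCapacity;

    Snapshot(Queue queue) {
      this.queue = queue;
      this.usedCapacity = queue.usedCapacity;
    }
  }

  /** Returns queue names ordered by a frozen copy of usedCapacity, then by name. */
  static List<String> sortedNames(List<Queue> queues) {
    List<Snapshot> snapshots = new ArrayList<>();
    for (Queue q : queues) {
      snapshots.add(new Snapshot(q));
    }
    // The comparator reads only immutable snapshot fields, so its answers
    // stay consistent for the whole sort even if the queues are updated.
    snapshots.sort(Comparator
        .comparingDouble((Snapshot s) -> s.usedCapacity)
        .thenComparing(s -> s.queue.name));
    return snapshots.stream()
        .map(s -> s.queue.name)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Queue> queues = new ArrayList<>();
    queues.add(new Queue("b", 0.7));
    queues.add(new Queue("a", 0.2));
    queues.add(new Queue("c", 0.2));
    System.out.println(sortedNames(queues));
  }
}
```

The trade-off is an extra allocation per element on every sort, in exchange for an ordering that cannot be disturbed by concurrent updates to the live objects.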
[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls
[ https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173349#comment-17173349 ]

Muhammad Samir Khan commented on YARN-10390:

There is a typo in the attached pdf. The base profile was run on trunk branch, not master.
[jira] [Created] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls
Muhammad Samir Khan created YARN-10390:
--

Summary: LeafQueue: retain user limits cache across assignContainers() calls
Key: YARN-10390
URL: https://issues.apache.org/jira/browse/YARN-10390
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, capacityscheduler
Reporter: Muhammad Samir Khan
Assignee: Muhammad Samir Khan
Attachments: user limit caching profile.pdf

Currently, user limits are cached locally in leafQueue.assignContainers call to avoid repeating some steps. This cache can be retained across the calls. Will put up a PR soon. Profiling was done using the proposed changes in TestCapacitySchedulerPerf.
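The issue proposes keeping the user-limit cache alive between assignContainers() calls instead of rebuilding it on every call. The issue text does not say how stale entries would be detected, so the sketch below assumes a simple version counter that is bumped whenever anything feeding into the limits changes. `UserLimitCache`, `invalidate`, `getLimit`, and the compute function are hypothetical names and do not reflect the LeafQueue API or the eventual patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

public class UserLimitCache {
  private final Map<String, Long> limits = new HashMap<>();
  private long cachedVersion = 0;   // version the cached entries belong to
  private long currentVersion = 0;  // bumped on every invalidation
  private int computations = 0;     // counts expensive recomputations

  /** Call when anything that feeds into the limits changes. */
  public synchronized void invalidate() {
    currentVersion++;
  }

  /** Returns the cached limit for the user, computing it only when missing or stale. */
  public synchronized long getLimit(String user, ToLongFunction<String> compute) {
    if (cachedVersion != currentVersion) {
      // Entries were computed against older state; drop them all.
      limits.clear();
      cachedVersion = currentVersion;
    }
    Long cached = limits.get(user);
    if (cached == null) {
      cached = compute.applyAsLong(user);
      computations++;
      limits.put(user, cached);
    }
    return cached;
  }

  public synchronized int computations() {
    return computations;
  }

  public static void main(String[] args) {
    UserLimitCache cache = new UserLimitCache();
    cache.getLimit("alice", u -> 100L); // computed
    cache.getLimit("alice", u -> 100L); // served from the cache
    cache.invalidate();
    cache.getLimit("alice", u -> 120L); // recomputed after invalidation
    System.out.println(cache.computations());
  }
}
```

With this shape, repeated calls between invalidations pay for the computation once per user, which is the saving the issue describes; the cost is that every state change affecting limits must remember to call the invalidation hook.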
[jira] [Commented] (YARN-9702) Backport YARN-5788 to branch-2.8
[ https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893177#comment-16893177 ]

Muhammad Samir Khan commented on YARN-9702:
---

The unit tests are failing for me on branch-2.8 without the patch as well.

> Backport YARN-5788 to branch-2.8
>
>
> Key: YARN-9702
> URL: https://issues.apache.org/jira/browse/YARN-9702
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.8.6
> Reporter: Muhammad Samir Khan
> Assignee: Muhammad Samir Khan
> Priority: Major
> Attachments: YARN-9702-branch-2.8.001.patch
>
>
> Backport YARN-5788 to branch-2.8.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9702) Backport YARN-5788 to branch-2.8
[ https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893024#comment-16893024 ]

Muhammad Samir Khan commented on YARN-9702:
---

The patch is from cherry-picking commit 7d2d8d25ba0 for YARN-5788 to branch-2.8 and manually resolving conflicts.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893023#comment-16893023 ]

Muhammad Samir Khan commented on YARN-9596:
---

Created YARN-9702 for backporting YARN-5788 to branch-2.8.

> QueueMetrics has incorrect metrics when labelled partitions are involved
>
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.8.0, 3.3.0
> Reporter: Muhammad Samir Khan
> Assignee: Muhammad Samir Khan
> Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch,
> YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch,
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default
> partition. However, the metrics are incorrect when labelled partitions are
> involved.
> Steps to reproduce
> ==
> # Configure capacity-scheduler.xml with label configuration
> # Add label "test" to cluster and replace label on node1 to be "test"
> # Note down "totalMB" at /ws/v1/cluster/metrics
> # Start first job on test queue.
> # Start second job on default queue (does not work if the order of two jobs
> is swapped).
> # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternately:
> In
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10*GB,
>     rootQueue.getMetrics().getAvailableMB() +
>     rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.
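The assertion in the description expects the root queue's available plus allocated memory to equal the capacity of the default partition only. The arithmetic behind the expected 10*GB can be sketched with a toy model, assuming the two 10 GB nodes from the description with node1 carrying the "test" label. `PartitionTotals`, its method, and the node names are hypothetical, `GB = 1024` is assumed to mean 1024 MB, and none of this reflects how QueueMetrics is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionTotals {
  // Assumption: values are in MB, so one GB is 1024.
  static final long GB = 1024;

  /** Sums the memory of nodes that belong to the default (empty-label) partition. */
  static long defaultPartitionTotalMB(Map<String, String> nodeLabels,
                                      Map<String, Long> nodeMemoryMB) {
    long total = 0;
    for (Map.Entry<String, Long> node : nodeMemoryMB.entrySet()) {
      String label = nodeLabels.getOrDefault(node.getKey(), "");
      if (label.isEmpty()) {
        total += node.getValue();
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> nodeMemoryMB = new LinkedHashMap<>();
    nodeMemoryMB.put("node1", 10 * GB);
    nodeMemoryMB.put("node2", 10 * GB);

    Map<String, String> nodeLabels = new LinkedHashMap<>();
    nodeLabels.put("node1", "test"); // node1 leaves the default partition

    long total = defaultPartitionTotalMB(nodeLabels, nodeMemoryMB);
    long allocated = 2 * GB;              // hypothetical allocation on node2
    long available = total - allocated;
    // Whatever the allocation, available + allocated should equal the total.
    System.out.println(available + allocated);
  }
}
```

In this model only node2 counts toward the default partition, so the sum stays at 10 GB regardless of how much is allocated, which is the invariant the quoted assertion checks.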
[jira] [Updated] (YARN-9702) Backport YARN-5788 to branch-2.8
[ https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated YARN-9702:
--
Attachment: YARN-9702-branch-2.8.001.patch
[jira] [Created] (YARN-9702) Backport YARN-5788 to branch-2.8
Muhammad Samir Khan created YARN-9702:
-

Summary: Backport YARN-5788 to branch-2.8
Key: YARN-9702
URL: https://issues.apache.org/jira/browse/YARN-9702
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler
Affects Versions: 2.8.6
Reporter: Muhammad Samir Khan
Assignee: Muhammad Samir Khan

Backport YARN-5788 to branch-2.8.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892966#comment-16892966 ]

Muhammad Samir Khan commented on YARN-9596:
---

Also seeing the unit test failures and errors on branch-2.8 without the patch.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892265#comment-16892265 ]

Muhammad Samir Khan commented on YARN-9596:
---

Posted a patch for 2.8. It also includes a workaround in the unit test for race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375). For 2.8, we will also have to backport YARN-5788. Shall I post a patch here or should that be tracked separately?
[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated YARN-9596:
--
Attachment: YARN-9596-branch-2.8.005.patch
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892191#comment-16892191 ]

Muhammad Samir Khan commented on YARN-9596:
---

The remaining two unit tests in TestNodeLabelContainerAllocation should have been fixed with YARN-7466 addendum patch but seems to be still broken in branch-3.0.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892122#comment-16892122 ]

Muhammad Samir Khan commented on YARN-9596:
---

YARN-4901 fixes some of the unit test failures but it is not in branch-3.0.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892081#comment-16892081 ]

Muhammad Samir Khan commented on YARN-9596:
---

The findbugs warnings are from branch-3.0 (pre-patch). The unit test failures are also happening in branch-3.0. They just happen a little later since the assert statement is later in branch-3.0. Some of the tests fail if I run all tests in TestNodeLabelContainerAllocation but not if I run the specific tests by themselves.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891957#comment-16891957 ]

Muhammad Samir Khan commented on YARN-9596:
---

Looking at the UT failures.
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891432#comment-16891432 ] Muhammad Samir Khan commented on YARN-9596: --- {quote}[~eepayne] yes, the patch applies cleanly with the --3way option on git apply. For branch-2.8 though the unit test fails because of a race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375) {quote} Due to whitespace changes between patch 002 and patch 003, the latest patch no longer applies cleanly to branch-3.0 and earlier versions. Uploaded a patch for that. > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, > YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). 
> Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following lines at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them has a non-default > label. The test will also fail against a 20*GB check.
[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9596: -- Attachment: YARN-9596-branch-3.0.004.patch > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, > YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. 
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890284#comment-16890284 ] Muhammad Samir Khan commented on YARN-9596: --- CSQueueUtils#updateUsedCapacity is called before getMaxAvailableResourceToQueuePartition, so any check for the correct partition should go in CSQueueUtils#updateQueueStatistics, where it covers both methods. > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, > YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). 
> Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following lines at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them has a non-default > label. The test will also fail against a 20*GB check.
[jira] [Updated] (YARN-9671) Improve Locality Scheduling when cluster is busy
[ https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9671: -- Attachment: YARN-9671.001.patch > Improve Locality Scheduling when cluster is busy > > > Key: YARN-9671 > URL: https://issues.apache.org/jira/browse/YARN-9671 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: YARN-9671.001.patch > > > When a cluster is very busy, scheduling opportunities are few and far > between. Scheduling opportunities are how an application knows when to give > up looking for decent locality. > It doesn't make sense to work hard waiting for locality when the odds of it > coming are very small and it may actually take a very long time to actually > give up. > This causes the priority of queues to be violated which is the last thing we > want to do when the cluster is full. > Add a mode to disable skipping locality when cluster is busy. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9596: -- Attachment: YARN-9596.003.patch > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, > YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. 
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887198#comment-16887198 ] Muhammad Samir Khan commented on YARN-9596: --- Updated with changes. > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, > YARN-9596.003.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. 
[jira] [Commented] (YARN-9671) Improve Locality Scheduling when cluster is busy
[ https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885468#comment-16885468 ] Muhammad Samir Khan commented on YARN-9671: --- Edited description: will handle priority inversion metrics separately in another jira. > Improve Locality Scheduling when cluster is busy > > > Key: YARN-9671 > URL: https://issues.apache.org/jira/browse/YARN-9671 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > > When a cluster is very busy, scheduling opportunities are few and far > between. Scheduling opportunities are how an application knows when to give > up looking for decent locality. > It doesn't make sense to work hard waiting for locality when the odds of it > coming are very small and it may actually take a very long time to actually > give up. > This causes the priority of queues to be violated which is the last thing we > want to do when the cluster is full. > Add a mode to disable skipping locality when cluster is busy. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9671) Improve Locality Scheduling when cluster is busy
[ https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9671: -- Description: When a cluster is very busy, scheduling opportunities are few and far between. Scheduling opportunities are how an application knows when to give up looking for decent locality. It doesn't make sense to work hard waiting for locality when the odds of it coming are very small and it may actually take a very long time to actually give up. This causes the priority of queues to be violated which is the last thing we want to do when the cluster is full. Add a mode to disable skipping locality when cluster is busy. was: When a cluster is very busy, scheduling opportunities are few and far between. Scheduling opportunities are how an application knows when to give up looking for decent locality. It doesn't make sense to work hard waiting for locality when the odds of it coming are very small and it may actually take a very long time to actually give up. This causes the priority of queues to be violated which is the last thing we want to do when the cluster is full. * Add metrics for queue priority inversions. * Add mode to disable skipping locality when cluster is busy. > Improve Locality Scheduling when cluster is busy > > > Key: YARN-9671 > URL: https://issues.apache.org/jira/browse/YARN-9671 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > > When a cluster is very busy, scheduling opportunities are few and far > between. Scheduling opportunities are how an application knows when to give > up looking for decent locality. > It doesn't make sense to work hard waiting for locality when the odds of it > coming are very small and it may actually take a very long time to actually > give up. > This causes the priority of queues to be violated which is the last thing we > want to do when the cluster is full. 
> Add a mode to disable skipping locality when cluster is busy.
[jira] [Created] (YARN-9671) Improve Locality Scheduling when cluster is busy
Muhammad Samir Khan created YARN-9671:
-----------------------------------------

             Summary: Improve Locality Scheduling when cluster is busy
                 Key: YARN-9671
                 URL: https://issues.apache.org/jira/browse/YARN-9671
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Muhammad Samir Khan
            Assignee: Muhammad Samir Khan

When a cluster is very busy, scheduling opportunities are few and far between. Scheduling opportunities are how an application knows when to give up looking for decent locality.

It doesn't make sense to work hard waiting for locality when the odds of it coming are very small and it may actually take a very long time to actually give up.

This causes the priority of queues to be violated which is the last thing we want to do when the cluster is full.

* Add metrics for queue priority inversions.
* Add mode to disable skipping locality when cluster is busy.
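For illustration only, the trade-off described above can be sketched in a few lines of Java. This is a toy model and not the CapacityScheduler code: the class LocalityDelaySketch, its methods, and the skipDelayWhenBusy flag are all hypothetical names, and how the scheduler would decide that a cluster is "busy" is left out entirely.

```java
// Toy model of locality delay. All names here are hypothetical and are not
// part of the Hadoop YARN API.
class LocalityDelaySketch {

    // Decide whether an application should stop waiting for good locality and
    // accept a non-local container on the current scheduling opportunity.
    static boolean acceptNonLocal(int missedOpportunities, int localityDelay,
                                  boolean clusterBusy, boolean skipDelayWhenBusy) {
        if (clusterBusy && skipDelayWhenBusy) {
            // Assumed "busy" mode: opportunities are rare, so do not skip any.
            return true;
        }
        // Default behaviour in this sketch: give up on locality only after
        // enough opportunities have been missed.
        return missedOpportunities >= localityDelay;
    }

    // Count how many scheduling opportunities are skipped before the
    // application accepts a non-local container.
    static int offersSkipped(int localityDelay, boolean clusterBusy,
                             boolean skipDelayWhenBusy) {
        int missed = 0;
        while (!acceptNonLocal(missed, localityDelay, clusterBusy, skipDelayWhenBusy)) {
            missed++;
        }
        return missed;
    }

    public static void main(String[] args) {
        System.out.println("busy, mode off: " + offersSkipped(40, true, false));
        System.out.println("busy, mode on:  " + offersSkipped(40, true, true));
    }
}
```

In the sketch, a busy cluster with the mode off still skips as many opportunities as the configured delay, which is the situation where lower-priority queues can get ahead while the application waits.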
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868693#comment-16868693 ] Muhammad Samir Khan commented on YARN-9596: --- [~eepayne] yes, the patch applies cleanly with the --3way option on git apply. For branch-2.8 though the unit test fails because of a race condition in AsyncDispatcher (see [YARN-3878|[https://issues.apache.org/jira/browse/]YARN-3878], [YARN-5436|[https://issues.apache.org/jira/browse/]YARN-5436], and [YARN-5375|[https://issues.apache.org/jira/browse/]YARN-5375]) > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). 
> Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following lines at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them has a non-default > label. The test will also fail against a 20*GB check.
[jira] [Comment Edited] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868693#comment-16868693 ] Muhammad Samir Khan edited comment on YARN-9596 at 6/20/19 4:34 PM: [~eepayne] yes, the patch applies cleanly with the --3way option on git apply. For branch-2.8 though the unit test fails because of a race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375) was (Author: samkhan): [~eepayne] yes, the patch applies cleanly with the --3way option on git apply. For branch-2.8 though the unit test fails because of a race condition in AsyncDispatcher (see [YARN-3878|[https://issues.apache.org/jira/browse/]YARN-3878], [YARN-5436|[https://issues.apache.org/jira/browse/]YARN-5436], and [YARN-5375|[https://issues.apache.org/jira/browse/]YARN-5375]) > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). 
> Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following lines at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them has a non-default > label. The test will also fail against a 20*GB check.
[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9596: -- Affects Version/s: 3.3.0 2.8.0 > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.8.0, 3.3.0 >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858834#comment-16858834 ] Muhammad Samir Khan commented on YARN-9596: --- [~Naganarasimha] [~maniraj...@gmail.com] this is related to YARN-6467. Can you please take a look? Thanks. > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. 
[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated YARN-9596: -- Attachment: YARN-9596.002.patch > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan reassigned YARN-9596: - Assignee: Muhammad Samir Khan > QueueMetrics has incorrect metrics when labelled partitions are involved > > > Key: YARN-9596 > URL: https://issues.apache.org/jira/browse/YARN-9596 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Muhammad Samir Khan >Assignee: Muhammad Samir Khan >Priority: Major > Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot > 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch > > > After YARN-6467, QueueMetrics should only be tracking metrics for the default > partition. However, the metrics are incorrect when labelled partitions are > involved. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Add label "test" to cluster and replace label on node1 to be "test" > # Note down "totalMB" at > /ws/v1/cluster/metrics > # Start first job on test queue. > # Start second job on default queue (does not work if the order of two jobs > is swapped). > # While the two applications are running, the "totalMB" at > /ws/v1/cluster/metrics will go down by > the amount of MB used by the first job (screenshots attached). > Alternately: > In > TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), > add the following line at the end of the test before rm1.close(): > CSQueue rootQueue = cs.getRootQueue(); > assertEquals(10*GB, > rootQueue.getMetrics().getAvailableMB() + > rootQueue.getMetrics().getAllocatedMB()); > There are two nodes of 10GB each and only one of them have a non-default > label. The test will also fail against 20*GB check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
Muhammad Samir Khan created YARN-9596:
-----------------------------------------

             Summary: QueueMetrics has incorrect metrics when labelled partitions are involved
                 Key: YARN-9596
                 URL: https://issues.apache.org/jira/browse/YARN-9596
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Muhammad Samir Khan
         Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 2019-06-03 at 4.44.15 PM.png

After YARN-6467, QueueMetrics should only be tracking metrics for the default partition. However, the metrics are incorrect when labelled partitions are involved.

Steps to reproduce
==================
# Configure capacity-scheduler.xml with label configuration
# Add label "test" to cluster and replace label on node1 to be "test"
# Note down "totalMB" at /ws/v1/cluster/metrics
# Start first job on test queue.
# Start second job on default queue (does not work if the order of two jobs is swapped).
# While the two applications are running, the "totalMB" at /ws/v1/cluster/metrics will go down by the amount of MB used by the first job (screenshots attached).

Alternately:
In TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), add the following lines at the end of the test before rm1.close():

CSQueue rootQueue = cs.getRootQueue();
assertEquals(10*GB,
    rootQueue.getMetrics().getAvailableMB() +
    rootQueue.getMetrics().getAllocatedMB());

There are two nodes of 10GB each and only one of them has a non-default label. The test will also fail against a 20*GB check.
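To make the expected invariant concrete, here is a minimal Java sketch of metrics that track only the default partition. It is a toy model and not the Hadoop QueueMetrics code: the class PartitionMetricsSketch and its methods are hypothetical, and representing the default partition by the empty string is an assumption made for the example.

```java
// Toy model only: this class is hypothetical and is not the Hadoop
// QueueMetrics API. It shows the invariant the test above checks.
class PartitionMetricsSketch {
    // Assumption for this sketch: the default partition is the empty label.
    static final String DEFAULT_PARTITION = "";
    static final long GB = 1024; // MB per GB

    private long availableMB;
    private long allocatedMB;

    // A node joins the cluster; only default-partition capacity is counted.
    void addNode(String partition, long capacityMB) {
        if (DEFAULT_PARTITION.equals(partition)) {
            availableMB += capacityMB;
        }
    }

    // A container is allocated; labelled partitions must not touch the metrics.
    void allocate(String partition, long mb) {
        if (!DEFAULT_PARTITION.equals(partition)) {
            return;
        }
        availableMB -= mb;
        allocatedMB += mb;
    }

    long totalMB() {
        return availableMB + allocatedMB;
    }

    // Two 10 GB nodes, one labelled "test"; one job on each partition.
    static long runScenario() {
        PartitionMetricsSketch metrics = new PartitionMetricsSketch();
        metrics.addNode("test", 10 * GB);
        metrics.addNode(DEFAULT_PARTITION, 10 * GB);
        metrics.allocate("test", 2 * GB);            // first job, labelled queue
        metrics.allocate(DEFAULT_PARTITION, 1 * GB); // second job, default queue
        return metrics.totalMB();
    }

    public static void main(String[] args) {
        System.out.println("total stays at 10 GB: " + (runScenario() == 10 * GB));
    }
}
```

In this model the total stays at 10 GB no matter what runs on the labelled node. The reported bug corresponds to the labelled allocation leaking into availableMB, which would make the sum drop by the first job's usage.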
[jira] [Updated] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured
[ https://issues.apache.org/jira/browse/YARN-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated YARN-6834:
    Attachment: YARN-6834.001.patch

I am not sure the attached patch is the best way to solve the issue, but I am putting it up for comments.

> A container request with only racks specified and relax locality set to false is never honoured
>
> Key: YARN-6834
> URL: https://issues.apache.org/jira/browse/YARN-6834
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Muhammad Samir Khan
> Attachments: YARN-6834.001.patch, yarn-6834-unittest.patch
>
> A patch for a unit test is attached to reproduce the issue. It creates a container request with only racks specified (nodes=null) and relax locality set to false. With the node-locality-delay conf set appropriately, we wait indefinitely for a container allocation and the test times out.
>
> My understanding of what causes this issue is as follows. The RegularContainerAllocator delays a rack-local allocation based on the node-locality-delay parameter. This delay is based on missed opportunities. However, the corresponding off-switch request is skipped but does not count towards a missed opportunity (because relax locality is set to false). So the allocator waits indefinitely.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
[ https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100280#comment-16100280 ]

Muhammad Samir Khan commented on YARN-6867:

The patch in YARN-4719 solves the problem.

> AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
>
> Key: YARN-6867
> URL: https://issues.apache.org/jira/browse/YARN-6867
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Reporter: Muhammad Samir Khan
> Assignee: Nathan Roberts
> Attachments: YARN-6867.001.patch
>
> AbstractYarnScheduler has a configured wait time during which it reports the maximum resources from the configuration instead of the actual resources available in the cluster. However, the first query after the wait time expires is still answered with the configured maximum resources instead of the actual maximum resources. This can cause an app submission to fail with an InvalidResourceRequestException (will attach a unit test in the patch), since the maximum resources reported by the RM differ from the value it sanity-checks against at app submission.
[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
[ https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100276#comment-16100276 ]

Muhammad Samir Khan commented on YARN-6867:

Sorry, it looks like the problem was solved in trunk via YARN-4719. I should have checked on trunk before proceeding. Closing the JIRA now.
[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
[ https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100208#comment-16100208 ]

Muhammad Samir Khan commented on YARN-6867:

Requesting [~rkanter] and [~kasha] for comments.
[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
[ https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100193#comment-16100193 ]

Muhammad Samir Khan commented on YARN-6867:

If the application is submitted before the wait time has expired, the sanity check for resources passes and the app submission goes through. However, if the requested resources exceed what is available in the cluster, the app will "hang" forever waiting for the AM container to be allocated. I think YARN-56 captures this issue.
[jira] [Updated] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
[ https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated YARN-6867:
    Attachment: YARN-6867.001.patch

Added a new unit test in TestRM to demonstrate how an app submission can fail with an InvalidResourceRequestException. Changed AbstractYarnScheduler to report the correct value after the wait time is over, and added a unit test to TestAbstractYarnScheduler. This does not handle all the corner cases, e.g. when the RM reported the max values before the wait time was over but the app was submitted after the wait time had expired. It should, however, handle the more reproducible case.
[jira] [Created] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
Muhammad Samir Khan created YARN-6867:

Summary: AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime
Key: YARN-6867
URL: https://issues.apache.org/jira/browse/YARN-6867
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Reporter: Muhammad Samir Khan

AbstractYarnScheduler has a configured wait time during which it reports the maximum resources from the configuration instead of the actual resources available in the cluster. However, the first query after the wait time expires is still answered with the configured maximum resources instead of the actual maximum resources. This can cause an app submission to fail with an InvalidResourceRequestException (will attach a unit test in the patch), since the maximum resources reported by the RM differ from the value it sanity-checks against at app submission.
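The intended reporting behaviour can be sketched roughly as follows. This is a simplified illustration, not the AbstractYarnScheduler code: the class MaxAllocationSketch, its fields, the use of plain MB values instead of Resource objects, and the fallback to the configured value when no node has registered are all assumptions.

```java
// Hypothetical sketch of the intended behaviour; names are illustrative only.
final class MaxAllocationSketch {
    private final long configuredMaxMb;
    private final long waitTimeMs;
    private final long startTimeMs;
    private long largestNodeMb; // largest node registered so far, 0 if none

    MaxAllocationSketch(long configuredMaxMb, long waitTimeMs, long startTimeMs) {
        this.configuredMaxMb = configuredMaxMb;
        this.waitTimeMs = waitTimeMs;
        this.startTimeMs = startTimeMs;
    }

    void nodeAdded(long nodeMb) {
        largestNodeMb = Math.max(largestNodeMb, nodeMb);
    }

    // Within the wait time, report the configured maximum. From the first
    // query at or after expiry, report the actual maximum (capped by config).
    long reportedMaxMb(long nowMs) {
        boolean withinWaitTime = nowMs - startTimeMs < waitTimeMs;
        if (withinWaitTime || largestNodeMb == 0) {
            return configuredMaxMb; // assumption: no nodes means fall back to config
        }
        return Math.min(configuredMaxMb, largestNodeMb);
    }

    public static void main(String[] args) {
        MaxAllocationSketch s = new MaxAllocationSketch(8192, 10_000, 0);
        s.nodeAdded(4096);
        System.out.println("during wait: " + s.reportedMaxMb(5_000));
        System.out.println("after wait:  " + s.reportedMaxMb(10_000));
    }
}
```

The point of the sketch is that the very first query after expiry already returns the actual value, so the value a client sees matches what a submission-time sanity check would use.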
[jira] [Updated] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured
[ https://issues.apache.org/jira/browse/YARN-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Muhammad Samir Khan updated YARN-6834:
    Attachment: yarn-6834-unittest.patch
[jira] [Created] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured
Muhammad Samir Khan created YARN-6834:

Summary: A container request with only racks specified and relax locality set to false is never honoured
Key: YARN-6834
URL: https://issues.apache.org/jira/browse/YARN-6834
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler
Reporter: Muhammad Samir Khan

A patch for a unit test is attached to reproduce the issue. It creates a container request with only racks specified (nodes=null) and relax locality set to false. With the node-locality-delay conf set appropriately, we wait indefinitely for a container allocation and the test times out.

My understanding of what causes this issue is as follows. The RegularContainerAllocator delays a rack-local allocation based on the node-locality-delay parameter. This delay is based on missed opportunities. However, the corresponding off-switch request is skipped but does not count towards a missed opportunity (because relax locality is set to false). So the allocator waits indefinitely.
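The suspected cause can be illustrated with a small simulation. This is a deliberately simplified model of the delay logic described above, not the RegularContainerAllocator code: the class LocalityDelaySketch, the method passesUntilRackLocal, and the countSkippedOffSwitch flag (one conceivable fix, not necessarily what the attached patch does) are hypothetical.

```java
// Hypothetical, simplified model of the delay described in this issue.
final class LocalityDelaySketch {
    /**
     * Returns the scheduling pass on which a rack-local allocation becomes
     * allowed, or -1 if that never happens within maxPasses.
     */
    static int passesUntilRackLocal(int nodeLocalityDelay, boolean relaxLocality,
                                    boolean countSkippedOffSwitch, int maxPasses) {
        int missedOpportunities = 0;
        for (int pass = 1; pass <= maxPasses; pass++) {
            // Rack-local allocation is permitted only after enough missed opportunities.
            if (missedOpportunities >= nodeLocalityDelay) {
                return pass;
            }
            // With relax locality false the off-switch request is skipped.
            // In the reported behaviour the skip is not counted as a miss.
            if (relaxLocality || countSkippedOffSwitch) {
                missedOpportunities++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("skip not counted: " + passesUntilRackLocal(3, false, false, 1000));
        System.out.println("skip counted:     " + passesUntilRackLocal(3, false, true, 1000));
    }
}
```

With relax locality false and the skip left uncounted, the counter never moves and the threshold is never reached, which matches the indefinite wait seen in the unit test.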