[jira] [Comment Edited] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-08-14 Thread Muhammad Samir Khan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173358#comment-17173358
 ] 

Muhammad Samir Khan edited comment on YARN-10390 at 8/15/20, 1:30 AM:
--

Worth mentioning that I had to repeat some of the tests because of a
java.lang.IllegalArgumentException: "Comparison method violates its general
contract!" A search turned up existing jiras with the same problem, e.g.
YARN-8764 and YARN-10178.

Have not investigated further. Adding the stack trace from one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf$CapacitySchedulerPerf.allocateContainersToNode(TestCapacitySchedulerPerf.java:90)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1522)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:571)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:604)
{quote}
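
For readers unfamiliar with this error: TimSort validates the total-order contract while it sorts, and a comparator whose answers change mid-sort (likely the case here, since the queue ordering policy compares capacity numbers that can mutate while the async scheduler thread is sorting) can trip it. A minimal, self-contained illustration of the failure mode, not YARN code:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BrokenComparatorDemo {
  public static void main(String[] args) {
    Random rng = new Random(0);
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      values.add(rng.nextInt());
    }
    // The comparator's answer depends on state that changes while the sort
    // is running, so TimSort may detect an inconsistent ordering and throw
    // "Comparison method violates its general contract!".
    values.sort((a, b) -> rng.nextBoolean() ? -1 : 1);
  }
}
{code}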


was (Author: samkhan):
Worth mentioning that I ran some of the tests again because of a
java.lang.IllegalArgumentException: "Comparison method violates its general
contract!" A search turned up [existing
jiras|https://issues.apache.org/jira/browse/YARN-10178?jql=project%20%3D%20YARN%20AND%20text%20~%20%22comparison%20method%22]
with the same problem, e.g. YARN-8764 and YARN-10178.

Have not investigated further. Adding the stack trace from one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf$CapacitySchedulerPerf.allocateContainersToNode(TestCapacitySchedulerPerf.java:90)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1522)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:571)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:604)
{quote}

[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-08-07 Thread Muhammad Samir Khan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173358#comment-17173358
 ] 

Muhammad Samir Khan commented on YARN-10390:


Worth mentioning that I ran some of the tests again because of a
java.lang.IllegalArgumentException: "Comparison method violates its general
contract!" A search turned up [existing
jiras|https://issues.apache.org/jira/browse/YARN-10178?jql=project%20%3D%20YARN%20AND%20text%20~%20%22comparison%20method%22]
with the same problem, e.g. YARN-8764 and YARN-10178.

Have not investigated further. Adding the stack trace from one failed run:
{quote}Exception in thread "Thread-9" java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:785)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:796)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:628)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1676)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1614)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1768)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf$CapacitySchedulerPerf.allocateContainersToNode(TestCapacitySchedulerPerf.java:90)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1522)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:571)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:604)
{quote}

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in the LeafQueue.assignContainers()
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in
> TestCapacitySchedulerPerf.






[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-08-07 Thread Muhammad Samir Khan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173349#comment-17173349
 ] 

Muhammad Samir Khan commented on YARN-10390:


There is a typo in the attached PDF: the base profile was run on the trunk
branch, not master.

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: user limit caching profile.pdf
>
>
> Currently, user limits are cached locally in the LeafQueue.assignContainers()
> call to avoid repeating some steps. This cache can be retained across calls.
> Will put up a PR soon. Profiling was done using the proposed changes in
> TestCapacitySchedulerPerf.






[jira] [Created] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-08-06 Thread Muhammad Samir Khan (Jira)
Muhammad Samir Khan created YARN-10390:
--

 Summary: LeafQueue: retain user limits cache across 
assignContainers() calls
 Key: YARN-10390
 URL: https://issues.apache.org/jira/browse/YARN-10390
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, capacityscheduler
Reporter: Muhammad Samir Khan
Assignee: Muhammad Samir Khan
 Attachments: user limit caching profile.pdf

Currently, user limits are cached locally in the LeafQueue.assignContainers()
call to avoid repeating some steps. This cache can be retained across calls.

Will put up a PR soon. Profiling was done using the proposed changes in
TestCapacitySchedulerPerf.
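
As a rough sketch of the idea, hoisting the per-user limit map out of the per-call scope and invalidating it only when queue state changes might look like the following (class and method names are illustrative, not the actual patch):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Illustrative sketch only: keep the user-limit cache as a field so it
// survives across assignContainers() calls, and clear it only when an event
// that affects limits (app added/removed, config change) bumps the version.
public class UserLimitCache {
  private final Map<String, Long> limitByUser = new HashMap<>();
  private long queueStateVersion = -1;

  /** Return a cached limit, recomputing only when queue state has changed. */
  public synchronized long getUserLimit(String user, long currentVersion,
      ToLongFunction<String> computeLimit) {
    if (currentVersion != queueStateVersion) {
      limitByUser.clear();  // one invalidation instead of a per-call recompute
      queueStateVersion = currentVersion;
    }
    return limitByUser.computeIfAbsent(user, computeLimit::applyAsLong);
  }
}
{code}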






[jira] [Commented] (YARN-9702) Backport YARN-5788 to branch-2.8

2019-07-25 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893177#comment-16893177
 ] 

Muhammad Samir Khan commented on YARN-9702:
---

The unit tests are failing for me on branch-2.8 without the patch as well.

> Backport YARN-5788 to branch-2.8
> 
>
> Key: YARN-9702
> URL: https://issues.apache.org/jira/browse/YARN-9702
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.6
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-9702-branch-2.8.001.patch
>
>
> Backport YARN-5788 to branch-2.8.






[jira] [Commented] (YARN-9702) Backport YARN-5788 to branch-2.8

2019-07-25 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893024#comment-16893024
 ] 

Muhammad Samir Khan commented on YARN-9702:
---

The patch was produced by cherry-picking commit 7d2d8d25ba0 (YARN-5788) onto
branch-2.8 and manually resolving conflicts.

> Backport YARN-5788 to branch-2.8
> 
>
> Key: YARN-9702
> URL: https://issues.apache.org/jira/browse/YARN-9702
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.6
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-9702-branch-2.8.001.patch
>
>
> Backport YARN-5788 to branch-2.8.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-25 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893023#comment-16893023
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

Created YARN-9702 for backporting YARN-5788 to branch-2.8.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, 
> YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.
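
The invariant behind both reproductions can be restated in a few lines of plain Java (toy values, not the Hadoop API): with two 10GB nodes, one of which carries a non-default label, default-partition QueueMetrics must account for exactly 10GB, however that 10GB is split between available and allocated.

{code:java}
public class PartitionMetricsInvariant {
  static final long GB = 1024; // QueueMetrics values are in MB

  public static void main(String[] args) {
    // Default partition holds one 10GB node; the labelled node is excluded.
    long availableMB = 8 * GB;
    long allocatedMB = 2 * GB;
    // The reported bug: allocations on the labelled partition leak into these
    // numbers, dragging available + allocated below 10GB.
    if (availableMB + allocatedMB != 10 * GB) {
      throw new AssertionError("QueueMetrics leaked across partitions");
    }
    System.out.println("default-partition totalMB = " + (availableMB + allocatedMB));
  }
}
{code}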






[jira] [Updated] (YARN-9702) Backport YARN-5788 to branch-2.8

2019-07-25 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9702:
--
Attachment: YARN-9702-branch-2.8.001.patch

> Backport YARN-5788 to branch-2.8
> 
>
> Key: YARN-9702
> URL: https://issues.apache.org/jira/browse/YARN-9702
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.6
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-9702-branch-2.8.001.patch
>
>
> Backport YARN-5788 to branch-2.8.






[jira] [Created] (YARN-9702) Backport YARN-5788 to branch-2.8

2019-07-25 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-9702:
-

 Summary: Backport YARN-5788 to branch-2.8
 Key: YARN-9702
 URL: https://issues.apache.org/jira/browse/YARN-9702
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 2.8.6
Reporter: Muhammad Samir Khan
Assignee: Muhammad Samir Khan


Backport YARN-5788 to branch-2.8.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-25 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892966#comment-16892966
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

Also seeing the unit test failures and errors on branch-2.8 without the patch.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, 
> YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892265#comment-16892265
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

Posted a patch for 2.8. It also includes a workaround in the unit test for a
race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375).

For 2.8, we will also have to backport YARN-5788. Shall I post a patch here or 
should that be tracked separately?

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, 
> YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9596:
--
Attachment: YARN-9596-branch-2.8.005.patch

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, 
> YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892191#comment-16892191
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

The remaining two unit tests in TestNodeLabelContainerAllocation should have
been fixed by the YARN-7466 addendum patch but seem to still be broken in
branch-3.0.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892122#comment-16892122
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

YARN-4901 fixes some of the unit test failures but it is not in branch-3.0.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892081#comment-16892081
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

The findbugs warnings are from branch-3.0 (pre-patch).

The unit test failures are also happening in branch-3.0; they just surface a
little later because the assert statement comes later in branch-3.0. Some of
the tests fail if I run all tests in TestNodeLabelContainerAllocation but not
if I run the specific tests by themselves.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-24 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891957#comment-16891957
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

Looking at the UT failures.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-23 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891432#comment-16891432
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

{quote}[~eepayne] yes, the patch applies cleanly with the --3way option on git 
apply. For branch-2.8 though the unit test fails because of a race condition in 
AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375)
{quote}
Due to whitespace changes between patch 002 and patch 003, the latest patch no 
longer applies cleanly to branch-3.0 and earlier versions. Uploaded a patch for 
that.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-23 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9596:
--
Attachment: YARN-9596-branch-3.0.004.patch

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-22 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890284#comment-16890284
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

CSQueueUtils#updateUsedCapacity is called before
getMaxAvailableResourceToQueuePartition, so any check for the correct
partition should live in CSQueueUtils#updateQueueStatistics, where it covers
both methods.
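
A standalone sketch of the guard being described (names and signature are illustrative, not the actual CSQueueUtils code): perform the partition check once, in the method that wraps both capacity updates.

{code:java}
// Illustrative only: updateUsedCapacity() runs before
// getMaxAvailableResourceToQueuePartition(), so a single guard in the
// wrapping updateQueueStatistics() covers both call sites.
final class QueueStatsUpdater {
  private static final String DEFAULT_PARTITION = "";

  interface QueueMetricsView {
    void setAvailableMB(long mb);
    void setAllocatedMB(long mb);
  }

  static void updateQueueStatistics(QueueMetricsView metrics,
      String nodePartition, long availableMB, long allocatedMB) {
    // Per YARN-6467, only the default partition should feed QueueMetrics.
    if (DEFAULT_PARTITION.equals(nodePartition)) {
      metrics.setAvailableMB(availableMB);
      metrics.setAllocatedMB(allocatedMB);
    }
  }
}
{code}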

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Updated] (YARN-9671) Improve Locality Scheduling when cluster is busy

2019-07-17 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9671:
--
Attachment: YARN-9671.001.patch

> Improve Locality Scheduling when cluster is busy
> 
>
> Key: YARN-9671
> URL: https://issues.apache.org/jira/browse/YARN-9671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-9671.001.patch
>
>
> When a cluster is very busy, scheduling opportunities are few and far
> between. Scheduling opportunities are how an application knows when to give
> up looking for decent locality.
> It doesn't make sense to work hard waiting for locality when the odds of it
> coming are very small and it may take a very long time to actually give up.
> This causes the priority of queues to be violated, which is the last thing
> we want to do when the cluster is full.
> Add a mode to disable skipping locality when the cluster is busy.
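
A hedged sketch of what such a mode could look like (class, field, and threshold are hypothetical, not from any posted patch): when utilization crosses a busy threshold, stop counting missed scheduling opportunities toward relaxing locality and allocate off-switch immediately.

{code:java}
// Hypothetical illustration of the proposed mode, not the actual patch.
public class LocalityWaitPolicy {
  private final boolean dontWaitWhenBusy;        // the proposed on/off mode
  private final float busyUtilizationThreshold;  // e.g. 0.95f

  public LocalityWaitPolicy(boolean dontWaitWhenBusy, float busyThreshold) {
    this.dontWaitWhenBusy = dontWaitWhenBusy;
    this.busyUtilizationThreshold = busyThreshold;
  }

  /**
   * Decide whether to give up waiting for a node/rack-local placement.
   * When the cluster is busy, scheduling opportunities arrive too rarely to
   * serve as a useful "give up" clock, so waiting mostly serves to violate
   * queue priorities.
   */
  public boolean relaxLocalityNow(long missedOpportunities,
      long localityDelay, float clusterUtilization) {
    if (dontWaitWhenBusy && clusterUtilization >= busyUtilizationThreshold) {
      return true; // allocate off-switch immediately instead of waiting
    }
    return missedOpportunities >= localityDelay;
  }
}
{code}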






[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-17 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9596:
--
Attachment: YARN-9596.003.patch

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-07-17 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887198#comment-16887198
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

Updated with changes.

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch, 
> YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Commented] (YARN-9671) Improve Locality Scheduling when cluster is busy

2019-07-15 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885468#comment-16885468
 ] 

Muhammad Samir Khan commented on YARN-9671:
---

Edited description: will handle priority inversion metrics separately in 
another jira.

> Improve Locality Scheduling when cluster is busy
> 
>
> Key: YARN-9671
> URL: https://issues.apache.org/jira/browse/YARN-9671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
>
> When a cluster is very busy, scheduling opportunities are few and far
> between. Scheduling opportunities are how an application knows when to give
> up looking for decent locality.
> It doesn't make sense to work hard waiting for locality when the odds of it
> coming are very small and it may take a very long time to actually give up.
> This causes the priority of queues to be violated, which is the last thing
> we want to do when the cluster is full.
> Add a mode to disable skipping locality when the cluster is busy.






[jira] [Updated] (YARN-9671) Improve Locality Scheduling when cluster is busy

2019-07-15 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9671:
--
Description: 
When a cluster is very busy, scheduling opportunities are few and far between.
Scheduling opportunities are how an application knows when to give up looking
for decent locality.

It doesn't make sense to work hard waiting for locality when the odds of it
coming are very small and it may take a very long time to actually give up.

This causes the priority of queues to be violated, which is the last thing we
want to do when the cluster is full.

Add a mode to disable skipping locality when the cluster is busy.

  was:
When a cluster is very busy, scheduling opportunities are few and far between.
Scheduling opportunities are how an application knows when to give up looking
for decent locality.

It doesn't make sense to work hard waiting for locality when the odds of it
coming are very small and it may take a very long time to actually give up.

This causes the priority of queues to be violated, which is the last thing we
want to do when the cluster is full.
 * Add metrics for queue priority inversions.
 * Add a mode to disable skipping locality when the cluster is busy.


> Improve Locality Scheduling when cluster is busy
> 
>
> Key: YARN-9671
> URL: https://issues.apache.org/jira/browse/YARN-9671
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
>
> When a cluster is very busy, scheduling opportunities are few and far
> between. Scheduling opportunities are how an application knows when to give
> up looking for decent locality.
> It doesn't make sense to work hard waiting for locality when the odds of it
> coming are very small and it may take a very long time to actually give up.
> This causes the priority of queues to be violated, which is the last thing
> we want to do when the cluster is full.
> Add a mode to disable skipping locality when the cluster is busy.






[jira] [Created] (YARN-9671) Improve Locality Scheduling when cluster is busy

2019-07-11 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-9671:
-

 Summary: Improve Locality Scheduling when cluster is busy
 Key: YARN-9671
 URL: https://issues.apache.org/jira/browse/YARN-9671
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Muhammad Samir Khan
Assignee: Muhammad Samir Khan


When a cluster is very busy, scheduling opportunities are few and far between.
Scheduling opportunities are how an application knows when to give up looking
for decent locality.

It doesn't make sense to work hard waiting for locality when the odds of it
coming are very small and it may take a very long time to actually give up.

This causes the priority of queues to be violated, which is the last thing we
want to do when the cluster is full.
 * Add metrics for queue priority inversions.
 * Add a mode to disable skipping locality when the cluster is busy.






[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-20 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868693#comment-16868693
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

[~eepayne] yes, the patch applies cleanly with the --3way option on git apply.
For branch-2.8, though, the unit test fails because of a race condition in
AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375).

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Comment Edited] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-20 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868693#comment-16868693
 ] 

Muhammad Samir Khan edited comment on YARN-9596 at 6/20/19 4:34 PM:


[~eepayne] yes, the patch applies cleanly with the --3way option on git apply.
For branch-2.8, though, the unit test fails because of a race condition in
AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375).


was (Author: samkhan):
[~eepayne] yes, the patch applies cleanly with the --3way option on git apply. 
For branch-2.8 though the unit test fails because of a race condition in 
AsyncDispatcher (see 
[YARN-3878|[https://issues.apache.org/jira/browse/]YARN-3878], 
[YARN-5436|[https://issues.apache.org/jira/browse/]YARN-5436], and 
[YARN-5375|[https://issues.apache.org/jira/browse/]YARN-5375])

> QueueMetrics has incorrect metrics when labelled partitions are involved
> 
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.8.0, 3.3.0
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596.001.patch, YARN-9596.002.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> /ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on the default queue (does not work if the order of the
> two jobs is swapped).
>  # While the two applications are running, the "totalMB" at
> /ws/v1/cluster/metrics will go down by the amount of MB used by the first
> job (screenshots attached).
> Alternatively, in
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
> add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10 * GB,
>     rootQueue.getMetrics().getAvailableMB()
>         + rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default
> label. The test will also fail against a 20*GB check.






[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-07 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9596:
--
Affects Version/s: 3.3.0, 2.8.0







[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-07 Thread Muhammad Samir Khan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858834#comment-16858834
 ] 

Muhammad Samir Khan commented on YARN-9596:
---

[~Naganarasimha] [~maniraj...@gmail.com] this is related to YARN-6467. Can you 
please take a look? Thanks.







[jira] [Updated] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-05 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-9596:
--
Attachment: YARN-9596.002.patch







[jira] [Assigned] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-03 Thread Muhammad Samir Khan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan reassigned YARN-9596:
-

Assignee: Muhammad Samir Khan







[jira] [Created] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

2019-06-03 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-9596:
-

 Summary: QueueMetrics has incorrect metrics when labelled 
partitions are involved
 Key: YARN-9596
 URL: https://issues.apache.org/jira/browse/YARN-9596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Muhammad Samir Khan
 Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
2019-06-03 at 4.44.15 PM.png

After YARN-6467, QueueMetrics should only be tracking metrics for the default 
partition. However, the metrics are incorrect when labelled partitions are 
involved.

Steps to reproduce
==
 # Configure capacity-scheduler.xml with label configuration
 # Add label "test" to cluster and replace label on node1 to be "test"
 # Note down "totalMB" at 
/ws/v1/cluster/metrics
 # Start first job on test queue.
 # Start second job on default queue (the issue does not reproduce if the order 
of the two jobs is swapped).
 # While the two applications are running, the "totalMB" at 
/ws/v1/cluster/metrics will go down by the 
amount of MB used by the first job (screenshots attached).

Alternatively:

In 
TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
 add the following line at the end of the test before rm1.close():

CSQueue rootQueue = cs.getRootQueue();
assertEquals(10*GB, rootQueue.getMetrics().getAvailableMB()
    + rootQueue.getMetrics().getAllocatedMB());

There are two nodes of 10GB each and only one of them has a non-default label. 
The test will also fail against a 20*GB check.






[jira] [Updated] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured

2017-08-10 Thread Muhammad Samir Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-6834:
--
Attachment: YARN-6834.001.patch

Not sure if the attached patch is the best way to solve the issue but putting 
it up for comments.

> A container request with only racks specified and relax locality set to false 
> is never honoured
> ---
>
> Key: YARN-6834
> URL: https://issues.apache.org/jira/browse/YARN-6834
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Muhammad Samir Khan
> Attachments: YARN-6834.001.patch, yarn-6834-unittest.patch
>
>
> A patch for a unit test is attached to reproduce the issue. It creates a 
> container request with only racks specified (nodes=null) and relax locality 
> set to false. With the node-locality-delay conf set appropriately, the 
> request is never allocated and the test times out.
> My understanding of what causes this issue is as follows. The 
> RegularContainerAllocator delays a rack local allocation based on the 
> node-locality-delay parameter. This delay is based on missed opportunities. 
> However, the corresponding off-switch request is skipped but does not count 
> towards a missed opportunity (because relax locality is set to false). So the 
> allocator waits indefinitely.
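To make the failing shape concrete, a request like the one the unit test builds 
can be sketched against the AMRMClient API as below. This is an illustration, 
not the attached patch: the rack name, resource sizes, and the amrmClient 
handle are assumptions for the example.

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch: a rack-only request (nodes == null) with relaxLocality == false.
// amrmClient is assumed to be an already-started
// AMRMClient<AMRMClient.ContainerRequest>.
AMRMClient.ContainerRequest rackOnlyRequest =
    new AMRMClient.ContainerRequest(
        Resource.newInstance(1024, 1),   // 1GB, 1 vcore (illustrative)
        null,                            // no node-level preference
        new String[] {"/rack1"},         // rack-local only (illustrative)
        Priority.newInstance(1),
        false);                          // relaxLocality = false
amrmClient.addContainerRequest(rackOnlyRequest);

// Per the description above: the allocator's node-locality delay expires via
// missed opportunities, but the skipped OFF_SWITCH attempt never counts as
// one when relaxLocality is false, so the request is never honoured.
{code}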






[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100280#comment-16100280
 ] 

Muhammad Samir Khan commented on YARN-6867:
---

The patch in YARN-4719 solves the problem.

> AbstractYarnScheduler reports the configured maximum resources, instead of 
> the actual, even after the configured waittime
> -
>
> Key: YARN-6867
> URL: https://issues.apache.org/jira/browse/YARN-6867
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Muhammad Samir Khan
>Assignee: Nathan Roberts
> Attachments: YARN-6867.001.patch
>
>
> AbstractYarnScheduler has a configured wait time during which it reports the 
> maximum resources from the configuration instead of the actual resources 
> available in the cluster. However, the first query after the wait time 
> expires is still answered with the configured maximum resources instead of 
> the actual maximum resources. This can cause an app submission to fail with 
> an InvalidResourceRequestException (a unit test will be attached in the 
> patch), since the maximum resources reported by the RM differ from the ones 
> the RM sanity-checks against at app submission.
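A hedged sketch of the probe the report calls for, assuming a test-side 
ResourceManager handle (e.g. MockRM) named rm; the method names below are the 
public scheduler API, while the comments paraphrase the reported behaviour.

{code:java}
// Sketch only: `rm` is assumed to be a ResourceManager/MockRM handle whose
// configured wait time has just elapsed.
ResourceScheduler scheduler = rm.getResourceScheduler();

// During the wait window the scheduler intentionally advertises the
// configured maximum (yarn.scheduler.maximum-allocation-mb).
// Reported bug: the first call *after* the window expires still returns the
// configured maximum rather than the actual cluster maximum.
Resource reported = scheduler.getMaximumResourceCapability();

// An app sized against `reported` can then be rejected with
// InvalidResourceRequestException, because the submission-time sanity check
// validates against the actual maximum.
{code}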






[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100276#comment-16100276
 ] 

Muhammad Samir Khan commented on YARN-6867:
---

Sorry, it looks like the problem was solved in trunk via YARN-4719. I should 
have checked on trunk before proceeding. Closing the JIRA now.







[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100208#comment-16100208
 ] 

Muhammad Samir Khan commented on YARN-6867:
---

Requesting [~rkanter] and [~kasha] for comments.







[jira] [Commented] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100193#comment-16100193
 ] 

Muhammad Samir Khan commented on YARN-6867:
---

If the application gets submitted before the wait time has expired, then the 
sanity check for resources will pass and the app submission will go through. 
However, if the requested resources are more than what is available in the 
cluster, the app will "hang" forever waiting for the AM container to be 
allocated. I think YARN-56 captures this issue.
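The window described here can be illustrated with a toy timeline, plain Java 
with made-up numbers rather than YARN code:

{code:java}
// Toy illustration of the complementary failure mode: the request is accepted
// inside the wait window but can never be satisfied by any real node.
public class WaitWindowSketch {
  public static void main(String[] args) {
    long configuredMaxMB = 8192;     // yarn.scheduler.maximum-allocation-mb
    long actualClusterMaxMB = 4096;  // largest node actually registered
    long requestedMB = 6144;         // AM container request (illustrative)

    // Inside the wait window the RM advertises configuredMaxMB, so the
    // submission-time sanity check passes.
    boolean accepted = requestedMB <= configuredMaxMB;

    // But no node can ever host the AM container, so the app waits forever,
    // the behaviour YARN-56 tracks.
    boolean allocatable = requestedMB <= actualClusterMaxMB;

    System.out.println("accepted=" + accepted + ", allocatable=" + allocatable);
  }
}
{code}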







[jira] [Updated] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-6867:
--
Attachment: YARN-6867.001.patch

Added a new unit test in TestRM to demonstrate how an app submission can fail 
with an InvalidResourceRequestException being thrown. Changed 
AbstractYarnScheduler to report the correct value after the wait time is over, 
and added a unit test to TestAbstractYarnScheduler.

This does not handle all the corner cases, e.g. the RM reported the max values 
before the wait time was over but the app was submitted after it had expired. 
It should, however, handle the more reproducible one.







[jira] [Created] (YARN-6867) AbstractYarnScheduler reports the configured maximum resources, instead of the actual, even after the configured waittime

2017-07-25 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-6867:
-

 Summary: AbstractYarnScheduler reports the configured maximum 
resources, instead of the actual, even after the configured waittime
 Key: YARN-6867
 URL: https://issues.apache.org/jira/browse/YARN-6867
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Muhammad Samir Khan


AbstractYarnScheduler has a configured wait time during which it reports the 
maximum resources from the configuration instead of the actual resources 
available in the cluster. However, the first query after the wait time expires 
is still answered with the configured maximum resources instead of the actual 
maximum resources. This can cause an app submission to fail with an 
InvalidResourceRequestException (a unit test will be attached in the patch), 
since the maximum resources reported by the RM differ from the ones the RM 
sanity-checks against at app submission.






[jira] [Updated] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured

2017-07-17 Thread Muhammad Samir Khan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Muhammad Samir Khan updated YARN-6834:
--
Attachment: yarn-6834-unittest.patch







[jira] [Created] (YARN-6834) A container request with only racks specified and relax locality set to false is never honoured

2017-07-17 Thread Muhammad Samir Khan (JIRA)
Muhammad Samir Khan created YARN-6834:
-

 Summary: A container request with only racks specified and relax 
locality set to false is never honoured
 Key: YARN-6834
 URL: https://issues.apache.org/jira/browse/YARN-6834
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Muhammad Samir Khan


A patch for a unit test is attached to reproduce the issue. It creates a 
container request with only racks specified (nodes=null) and relax locality set 
to false. With the node-locality-delay conf set appropriately, the request is 
never allocated and the test times out.

My understanding of what causes this issue is as follows. The 
RegularContainerAllocator delays a rack local allocation based on the 
node-locality-delay parameter. This delay is based on missed opportunities. 
However, the corresponding off-switch request is skipped but does not count 
towards a missed opportunity (because relax locality is set to false). So the 
allocator waits indefinitely.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org