[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221616#comment-17221616
 ] 

Wangda Tan commented on YARN-10458:
---

It looks good to me. The only question/ask: since this issue occurred when we 
have queue placement based on the app-tag user id plus queue auto-creation, can 
we make sure either a test is added for that (preferred), or that we definitely 
set up a single-node cluster and verify that it works in that scenario.

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that queue creation fails because the ACL submit-application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> This worked for static queues but failed for dynamic queues. We also tried 
> setting the property below, but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
>     QueueACL acl, String queueName) {
>   CSQueue queue = getQueue(queueName);
>   if (queue == null) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("ACL not found for queue access-type " + acl
>           + " for queue " + queueName);
>     }
>     return false; // <-- the method returns false here.
>   }
>   return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto-created queue, CSQueue may be null in this case. Maybe 
> scheduler#checkAccess() should have logic to differentiate the case where 
> CSQueue is null: if queue mapping is involved, check whether the parent queue 
> exists and is a managed parent, and if so check whether the parent queue has 
> valid ACLs, instead of returning false?
> Thanks
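
A minimal sketch of the fallback described above. The parent lookup and the 
ManagedParentQueue check are illustrative assumptions, not the code from the 
attached patches, and it assumes queueName is a full path such as 
"root.parent.leaf":
{code:java}
public boolean checkAccess(UserGroupInformation callerUGI,
    QueueACL acl, String queueName) {
  CSQueue queue = getQueue(queueName);
  if (queue == null) {
    // The leaf queue may not exist yet because it is auto-created at
    // submission time; fall back to the parent's ACLs when the parent is a
    // managed (auto-creation enabled) parent queue.
    int lastDot = queueName.lastIndexOf('.');
    CSQueue parent =
        lastDot > 0 ? getQueue(queueName.substring(0, lastDot)) : null;
    if (parent instanceof ManagedParentQueue) {
      return parent.hasAccess(acl, callerUGI);
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("ACL not found for queue access-type " + acl
          + " for queue " + queueName);
    }
    return false;
  }
  return queue.hasAccess(acl, callerUGI);
}
{code}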






[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220808#comment-17220808
 ] 

Wangda Tan commented on YARN-10458:
---

[~pbacsko], 

Thanks for working on the patch.

I'm trying to understand whether

{{929       String queue = appPlacementContext.getQueue();}}

always returns a relative queue path. If not, the logic below could be wrong: 
{code:java}
931  if (parent != null) {
932    queue = parent + "." + queue;
933  } {code}
And do we always have a fully qualified queue path in the queue mapping settings? 
{code:java}
2295  // can only check proper ACLs if the path is fully qualified
2296  while (queue == null || !queueName.equals("root")) { {code}
I don't fully remember this part of the logic.
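
If the placement context can already return a fully qualified path, a defensive 
guard along these lines would avoid double-prefixing (illustrative only; 
getParentQueue() and the '.'-based check are assumptions, not the actual patch):
{code:java}
String queue = appPlacementContext.getQueue();
String parent = appPlacementContext.getParentQueue();
// Only prepend the parent when the placement context returned a relative
// (short) leaf name; a name containing '.' is assumed to already be a full
// path such as "root.parent.leaf".
if (parent != null && !queue.contains(".")) {
  queue = parent + "." + queue;
}
{code}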

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that queue creation fails because the ACL submit-application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> This worked for static queues but failed for dynamic queues. We also tried 
> setting the property below, but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
>     QueueACL acl, String queueName) {
>   CSQueue queue = getQueue(queueName);
>   if (queue == null) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("ACL not found for queue access-type " + acl
>           + " for queue " + queueName);
>     }
>     return false; // <-- the method returns false here.
>   }
>   return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto-created queue, CSQueue may be null in this case. Maybe 
> scheduler#checkAccess() should have logic to differentiate the case where 
> CSQueue is null: if queue mapping is involved, check whether the parent queue 
> exists and is a managed parent, and if so check whether the parent queue has 
> valid ACLs, instead of returning false?
> Thanks






[jira] [Comment Edited] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217285#comment-17217285
 ] 

Wangda Tan edited comment on YARN-10178 at 10/20/20, 5:33 AM:
--

Since we recently had a customer hit the same issue, I spent some time looking 
at this; thanks [~tuyu] for the detailed analysis. 

Apart from the issues mentioned by [~tuyu], which are async-scheduling related, 
I think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer: it doesn't hold the scheduler's lock, so 
the problem can occur even when async-scheduling is disabled. (I checked the 
problematic log; there is a container release event at the exact same timestamp, 
within the same millisecond, as the RM crash.)

One possible way to fix the problem: inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of the queue capacities, 
which includes
{code:java}
 AbsoluteUsedCapacity
 UsedCapacity
 ConfiguredMinResource
 AbsoluteCapacity
{code}
plus a reference to the CSQueue.

Create a new internal class (like PriorityQueueResourcesForSorting) holding 
these fields; instead of sorting the CSQueues directly, we sort the new 
structure. 

There is an additional cost to copy the resource fields, but it should be 
minimal in most cases (unless you have thousands of queues). cc: [~bteke] 
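
A minimal sketch of the snapshot idea (accessor names on CSQueue/QueueCapacities 
are assumptions, and this is illustrative only, not the committed fix):
{code:java}
// Immutable snapshot of the fields the comparator reads, so TimSort sees
// stable keys even if the live queue objects change during the sort.
class PriorityQueueResourcesForSorting {
  final float absoluteUsedCapacity;
  final float usedCapacity;
  final Resource configuredMinResource;
  final float absoluteCapacity;
  final CSQueue queue; // reference back to the real queue

  PriorityQueueResourcesForSorting(CSQueue q, String partition) {
    QueueCapacities qc = q.getQueueCapacities();
    this.absoluteUsedCapacity = qc.getAbsoluteUsedCapacity(partition);
    this.usedCapacity = qc.getUsedCapacity(partition);
    this.configuredMinResource =
        q.getQueueResourceQuotas().getConfiguredMinResource(partition);
    this.absoluteCapacity = qc.getAbsoluteCapacity(partition);
    this.queue = q;
  }
}

// In getAssignmentIterator(): snapshot first, sort the snapshots, then return
// the queues in the sorted order. comparatorOverSnapshots stands for the
// existing priority/utilization comparison rewritten over the snapshot fields.
List<PriorityQueueResourcesForSorting> snapshots = new ArrayList<>();
for (CSQueue q : childQueues) {
  snapshots.add(new PriorityQueueResourcesForSorting(q, partition));
}
snapshots.sort(comparatorOverSnapshots);
{code}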


was (Author: wangda):
Since recently we have a customer has the same issue, I spent some time to look 
at this, thanks [~tuyu] for the detailed analysis. 

Apart from issues mentioned by [~tuyu], which is async-scheduling related, I 
think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer, it doesn't hold scheduler's lock, so it 
can happen even though async-scheduling is disabled. (I checked the problematic 
log, there's a container release event happened at the exact same timestamp 
(within the same milli-second) when the RM crash happens. 

One possible way to fix the problem is, inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of queue capacities 
(which includes 
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
)

Create a new internal class (like PriorityQueueResourcesForSorting) to include 
the 4 fields, instead of sorting the CSQueue directly, we will sort the new 
structure. 

There're additional costs to copy the resources field, but it should be minimum 
for most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>

[jira] [Commented] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217285#comment-17217285
 ] 

Wangda Tan commented on YARN-10178:
---

Since we recently had a customer hit the same issue, I spent some time looking 
at this; thanks [~tuyu] for the detailed analysis. 

Apart from the issues mentioned by [~tuyu], which are async-scheduling related, 
I think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer: it doesn't hold the scheduler's lock, so 
the problem can occur even when async-scheduling is disabled. (I checked the 
problematic log; there is a container release event at the exact same timestamp, 
within the same millisecond, as the RM crash.)

One possible way to fix the problem: inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of the queue capacities 
(which includes
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
plus a reference to the CSQueue).

Create a new internal class (like PriorityQueueResourcesForSorting) holding 
these fields; instead of sorting the CSQueues directly, we sort the new 
structure. 

There is an additional cost to copy the resource fields, but it should be 
minimal in most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8 Arrays.sort uses the TimSort algorithm by default, and TimSort requires 
> the comparison method to satisfy:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x.compareTo(y) == 0 implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues using these queue resource 
> usages:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler AsyncThread uses the 
> PriorityUtilizationQueueOrderingPolicy to choose a queue to assign a 
> container to, constructs a CSAssignment struct, and uses the 
> submitResourceCommitRequest function 
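
To make the failure mode above concrete, here is a minimal, self-contained 
sketch (not RM code; all names are illustrative) of how mutating the comparison 
keys while TimSort runs can trigger this exception:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class TimSortContractDemo {
  static class QueueUsage {
    volatile double used; // stands in for a queue's UsedCapacity
    QueueUsage(double used) { this.used = used; }
  }

  public static void main(String[] args) {
    Random r = new Random();
    List<QueueUsage> queues = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      queues.add(new QueueUsage(r.nextDouble()));
    }
    // Plays the role of completedContainer / queue updates running outside
    // the scheduler lock while the async thread sorts.
    Thread mutator = new Thread(() -> {
      while (true) {
        queues.get(r.nextInt(queues.size())).used = r.nextDouble();
      }
    });
    mutator.setDaemon(true);
    mutator.start();
    for (int i = 0; i < 1_000; i++) {
      // May (not deterministically) throw IllegalArgumentException:
      // "Comparison method violates its general contract!" because the keys
      // change between comparisons, breaking symmetry/transitivity.
      queues.sort((a, b) -> Double.compare(a.used, b.used));
    }
  }
}
{code}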

[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217279#comment-17217279
 ] 

Wangda Tan commented on YARN-8737:
--

Re-kicked Jenkins. After reviewing the case, the fix looks good to me, even 
though it covers only a small subset of the issues. I agree with moving the 
scheduling-related issues to YARN-10178.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API; in the RM, the 
> parent queue was refreshing its child queues by calling 
> ParentQueue#reinitialize while, meanwhile, the async-schedule threads were 
> sorting the child queues via ParentQueue#sortAndGetChildrenAllocationIterator. 
> A race condition may happen and throw the exception below, because TimSort 
> does not handle concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.
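
A minimal, self-contained sketch of the read/write-lock pattern proposed above 
(generic names; not the actual ParentQueue code):
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ChildQueueHolder<Q> {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<Q> childQueues = new ArrayList<>();

  // Sorting runs under the read lock, so it cannot interleave with a
  // reinitialize() that replaces or updates the child queue list.
  Iterator<Q> sortAndGetChildrenAllocationIterator(Comparator<Q> policy) {
    lock.readLock().lock();
    try {
      List<Q> sorted = new ArrayList<>(childQueues);
      sorted.sort(policy);
      return sorted.iterator();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Queue refresh takes the write lock, blocking concurrent sorts.
  void reinitialize(List<Q> newChildren) {
    lock.writeLock().lock();
    try {
      childQueues = new ArrayList<>(newChildren);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}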






[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-28 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203545#comment-17203545
 ] 

Wangda Tan commented on YARN-8737:
--

cc: [~snemeth], [~bteke] to help with patch reviews, test, and commit.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API; in the RM, the 
> parent queue was refreshing its child queues by calling 
> ParentQueue#reinitialize while, meanwhile, the async-schedule threads were 
> sorting the child queues via ParentQueue#sortAndGetChildrenAllocationIterator. 
> A race condition may happen and throw the exception below, because TimSort 
> does not handle concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.






[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-28 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203544#comment-17203544
 ] 

Wangda Tan commented on YARN-8737:
--

[~Tao Yang], I missed this ticket; we recently got a customer report about this 
issue.

Based on the comment from [~tuyu] (apologies that I didn't get back to you 
on the Jira) on YARN-10058: 
{quote}when patch YARN-8737 to local repo, this can not fix race condition
{quote}
I'm not sure whether this ticket can solve the problem or not. I found that 
[~tuyu] filed YARN-10178 with a detailed analysis.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API; in the RM, the 
> parent queue was refreshing its child queues by calling 
> ParentQueue#reinitialize while, meanwhile, the async-schedule threads were 
> sorting the child queues via ParentQueue#sortAndGetChildrenAllocationIterator. 
> A race condition may happen and throw the exception below, because TimSort 
> does not handle concurrent modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.






[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters

2020-09-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194378#comment-17194378
 ] 

Wangda Tan commented on YARN-4971:
--

I think we should revisit the patch based on the comment from Karthik: 
https://issues.apache.org/jira/browse/YARN-4971?focusedCommentId=15281097=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15281097

I also don't quite understand why the following two pieces of ClientRMService 
code are different: 

One is:
{code:java}
  InetSocketAddress getBindAddress(Configuration conf) {
    return conf.getSocketAddr(
        YarnConfiguration.RM_BIND_HOST,
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_PORT);
  } {code}
 

And another one is: 
{code:java}
  clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
      YarnConfiguration.RM_ADDRESS,
      YarnConfiguration.DEFAULT_RM_ADDRESS,
      server.getListenerAddress());{code}
 

Basically, serviceInit and serviceStart obtain the RM address differently. Is 
that a potential root cause of the problem? [~wilfreds], [~shuzirra]

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -
>
> Key: YARN-4971
> URL: https://issues.apache.org/jira/browse/YARN-4971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-4971.1.patch
>
>
> If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, binding to 
> the wildcard works as expected the first time the service becomes active. If 
> the service has transitioned from active to standby and then becomes active 
> again after failover, the service only binds to one of the IP addresses.
> There is a difference between the services inside the RM: it only seems to 
> happen for the services listening on ports 8030 and 8032.






[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-07-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168116#comment-17168116
 ] 

Wangda Tan commented on YARN-10380:
---

cc: [~prabhujoseph], I think we identified more issues during a debug session. 
I saw YARN-10360 was filed, but I think there are more issues; do you remember? 

Also + [~sunil.gov...@gmail.com], [~tangzhankun]. 

I checked the logic of the other parts and didn't see too many other issues, 
but I didn't spend much time on this, so it is possible I missed something. 

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Priority: Critical
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly iterate over all nodes of the partition 
> thousands of times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Created] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-07-30 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10380:
-

 Summary: Import logic of multi-node allocation in CapacityScheduler
 Key: YARN-10380
 URL: https://issues.apache.org/jira/browse/YARN-10380
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan


*1) Entry point:* 
When we do multi-node allocation, we're using the same logic of async 
scheduling:
{code:java}
// Allocate containers of node [start, end)
 for (FiCaSchedulerNode node : nodes) {
  if (current++ >= start) {
     if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
        continue;
     }
     cs.allocateContainersToNode(node.getNodeID(), false);
  }
 } {code}
Is this the most effective way to do multi-node scheduling? Should we allocate 
based on partitions? In the above logic, if we have thousands of nodes in one 
partition, we will repeatedly iterate over all nodes of the partition thousands 
of times.

I would suggest making the entry points for node-heartbeat, async-scheduling 
(single node), and async-scheduling (multi-node) different.

Node-heartbeat and async-scheduling (single node) can still be similar and 
share most of the code. 

async-scheduling (multi-node): should iterate partition first, using pseudo 
code like: 
{code:java}
for (partition : all partitions) {
  allocateContainersOnMultiNodes(getCandidate(partition))
} {code}
 






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162139#comment-17162139
 ] 

Wangda Tan commented on YARN-10352:
---

Hi [~prabhujoseph], thanks for the update. We can move the unRegisterNM 
discussion to another Jira. +1 for the uploaded patch; I will get it committed 
once Jenkins gets back. 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM, 
> so the RM's active node list will still contain those stopped nodes until the 
> NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes that have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161699#comment-17161699
 ] 

Wangda Tan commented on YARN-10352:
---

Also, we need to systematically handle the node heartbeat interval problem. In 
a cloud environment, nodes can be commissioned (and decommissioned) frequently, 
so always waiting for the 10-minute timeout may not be good. It's better to 
improve the logic by preempting containers that are newly allocated (but not 
yet acquired) on an NM which has stopped heartbeating. With this, we can 
proactively relocate containers to different nodes before the 10-minute 
timeout. It can be a follow-up of this Jira. 

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM, 
> so the RM's active node list will still contain those stopped nodes until the 
> NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes that have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161696#comment-17161696
 ] 

Wangda Tan commented on YARN-10352:
---

Thanks [~prabhujoseph], 

Then it makes sense, but the original logic is too confusing. I think we should 
clean it up and make the distinction between multi-node vs. single-node 
allocation, and between the CandidateSet and the multi-node sorter, clearer. 

Just one nit, can we reuse this method: 
{code:java}
159  long timeElapsedFromLastHeartbeat =
160  Time.monotonicNow() - cached.getLastHeartbeatMonotonicTime();
161  if (timeElapsedFromLastHeartbeat <= nmHeartbeatInterval * 2) { 
{code}
[~ztang], can you help to take a look? 
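
For reference, a small sketch of the kind of shared freshness check being 
suggested (the helper name is hypothetical, Time is assumed to be 
org.apache.hadoop.util.Time, and the factor of 2 follows the snippet above):
{code:java}
// Both the multi-node candidate filtering and shouldSkipNodeSchedule() could
// delegate to a single helper like this.
static boolean isHeartbeatFresh(long lastHeartbeatMonotonicTime,
                                long nmHeartbeatIntervalMs) {
  long timeElapsedFromLastHeartbeat =
      Time.monotonicNow() - lastHeartbeatMonotonicTime;
  // Allow up to ~2 missed heartbeat intervals before skipping the node.
  return timeElapsedFromLastHeartbeat <= 2 * nmHeartbeatIntervalMs;
}
{code}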

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM, 
> so the RM's active node list will still contain those stopped nodes until the 
> NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes that have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2020-07-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161615#comment-17161615
 ] 

Wangda Tan commented on YARN-10352:
---

[~prabhujoseph], I'm trying to understand this logic: why do we have two 
separate code paths to filter out outdated nodes? We have one in 
MultiNodeSortingManager and one in getNodesHeartbeated. I'm wondering whether 
that is necessary.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM, 
> so the RM's active node list will still contain those stopped nodes until the 
> NM liveness monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 
> minutes, Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes that have not heartbeated within the configured heartbeat 
> interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), 
> similar to the asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned 
> to the stopped NM worker0.






[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134338#comment-17134338
 ] 

Wangda Tan commented on YARN-10293:
---

Missed last comments, thanks [~prabhujoseph]/[~Tao Yang]! 

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in the CandidateNodeSet in MultiNodePlacement. YARN-10259 fixed two issues 
> related to this: 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> We have found one more bug in the CapacityScheduler.java code which causes the 
> same issue, with a slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and the reserved container stays reserved.
> After the reservation, the used capacity becomes 1.0f, the code below runs in 
> a loop, and no new allocation or reservation happens. The reserved container 
> cannot be allocated because the reserved node does not have space. node2 has 
> space for 1GB, 1vcore, but CapacityScheduler#allocateOrReserveNewContainers is 
> not getting called, causing the hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> 2020-05-21 12:13:33,243 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Allocation proposal accepted
> {code}
> 

[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler

2020-06-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132598#comment-17132598
 ] 

Wangda Tan commented on YARN-9930:
--

Thanks [~pbacsko], it would make sense to create a one-pager doc and describe 
what the behavior looks like.

For example, how does this feature relate to maximum-am-limit? And how does 
refreshQueue work with this feature (increasing #running-app-limit seems fine, 
but what about shrinking #running-app-limit?)?

> Support max running app logic for CapacityScheduler
> ---
>
> Key: YARN-9930
> URL: https://issues.apache.org/jira/browse/YARN-9930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 3.1.0, 3.1.1
>Reporter: zhoukang
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9930-001.patch, YARN-9930-002.patch, 
> YARN-9930-003.patch, YARN-9930-004.patch, YARN-9930-POC01.patch, 
> YARN-9930-POC02.patch, YARN-9930-POC03.patch, YARN-9930-POC04.patch, 
> YARN-9930-POC05.patch, screenshot-1.png
>
>
> In FairScheduler, there is a max-running-apps limitation which will leave 
> applications pending.
> But CapacityScheduler has no such max-running-apps feature; it only has max 
> apps, and jobs over that limit are rejected directly on the client side.
> In this Jira I want to implement this semantic for CapacityScheduler.






[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124195#comment-17124195
 ] 

Wangda Tan commented on YARN-10293:
---

[~Tao Yang], the suggestion totally makes sense to me. When we built the 
initial global scheduling framework, the goal was to keep it compatible with 
the previous behavior; I agree that taking additional steps to overhaul the 
reservation logic in the context of global scheduling is a good idea. Right now 
the code is very hard to read and understand.

I think we can do this step by step: first, let's fix low-hanging fruit like 
this Jira. (I hope to get your thoughts on the proposed change: 
https://issues.apache.org/jira/browse/YARN-10293?focusedCommentId=17121419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17121419)

And [~prabhujoseph], if you have time/bandwidth, can you take a look into the 
reservation-related logic + preemption + unreserve + global scheduling and see 
what we can optimize here?

 

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch, 
> YARN-10293-003-WIP.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in the CandidateNodeSet in MultiNodePlacement. YARN-10259 fixed two issues 
> related to this: 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> We have found one more bug in the CapacityScheduler.java code which causes the 
> same issue, with a slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and the reserved container stays reserved.
> After the reservation, the used capacity becomes 1.0f, the code below runs in 
> a loop, and no new allocation or reservation happens. The reserved container 
> cannot be allocated because the reserved node does not have space. node2 has 
> space for 1GB, 1vcore, but CapacityScheduler#allocateOrReserveNewContainers is 
> not getting called, causing the hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: 

[jira] [Commented] (YARN-10296) Make ContainerPBImpl#getId/setId synchronized

2020-06-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124121#comment-17124121
 ] 

Wangda Tan commented on YARN-10296:
---

[~bteke], 

It makes sense to convert the other methods which use containerId to 
synchronized as well, for example {{getProto}}. Performance should not be a big 
concern here, as multiple threads are unlikely to access the same Container 
object (because we have many container objects in memory).
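
A minimal illustration of the pattern under discussion (a stand-in holder 
class, not the actual ContainerPBImpl with its proto/builder handling):
{code:java}
import org.apache.hadoop.yarn.api.records.ContainerId;

// Stand-in for ContainerPBImpl: synchronizing the getter/setter pair ensures
// a reader never races with a writer that is swapping the id reference and
// any related state.
class ContainerIdHolder {
  private ContainerId containerId;

  public synchronized ContainerId getId() {
    return containerId;
  }

  public synchronized void setId(ContainerId id) {
    this.containerId = id;
  }
}
{code}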

> Make ContainerPBImpl#getId/setId synchronized
> -
>
> Key: YARN-10296
> URL: https://issues.apache.org/jira/browse/YARN-10296
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Minor
> Attachments: YARN-10296.001.patch
>
>
> ContainerPBImpl getId and setId methods can be accessed from multiple 
> threads. In order to avoid any simultaneous accesses and race conditions 
> these methods should be synchronized.
> The idea came from the issue described in YARN-10295, however that patch is 
> only applicable to branch-3.2 and 3.1.






[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-01 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121419#comment-17121419
 ] 

Wangda Tan commented on YARN-10293:
---

[~prabhujoseph], I agree with you. The entire {{if}} check is helpful when the 
cluster is full: we won't go into the allocation phase, which saves some CPU 
cycles.

However, it won't matter too much if the cluster is full – we cannot get a 
container allocation in any case. I suggest simplifying this logic by removing 
the if check; it sounds dangerous to me. If we see it cause a performance 
issue, we can solve it in a different way (like increasing the wait time when 
nothing can be allocated or reserved).

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-05-29 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119866#comment-17119866
 ] 

Wangda Tan commented on YARN-10293:
---

[~prabhujoseph],  

This looks like a valid bug, but I'm wondering if we really want to add the 
check like: 
{code:java}
if (getRootQueue().getQueueCapacities().getUsedCapacity(
        candidates.getPartition()) >= 1.0f
    && preemptionManager.getKillableResource(
        CapacitySchedulerConfiguration.ROOT, candidates.getPartition())
        == Resources.none()) {
   ...
} {code}
In my opinion, we can try to allocate from previously reserved containers first, and 
then allocate/reserve new containers.

Adding checks of partition capacity, etc. cannot be error-proof and could lead to 
the issues you mentioned. However, on the other side, I don't know whether removing 
it could lead to other bugs; for example, 
https://issues.apache.org/jira/browse/YARN-9432 updated the logic around this area 
a lot. I suggest consulting Tao if possible.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores
> 2. JobB Submitted to B queue with AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is lesser than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now, no new allocation happens and reserved container stays at reserved.
> After reservation the used capacity becomes 1.0f, below will be in a loop and 
> no new allocate or reserve happens. The reserved container cannot be 
> allocated as reserved node does not have space. node2 has space for 1GB, 
> 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting 
> called causing the Hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041

[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-29 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119822#comment-17119822
 ] 

Wangda Tan commented on YARN-10259:
---

Thanks [~prabhujoseph], I think we should also put this into 3.3.1; this is an 
important fix we should have.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10259-001.patch, YARN-10259-002.patch, 
> YARN-10259-003.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> Attached testcase which reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-05-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105675#comment-17105675
 ] 

Wangda Tan commented on YARN-10259:
---

Reviewed the patch, it looks good to me. I think it may introduce a performance 
regression for large clusters, but I agree this is the right fix; otherwise we can 
see issues such as the scheduler getting stuck.

Can we move this (and similar logs) to debug: 
{code:java}
LOG.warn("Node : " + node.getNodeID()
+ " does not have sufficient resource for ask : " + pendingAsk
+ " node total capability : " + node.getTotalResource()); {code}
For a heterogeneous cluster we can see this quite often, so logging it at warn is 
overkill to me.
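As a hedged sketch only (reusing the variables from the snippet quoted above, not the 
actual patch), moving that log to debug in the usual guarded style could look like:

{code:java}
// Sketch: log at debug and guard it so the string concatenation is skipped
// entirely when debug logging is disabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Node : " + node.getNodeID()
      + " does not have sufficient resource for ask : " + pendingAsk
      + " node total capability : " + node.getTotalResource());
}
{code}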

So +1 to the patch; please move some logs to debug to make sure we won't see the 
number of log lines increase too much after this change.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10259-001.patch, YARN-10259-002.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on 
> h2 as it expects the assignment to be on same node where reservation has 
> happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> After reverting fix of YARN-8127, it works. Attached testcase which 
> reproduces the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10154) CS Dynamic Queues cannot be configured with absolute resources

2020-04-16 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085078#comment-17085078
 ] 

Wangda Tan commented on YARN-10154:
---

[~maniraj...@gmail.com], there's an ASF license issue. [~sunilg], can you 
please remember to fix it when committing?

> CS Dynamic Queues cannot be configured with absolute resources
> --
>
> Key: YARN-10154
> URL: https://issues.apache.org/jira/browse/YARN-10154
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.3
>Reporter: Sunil G
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-10154.001.patch, YARN-10154.002.patch, 
> YARN-10154.003.patch
>
>
> In CS, ManagedParent Queue and its template cannot take absolute resource 
> value like 
> [memory=8192,vcores=8]
> This Jira is to track and improve the configuration reading module of 
> DynamicQueue to support absolute resource values.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10154) CS Dynamic Queues cannot be configured with absolute resources

2020-04-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084389#comment-17084389
 ] 

Wangda Tan commented on YARN-10154:
---

[~maniraj...@gmail.com], thank you so much for the patch! It looks good to me, 
[~sunilg] do you want to take another look at it? Thanks.

> CS Dynamic Queues cannot be configured with absolute resources
> --
>
> Key: YARN-10154
> URL: https://issues.apache.org/jira/browse/YARN-10154
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.3
>Reporter: Sunil G
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-10154.001.patch, YARN-10154.002.patch, 
> YARN-10154.003.patch
>
>
> In CS, ManagedParent Queue and its template cannot take absolute resource 
> value like 
> [memory=8192,vcores=8]
> This Jira is to track and improve the configuration reading module of 
> DynamicQueue to support absolute resource values.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10154) CS Dynamic Queues cannot be configured with absolute resources

2020-04-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084390#comment-17084390
 ] 

Wangda Tan commented on YARN-10154:
---

cc: [~prabhujoseph] 

> CS Dynamic Queues cannot be configured with absolute resources
> --
>
> Key: YARN-10154
> URL: https://issues.apache.org/jira/browse/YARN-10154
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.3
>Reporter: Sunil G
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-10154.001.patch, YARN-10154.002.patch, 
> YARN-10154.003.patch
>
>
> In CS, ManagedParent Queue and its template cannot take absolute resource 
> value like 
> [memory=8192,vcores=8]
> This Jira is to track and improve the configuration reading module of 
> DynamicQueue to support absolute resource values.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-04-06 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-10151.
---
Resolution: Won't Fix

Thanks folks for commenting about YARN-9838. I think we don't need this change 
now, given we already have a fix for the reported issue.

> Disable Capacity Scheduler's move app between queue functionality
> -
>
> Key: YARN-10151
> URL: https://issues.apache.org/jira/browse/YARN-10151
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Saw this happen in many clusters: Capacity Scheduler cannot work correctly 
> with the move-app-between-queues feature. It will cause weird JMX issues, 
> resource accounting issues, etc. In a lot of cases it will cause the RM to hang 
> completely and available resources to become negative; nothing can be allocated 
> after that. We should turn off CapacityScheduler's move-app-between-queues 
> feature. (see: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
>  )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10219) YARN service placement constraints is broken

2020-04-03 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074778#comment-17074778
 ] 

Wangda Tan commented on YARN-10219:
---

Thanks [~eyang] for creating the JIRA and upload fixes. 

cc: [~prabhujoseph] to review the fix.

> YARN service placement constraints is broken
> 
>
> Key: YARN-10219
> URL: https://issues.apache.org/jira/browse/YARN-10219
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2, 3.3.0, 3.2.1, 3.1.3
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-10219.001.patch
>
>
> YARN service placement constraint does not work with node label nor node 
> attributes. Example of placement constraints: 
> {code} 
>   "placement_policy": {
> "constraints": [
>   {
> "type": "AFFINITY",
> "scope": "NODE",
> "node_attributes": {
>   "label":["genfile"]
> },
> "target_tags": [
>   "ping"
> ] 
>   }
> ]
>   },
> {code}
> Node attribute added: 
> {code} ./bin/yarn nodeattributes -add "host-3.example.com:label=genfile" 
> {code} 
> Scheduling activities shows: 
> {code}  Node does not match partition or placement constraints, 
> unsatisfied PC expression="in,node,ping", target-type=ALLOCATION_TAG 
> 
>  1
>  host-3.example.com:45454{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10154) CS Dynamic Queues cannot be configured with absolute resources

2020-03-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069082#comment-17069082
 ] 

Wangda Tan commented on YARN-10154:
---

I thought I had submitted the review comments: 
[~maniraj...@gmail.com], thanks for working on this patch, overall it looks 
good to me. 
However, I suggest adding an end-to-end test; can you refer to 
{{TestCapacitySchedulerAutoQueueCreation#testAutoCreateLeafQueueCreation}}?
Thanks!

> CS Dynamic Queues cannot be configured with absolute resources
> --
>
> Key: YARN-10154
> URL: https://issues.apache.org/jira/browse/YARN-10154
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.3
>Reporter: Sunil G
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-10154.001.patch, YARN-10154.002.patch
>
>
> In CS, ManagedParent Queue and its template cannot take absolute resource 
> value like 
> [memory=8192,vcores=8]
> This Jira is to track and improve the configuration reading module of 
> DynamicQueue to support absolute resource values.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-03-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059529#comment-17059529
 ] 

Wangda Tan commented on YARN-9879:
--

Thanks [~shuzirra] for the update. 

I only checked the updates to CSQueueStore; the class now looks good to me. I 
will let others check the rest of the patch. :)

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: fs2cs
> Attachments: CSQueue.getQueueUsage.txt, DesignDoc_v1.pdf, 
> YARN-9879.POC001.patch, YARN-9879.POC002.patch, YARN-9879.POC003.patch, 
> YARN-9879.POC004.patch, YARN-9879.POC005.patch, YARN-9879.POC006.patch, 
> YARN-9879.POC007.patch, YARN-9879.POC008.patch, YARN-9879.POC009.patch, 
> YARN-9879.POC010.patch, YARN-9879.POC011.patch, YARN-9879.POC012.patch, 
> YARN-9879.POC013.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10192) CapacityScheduler stuck in loop rejecting allocation proposals

2020-03-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057277#comment-17057277
 ] 

Wangda Tan commented on YARN-10192:
---

[~Tao Yang] did you remember to see this issue before?

> CapacityScheduler stuck in loop rejecting allocation proposals
> --
>
> Key: YARN-10192
> URL: https://issues.apache.org/jira/browse/YARN-10192
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Jonathan Hung
>Priority: Major
>
> On a 2.10.0 cluster, we observed containers were being scheduled very slowly. 
> Based on logs, it seems to reject a bunch of allocation proposals, then 
> accept a bunch of reserved containers, but very few containers are actually 
> getting allocated:
> {noformat}
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,965 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,968 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster=
> 2020-03-10 06:31:48,977 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal
> 2020-03-10 06:31:48,981 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1582403122262_15460_01 
> container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu
> 2020-03-10 06:31:48,982 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=misc usedCapacity=0.0031771248 
> absoluteUsedCapacity=3.1771246E-4 used= 
> cluster=
> 2020-03-10 06:31:48,982 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.30113637 
> 

[jira] [Comment Edited] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-03-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057274#comment-17057274
 ] 

Wangda Tan edited comment on YARN-9879 at 3/11/20, 5:55 PM:


Thanks [~shuzirra] for uploading another monster patch! 

I didn't check other places, I only checked the CSQueueStore related items: 

*Nits: *

- CapacitySchedulerQueueManager#getShortNameQueues please mark @VisibleByTests
- Similarly, mark CSQueueStore#getShortNameQueues @VisibleByTests

*Primary: Locking of the class still has many issues: *

For all methods that will be accessed by external classes, make sure that: 

1) Avoid using a synchronized lock when a read/write lock is present.
2) ALL external read-only methods are protected by the read lock. 
3) ALL external writable methods are protected by the write lock.
4) Use
{code} 
try {
lock.(read/or write).lock()

.. your logic ..
} catch (exception) {
// if there's any
} finally {
lock.(read/or write).unlock()
}
{code}
to make sure the lock is always released. Example: CapacityScheduler#serviceStop

5) After the above changes, you can remove all usage of {{ConcurrentHashMap}}; it is 
bad for performance when combined with locks. HashMap will be way faster under the 
protection of the lock.
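For illustration, a minimal generic sketch of the locking pattern asked for above (a 
hypothetical store class, not the actual CSQueueStore):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic sketch: a plain HashMap guarded by a ReentrantReadWriteLock.
public class LockedStore<K, V> {
  private final Map<K, V> map = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // External read-only method: take the read lock, release it in finally.
  public V get(K key) {
    lock.readLock().lock();
    try {
      return map.get(key);
    } finally {
      lock.readLock().unlock();
    }
  }

  // External writable method: take the write lock, release it in finally.
  public void put(K key, V value) {
    lock.writeLock().lock();
    try {
      map.put(key, value);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}

The plain HashMap is safe here because every access path goes through the read or 
write lock, and the finally blocks guarantee the lock is released even if the guarded 
code throws.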



was (Author: wangda):
Thanks [~shuzirra] for uploading another monster patch! 

I didn't check other places, I only checked the CSQueueStore related items: 

*Nits: *

- CapacitySchedulerQueueManager#getShortNameQueues please mark @VisibleByTests
- Similarily, mark CSQueueStore#getShortNameQueues @VisibleByTests

*Primary: Locking of the class still have many issues: *

For all methods will be accessed by external class. Make sure that: 

1) There're no synchronized lock. 
2) All external read-only method use readlock. 
3) All external writable method use writelock.
4) Use
{code} 
try {
lock.(read/or write).lock()

.. your logic ..
} catch (exception) {
// if there's any
} finally {
lock.(read/or write).unlock()
}
{code}

To make sure lock is always released: Example: CapacityScheduler#serviceStop

5) After above changes, you can remove all usage of {{ConcurrentHashMap}}, it 
is bad for performance with locks. Hashmap will be way faster under protection 
of lock.


> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: fs2cs
> Attachments: CSQueue.getQueueUsage.txt, DesignDoc_v1.pdf, 
> YARN-9879.POC001.patch, YARN-9879.POC002.patch, YARN-9879.POC003.patch, 
> YARN-9879.POC004.patch, YARN-9879.POC005.patch, YARN-9879.POC006.patch, 
> YARN-9879.POC007.patch, YARN-9879.POC008.patch, YARN-9879.POC009.patch, 
> YARN-9879.POC010.patch, YARN-9879.POC011.patch, YARN-9879.POC012.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-03-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057274#comment-17057274
 ] 

Wangda Tan commented on YARN-9879:
--

Thanks [~shuzirra] for uploading another monster patch! 

I didn't check other places, I only checked the CSQueueStore related items: 

*Nits: *

- CapacitySchedulerQueueManager#getShortNameQueues please mark @VisibleByTests
- Similarly, mark CSQueueStore#getShortNameQueues @VisibleByTests

*Primary: Locking of the class still has many issues: *

For all methods that will be accessed by external classes, make sure that: 

1) There are no synchronized locks. 
2) All external read-only methods use the read lock. 
3) All external writable methods use the write lock.
4) Use
{code} 
try {
lock.(read/or write).lock()

.. your logic ..
} catch (exception) {
// if there's any
} finally {
lock.(read/or write).unlock()
}
{code}

This makes sure the lock is always released. Example: CapacityScheduler#serviceStop

5) After the above changes, you can remove all usage of {{ConcurrentHashMap}}; it is 
bad for performance when combined with locks. HashMap will be way faster under the 
protection of the lock.


> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: fs2cs
> Attachments: CSQueue.getQueueUsage.txt, DesignDoc_v1.pdf, 
> YARN-9879.POC001.patch, YARN-9879.POC002.patch, YARN-9879.POC003.patch, 
> YARN-9879.POC004.patch, YARN-9879.POC005.patch, YARN-9879.POC006.patch, 
> YARN-9879.POC007.patch, YARN-9879.POC008.patch, YARN-9879.POC009.patch, 
> YARN-9879.POC010.patch, YARN-9879.POC011.patch, YARN-9879.POC012.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10168) FS-CS Converter: tool doesn't handle min/max resource conversion correctly

2020-03-09 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055209#comment-17055209
 ] 

Wangda Tan commented on YARN-10168:
---

[~pbacsko], suggested change make sense to me.

> FS-CS Converter: tool doesn't handle min/max resource conversion correctly
> --
>
> Key: YARN-10168
> URL: https://issues.apache.org/jira/browse/YARN-10168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: fs2cs
> Attachments: YARN-10168-001.patch, YARN-10168-002.patch, 
> YARN-10168-003.patch, YARN-10168-004.patch
>
>
> Trying to understand logics of convert min and max resource from FS to CS, 
> and found some issues:
> 1)
> In FSQueueConverter#emitMaximumCapacity
> Existing logic in FS is to either specify a maximum percentage for queues 
> against cluster resources. Or, specify an absolute valued maximum resource.
> In the existing FS2CS converter, when a percentage-based maximum resource is 
> specified, the converter takes a global resource from fs2cs CLI, and applies 
> percentages to that. It is not correct since the percentage-based value will 
> get lost, and in the future when cluster resources go up and down, the 
> maximum resource cannot be changed.
> 2)
> The logic to deal with min/weight resource is also questionable:
> The existing fs2cs tool, it takes precedence of percentage over 
> absoluteResource, and could set both to a queue config. See 
> FSQueueConverter.Capacity#toString
> However, in CS, comparing to FS, the weights/min resource is quite different:
> CS use the same queue.capacity to specify both percentage-based or 
> absolute-resource-based configs (Similar to how FS deal with maximum 
> Resource).
>  The capacity defines guaranteed resource, which also impact fairshare of the 
> queue. (The more guaranteed resource a queue has, the larger "pie" the queue 
> can get if there's any additional available resource).
>  In FS, minResource defined the guaranteed resource, and weight defined how 
> much the pie can grow to.
> So to me, in FS, we should pick and choose either weight or minResource to 
> generate CS.
> 3)
> In FS, mix-use of absolute-resource configs (like min/maxResource), and 
> percentage-based (like weight) is allowed. But in CS, it is not allowed. The 
> reason is discussed on YARN-5881, and find [a]Should we support specifying a 
> mix of percentage ...
> The existing fs2cs doesn't handle the issue, which could set mixed absolute 
> resource and percentage-based resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-03-05 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052616#comment-17052616
 ] 

Wangda Tan commented on YARN-9879:
--

Thanks [~shuzirra] for the monster patch! 

Took a quick look at the patch, overall it looks good. (I skipped the hardest part, 
the queue mapping module, and will leave it to other folks to review.)

1) Make sure commented-out code is not part of the final patch.

2) CSQueueStore:
 - Only fullNameQueues is a ConcurrentHashMap, is that intentional?
 - getByShortName can be converted to a private method, and 
{{CapacitySchedulerQueueManager#getQueueByShortName}} is not used, so it can be 
removed.
 - Instead of a synchronized lock, I suggest using a ReadWriteLock; a method like 
{{get}} is not safe since it accesses multiple fields. Writes to the queue map are 
very infrequent compared to reads.

3) CapacityScheduler.java:
{code:java}
1144  Queue  queue = attempt.getQueue();
1145  CSQueue csQueue = queue instanceof CSQueue
{code}
This check is unnecessary. When CS is enabled, all queues in the RM are CSQueue 
instances.

4) CapacitySchedulerConfigValidator.java: 
 validateQueueHierarchy: has mixed usage of queueName and queuePath; I suggest 
moving to queuePath to make it less ambiguous.

5) There are 18 TODOs in the patch. I suggest marking the "must-fix" TODOs as FIXME; 
in most cases TODO means we will never do it. :) In Hadoop there are 731 TODOs.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: fs2cs
> Attachments: CSQueue.getQueueUsage.txt, DesignDoc_v1.pdf, 
> YARN-9879.POC001.patch, YARN-9879.POC002.patch, YARN-9879.POC003.patch, 
> YARN-9879.POC004.patch, YARN-9879.POC005.patch, YARN-9879.POC006.patch, 
> YARN-9879.POC007.patch, YARN-9879.POC008.patch, YARN-9879.POC009.patch, 
> YARN-9879.POC010.patch, YARN-9879.POC011.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10180) TimelineV2ClientImpl$TimelineEntityDispatcher threads leak

2020-03-04 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051499#comment-17051499
 ] 

Wangda Tan commented on YARN-10180:
---

Thanks [~prabhujoseph] for filing this! 

I think we should think about solving this in the short term (make sure a blocked 
write doesn't stop the thread from being released). We also need to solve this in 
the long term (the number of threads for the ATS client should be bounded instead of 
growing linearly with the number of apps; in a large cluster it is normal to have 
several thousand concurrently running apps).

And it is worth looking at whether the RM has the same issue or not.
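Purely as an illustrative sketch of the long-term direction (the class and method 
names below are hypothetical, not the actual TimelineV2ClientImpl code), a single 
bounded pool shared across applications keeps the thread count constant regardless 
of how many apps are running:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one bounded pool shared by all per-app publishers,
// instead of one dispatcher thread per application.
public class SharedDispatcherPool {
  private final ExecutorService pool;

  public SharedDispatcherPool(int maxThreads) {
    // Bounded: at most maxThreads dispatcher threads, no matter how many
    // applications are running concurrently.
    this.pool = Executors.newFixedThreadPool(maxThreads);
  }

  public void dispatch(Runnable putEntitiesTask) {
    pool.submit(putEntitiesTask);
  }

  public void stop() {
    pool.shutdownNow();
  }
}
{code}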

> TimelineV2ClientImpl$TimelineEntityDispatcher threads leak
> --
>
> Key: YARN-10180
> URL: https://issues.apache.org/jira/browse/YARN-10180
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> TimelineV2ClientImpl$TimelineEntityDispatcher threads leak when NM Timeline 
> Dispatcher thread is waiting for synchronous putEntities to complete and 
> which hangs for some reason. The STOP_TIMELINE_CLIENT for completed 
> applications waits in dispatcher queue causing threads started by 
> ApplicationImpl -> TimelineV2ClientImpl to leak.
> {code}
> "pool-19133-thread-1" #1362413 prio=5 os_prio=0 tid=0x7f027bab0800 
> nid=0x4786c waiting on condition [0x7efdbb2bf000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0004272df388> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher$1.run(TimelineV2ClientImpl.java:426)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> *NM Timeline dispatcher Thread*
> {code}
> "NM Timeline dispatcher" #283 prio=5 os_prio=0 tid=0x7f02db875000 
> nid=0x25bc22 waiting on condition [0x7f0255de9000]"NM Timeline 
> dispatcher" #283 prio=5 os_prio=0 tid=0x7f02db875000 nid=0x25bc22 waiting 
> on condition [0x7f0255de9000]   java.lang.Thread.State: WAITING (parking) 
> at sun.misc.Unsafe.park(Native Method) - parking to wait for  
> <0x000411d71310> (a 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$EntitiesHolder) 
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
> java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429) at 
> java.util.concurrent.FutureTask.get(FutureTask.java:191) at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
>  at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:335)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.handleNMTimelineEvent(NMTimelinePublisher.java:145)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ForwardingEventHandler.handle(NMTimelinePublisher.java:427)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ForwardingEventHandler.handle(NMTimelinePublisher.java:422)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) 
> at java.lang.Thread.run(Thread.java:748) 
> {code}
> cc [~leftnoteasy]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-03-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049947#comment-17049947
 ] 

Wangda Tan commented on YARN-10178:
---

[~tuyu], can you add more details like error message, thread trace, etc?

 

Thanks,

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10167) FS-CS Converter: Need validate c-s.xml after converting

2020-02-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046749#comment-17046749
 ] 

Wangda Tan commented on YARN-10167:
---

[~pbacsko], agree with this: 
{quote}Note that the converter itself already starts an FS instance inside to 
parse and load the allocation file. We can do the same thing with CS. Just load 
the converted config along with the delta {{yarn-site.xml}} (which essentially 
means that we merge the original site + the delta) and let's see if it can 
start.
{quote}
We can check if MiniYARNCluster can help or not. I'm not sure if we can 
directly initialize CS or not, since it has other module dependencies. 

> FS-CS Converter: Need validate c-s.xml after converting
> ---
>
> Key: YARN-10167
> URL: https://issues.apache.org/jira/browse/YARN-10167
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: fs2cs, newbie
>
> Currently we just generated c-s.xml, but we didn't validate that. To make 
> sure the c-s.xml is correct after conversion, it's better to initialize the 
> CS scheduler using configs.
> Also, in the test, we should try to leverage MockRM to validate generated 
> configs as much as we could.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10168) FS-CS Convert: Converter tool doesn't handle min/max resource conversion correct

2020-02-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046736#comment-17046736
 ] 

Wangda Tan commented on YARN-10168:
---

[~pbacsko], what you mentioned all makes sense to me.

I think we should only support converting weight to capacity for now (and set max 
to 100). It will have reliable behavior. This is the only blocker issue we need 
to fix. For min/maxResource we should push it to another JIRA.
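For illustration only, a hedged sketch of the "weights become sibling capacity 
percentages, maximum-capacity set to 100" idea (the queue names and weights are made 
up; this is not the fs2cs converter code):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: convert FS sibling weights into CS capacity percentages.
public class WeightToCapacitySketch {
  static Map<String, Float> toCapacities(Map<String, Float> weights) {
    float sum = 0f;
    for (float w : weights.values()) {
      sum += w;
    }
    Map<String, Float> capacities = new LinkedHashMap<>();
    for (Map.Entry<String, Float> e : weights.entrySet()) {
      // Each sibling's capacity is its share of the total weight, in percent;
      // maximum-capacity for each queue would simply be set to 100.
      capacities.put(e.getKey(), e.getValue() / sum * 100f);
    }
    return capacities;
  }

  public static void main(String[] args) {
    Map<String, Float> weights = new LinkedHashMap<>();
    weights.put("root.a", 3f);
    weights.put("root.b", 1f);
    // Prints {root.a=75.0, root.b=25.0}
    System.out.println(toCapacities(weights));
  }
}
{code}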

> FS-CS Convert: Converter tool doesn't handle min/max resource conversion 
> correct
> 
>
> Key: YARN-10168
> URL: https://issues.apache.org/jira/browse/YARN-10168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Blocker
>
> Trying to understand logics of convert min and max resource from FS to CS, 
> and found some issues:
> 1)
> In FSQueueConverter#emitMaximumCapacity
> Existing logic in FS is to either specify a maximum percentage for queues 
> against cluster resources. Or, specify an absolute valued maximum resource.
> In the existing FS2CS converter, when a percentage-based maximum resource is 
> specified, the converter takes a global resource from fs2cs CLI, and applies 
> percentages to that. It is not correct since the percentage-based value will 
> get lost, and in the future when cluster resources go up and down, the 
> maximum resource cannot be changed.
> 2)
> The logic to deal with min/weight resource is also questionable:
> The existing fs2cs tool, it takes precedence of percentage over 
> absoluteResource, and could set both to a queue config. See 
> FSQueueConverter.Capacity#toString
> However, in CS, comparing to FS, the weights/min resource is quite different:
> CS use the same queue.capacity to specify both percentage-based or 
> absolute-resource-based configs (Similar to how FS deal with maximum 
> Resource).
>  The capacity defines guaranteed resource, which also impact fairshare of the 
> queue. (The more guaranteed resource a queue has, the larger "pie" the queue 
> can get if there's any additional available resource).
>  In FS, minResource defined the guaranteed resource, and weight defined how 
> much the pie can grow to.
> So to me, in FS, we should pick and choose either weight or minResource to 
> generate CS.
> 3)
> In FS, mix-use of absolute-resource configs (like min/maxResource), and 
> percentage-based (like weight) is allowed. But in CS, it is not allowed. The 
> reason is discussed on YARN-5881, and find [a]Should we support specifying a 
> mix of percentage ...
> The existing fs2cs doesn't handle the issue, which could set mixed absolute 
> resource and percentage-based resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10167) FS-CS Converter: Need validate c-s.xml after converting

2020-02-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046233#comment-17046233
 ] 

Wangda Tan commented on YARN-10167:
---

Kinga: we may not be able to use that, since we cannot assume there's a running 
cluster with CS enabled; we have to do this validation from the CLI itself.

> FS-CS Converter: Need validate c-s.xml after converting
> ---
>
> Key: YARN-10167
> URL: https://issues.apache.org/jira/browse/YARN-10167
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: fs2cs, newbie
>
> Currently we just generated c-s.xml, but we didn't validate that. To make 
> sure the c-s.xml is correct after conversion, it's better to initialize the 
> CS scheduler using configs.
> Also, in the test, we should try to leverage MockRM to validate generated 
> configs as much as we could.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10170) Should revisit mix-usage of percentage-based and absolute-value-based min/max resource in CapacityScheduler

2020-02-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046076#comment-17046076
 ] 

Wangda Tan commented on YARN-10170:
---

cc: [~sunil.gov...@gmail.com]

> Should revisit mix-usage of percentage-based and absolute-value-based min/max 
> resource in CapacityScheduler
> ---
>
> Key: YARN-10170
> URL: https://issues.apache.org/jira/browse/YARN-10170
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Major
>
> This should be finished after YARN-10169. (If we can get this one easily, we 
> should do this one instead of YARN-10169).
> Absolute resource means mem=x, vcore=y.
> Percentage resource means x%
> We should not allow percentage-based child, but absolute-based parent. (root 
> is considered as percentage-based).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10170) Should revisit mix-usage of percentage-based and absolute-value-based min/max resource in CapacityScheduler

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10170:
-

 Summary: Should revisit mix-usage of percentage-based and 
absolute-value-based min/max resource in CapacityScheduler
 Key: YARN-10170
 URL: https://issues.apache.org/jira/browse/YARN-10170
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


This should be finished after YARN-10169. (If we can get this one easily, we 
should do this one instead of YARN-10169).

Absolute resource means mem=x, vcore=y.

Percentage resource means x%

We should not allow percentage-based child, but absolute-based parent. (root is 
considered as percentage-based).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-02-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046017#comment-17046017
 ] 

Wangda Tan commented on YARN-10169:
---

cc: [~sunil.gov...@gmail.com]

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Blocker
>
> To me this is a bug: a queue can have capacity set to a float value and 
> maximum-capacity set to an absolute value, and the existing logic allows this 
> behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when a queue is configured like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10169:
-

 Summary: Mixed absolute resource value and percentage-based 
resource value in CapacityScheduler should fail
 Key: YARN-10169
 URL: https://issues.apache.org/jira/browse/YARN-10169
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


To me this is a bug: a queue can have capacity set to a float value and 
maximum-capacity set to an absolute value, and the existing logic allows this 
behavior.

For example:
{code:java}
queue.capacity = 0.8 
queue.maximum-capacity = [mem=x, vcore=y] {code}
We should throw an exception when a queue is configured like this.
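For illustration, a hedged validation sketch (the bracket-prefix test below is an 
assumption about how absolute values are written, e.g. [memory=8192,vcores=8]; this 
is not the actual CapacityScheduler validation code):

{code:java}
// Hypothetical check: reject a queue whose capacity and maximum-capacity use
// different modes (one percentage-based, the other absolute).
static void checkCapacityModes(String queue, String capacity,
    String maximumCapacity) {
  boolean capacityIsAbsolute = capacity.trim().startsWith("[");
  boolean maximumIsAbsolute = maximumCapacity.trim().startsWith("[");
  if (capacityIsAbsolute != maximumIsAbsolute) {
    throw new IllegalArgumentException("Queue " + queue
        + " mixes percentage-based and absolute resource values: capacity="
        + capacity + ", maximum-capacity=" + maximumCapacity);
  }
}
{code}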



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10168) FS-CS Convert: Converter tool doesn't handle min/max resource conversion correct

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10168:
-

 Summary: FS-CS Convert: Converter tool doesn't handle min/max 
resource conversion correct
 Key: YARN-10168
 URL: https://issues.apache.org/jira/browse/YARN-10168
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


While trying to understand the logic for converting min and max resources from 
FS to CS, I found some issues:

1)

In FSQueueConverter#emitMaximumCapacity

The existing FS logic is to either specify a maximum percentage for a queue 
against cluster resources, or specify an absolute-valued maximum resource.

In the existing FS2CS converter, when a percentage-based maximum resource is 
specified, the converter takes a global resource from the fs2cs CLI and applies 
the percentage to that. This is not correct: the percentage-based value gets 
lost, and when cluster resources later go up or down, the maximum resource 
cannot follow.

2)

The logic to deal with min/weight resource is also questionable:

The existing fs2cs tool gives percentage precedence over absoluteResource and 
can set both on a queue config; see FSQueueConverter.Capacity#toString.

However, compared to FS, CS handles weights/min resource quite differently:

CS uses the same queue.capacity to specify both percentage-based and 
absolute-resource-based configs (similar to how FS deals with maximum resource).
 The capacity defines the guaranteed resource, which also impacts the fair share 
of the queue (the more guaranteed resource a queue has, the larger the "pie" it 
can get if there is any additional available resource).
 In FS, minResource defines the guaranteed resource, and weight defines how 
large the pie can grow.

So to me, when converting from FS, we should pick either weight or minResource 
to generate the CS capacity.

3)

In FS, mixing absolute-resource configs (like min/maxResource) with 
percentage-based ones (like weight) is allowed, but in CS it is not. The reason 
is discussed in YARN-5881 ("Should we support specifying a mix of percentage ...").

The existing fs2cs tool doesn't handle this issue and can emit a mix of 
absolute-resource and percentage-based values.
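
To illustrate point 1) with a made-up example (queue name, cluster size and 
values are hypothetical):
{code}
# FS queue "root.a" had a percentage-based max share, e.g. 60% of the cluster.
# Current fs2cs output freezes it into an absolute value computed from the
# cluster resource passed on the CLI:
yarn.scheduler.capacity.root.a.maximum-capacity=[memory=61440,vcores=60]
# Keeping it percentage-based preserves the relative semantics as the cluster
# grows or shrinks:
yarn.scheduler.capacity.root.a.maximum-capacity=60
{code}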



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10167) FS-CS Converter: Need validate c-s.xml after converting

2020-02-26 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10167:
--
Summary: FS-CS Converter: Need validate c-s.xml after converting  (was: 
Need validate c-s.xml after converting)

> FS-CS Converter: Need validate c-s.xml after converting
> ---
>
> Key: YARN-10167
> URL: https://issues.apache.org/jira/browse/YARN-10167
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>  Labels: fs2cs, newbie
>
> Currently we just generate c-s.xml, but we don't validate it. To make
> sure the c-s.xml is correct after conversion, it's better to initialize the
> CS scheduler using the generated configs.
> Also, in the tests, we should try to leverage MockRM to validate the generated
> configs as much as we can.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10167) Need validate c-s.xml after converting

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10167:
-

 Summary: Need validate c-s.xml after converting
 Key: YARN-10167
 URL: https://issues.apache.org/jira/browse/YARN-10167
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


Currently we just generate c-s.xml, but we don't validate it. To make sure the 
c-s.xml is correct after conversion, it's better to initialize the CS scheduler 
using the generated configs.

Also, in the tests, we should try to leverage MockRM to validate the generated 
configs as much as we can.
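
A minimal sketch of the kind of check meant here (test-style flow, not converter 
code; {{outputDir}} and the usual imports of MockRM, YarnConfiguration and 
CapacityScheduler from the RM test module are assumed):
{code:java}
// Load the generated file and let MockRM/CapacityScheduler initialize from it,
// so an invalid c-s.xml fails fast.
Configuration conf = new YarnConfiguration();
conf.addResource(new Path(outputDir, "capacity-scheduler.xml"));
conf.set(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class.getCanonicalName());

MockRM rm = new MockRM(conf);   // scheduler init throws on an invalid config
try {
  rm.start();
} finally {
  rm.stop();
}
{code}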



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-02-18 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039495#comment-17039495
 ] 

Wangda Tan commented on YARN-10151:
---

And this should apply to all branches.

> Disable Capacity Scheduler's move app between queue functionality
> -
>
> Key: YARN-10151
> URL: https://issues.apache.org/jira/browse/YARN-10151
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> We have seen this happen in many clusters: Capacity Scheduler cannot work 
> correctly with the move-app-between-queues feature. It causes weird JMX issues, 
> resource accounting issues, etc. In a lot of cases it leaves the RM completely 
> hung with negative available resources, and nothing can be allocated after 
> that. We should turn off CapacityScheduler's move-app-between-queues feature. 
> (see: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
>  )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-02-18 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10151:
-

 Summary: Disable Capacity Scheduler's move app between queue 
functionality
 Key: YARN-10151
 URL: https://issues.apache.org/jira/browse/YARN-10151
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


We have seen this happen in many clusters: Capacity Scheduler cannot work 
correctly with the move-app-between-queues feature. It causes weird JMX issues, 
resource accounting issues, etc. In a lot of cases it leaves the RM completely 
hung with negative available resources, and nothing can be allocated after that. 
We should turn off CapacityScheduler's move-app-between-queues feature. 
(see: 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
 )
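
One possible shape of the switch-off, just as a sketch (the config key below is 
hypothetical, and {{conf}} is assumed to be the scheduler configuration):
{code:java}
// At the top of CapacityScheduler#moveApplication: reject moves unless
// explicitly enabled by the (made-up) property below.
if (!conf.getBoolean(
    "yarn.scheduler.capacity.move-application-enabled", false)) {
  throw new YarnException(
      "Move application between queues is not supported by CapacityScheduler");
}
{code}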



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-22 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021651#comment-17021651
 ] 

Wangda Tan commented on YARN-9879:
--

Thanks [~shuzirra], [~wilfreds] for sharing your thoughts!

1) Regarding changing the semantics of GetQueueName() to return the fully 
qualified queue name vs. using GetQueuePath:

If we decide to go the first route, we need to remove usages of 
AbstractCSQueue.GetQueuePath (which has 128 usages) and add a 
GetShortQueueName in some places. So to me, there is no significant difference 
compared to just changing internal CS usages to use GetQueuePath().

2) No matter which way we decide to go, I think we should make sure of the following:

API compatibility: this is critical since I assume there are lots of monitoring 
frameworks, JMX metrics, etc. based on this. If we upgrade an existing CS-based 
cluster, they should see the same results. Please refer to the API compatibility 
guidelines: 
[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html]

Internal usage of GetQueuePath (or GetShortQueueName if we choose the proposed 
approach). Externally, we should make sure a queue can be looked up by either 
its short name or its long name. I want to make sure we only check the short 
name / long name on external calls (like submitting an app to a specified 
queue), and in all other places we operate on the full queue path. I think 
introducing a new CSQueueStore sounds good, but I recommend adding a separate 
method to CSQueueStore that checks both short and long names and is used by 
external callers only (in contrast, internal CS methods should check only one 
HashMap instead of two). We can review the details of CSQueueStore separately.
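
A rough sketch of the store shape I have in mind (class and method names are 
illustrative, not the actual YARN-9879 implementation; the existing CSQueue 
interface is assumed):
{code:java}
import java.util.*;

// Internal code resolves by full path only; the short-name map is consulted
// only on the external submission path.
class CSQueueStoreSketch {
  private final Map<String, CSQueue> byFullPath = new HashMap<>();
  private final Map<String, Set<CSQueue>> byShortName = new HashMap<>();

  // internal callers: full path only, single map lookup
  CSQueue getByFullPath(String fullPath) {
    return byFullPath.get(fullPath);
  }

  // external callers (e.g. app submission): accept full path or short name
  CSQueue resolve(String nameOrPath) {
    CSQueue queue = byFullPath.get(nameOrPath);
    if (queue != null) {
      return queue;
    }
    Set<CSQueue> candidates =
        byShortName.getOrDefault(nameOrPath, Collections.emptySet());
    if (candidates.size() > 1) {
      throw new IllegalArgumentException("Ambiguous queue name '" + nameOrPath
          + "', please use the full queue path");
    }
    return candidates.isEmpty() ? null : candidates.iterator().next();
  }
}
{code}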

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf, YARN-9879.POC001.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020522#comment-17020522
 ] 

Wangda Tan commented on YARN-9879:
--

[~shuzirra], I think we should not change the semantics of GetQueueName in 
AbstractCSQueue, to avoid changing the API. (We should keep the queue-related 
REST APIs unchanged, otherwise it would be an incompatible change.)

Instead of changing GetQueueName, you should first check all callers of 
GetQueueName. There is already a GetQueuePath, and you can leverage that.

I briefly checked GetQueueName usages; there are 155 of them in production code. 
Most of them are just for logging purposes 
("org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.*" should 
be considered logging as well). It may take a few hours to identify and change 
everything, but manually changing GetQueueName to GetQueuePath case by case 
sounds like the safer option to me.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf, YARN-9879.POC001.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10049) FIFOOrderingPolicy Improvements

2020-01-17 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018193#comment-17018193
 ] 

Wangda Tan commented on YARN-10049:
---

[~sunilg],

I agree that priority > FIFO; for both the fair and fifo policies, we should not 
override priority.

I also put a comment on 
https://issues.apache.org/jira/browse/YARN-10043?focusedCommentId=17017328=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17017328

I am also skeptical about considering the application name as part of app ordering.

 

> FIFOOrderingPolicy Improvements
> ---
>
> Key: YARN-10049
> URL: https://issues.apache.org/jira/browse/YARN-10049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> FIFOPolicy of FS does the following comparisons in addition to app priority 
> comparison:
> 1. Using Start time
> 2. Using Name
> Scope of this jira is to achieve the same comparisons in FIFOOrderingPolicy 
> of CS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10085) FS-CS converter: remove mixed ordering policy check

2020-01-16 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017331#comment-17017331
 ] 

Wangda Tan commented on YARN-10085:
---

[~pbacsko], I also posted a comment on YARN-10043. To me it is sufficient to 
convert FS (drf/fair) to fair in CS; if drf is set anywhere in FS, we should set 
the global DominantResourceCalculator in CS, and we can print a warning for 
that. To be honest, it is such a minor behavior that we may not even need the 
warning.
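
For illustration, the converted output I have in mind would look like this 
(queue name is just an example):
{code}
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn.scheduler.capacity.root.a.ordering-policy=fair
{code}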

> FS-CS converter: remove mixed ordering policy check
> ---
>
> Key: YARN-10085
> URL: https://issues.apache.org/jira/browse/YARN-10085
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
>
> When YARN-9892 gets committed, this part will become unnecessary:
> {noformat}
> // Validate ordering policy
> if (queueConverter.isDrfPolicyUsedOnQueueLevel()) {
>   if (queueConverter.isFifoOrFairSharePolicyUsed()) {
> throw new ConversionException(
> "DRF ordering policy cannot be used together with fifo/fair");
>   } else {
> capacitySchedulerConfig.set(
> CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS,
> DominantResourceCalculator.class.getCanonicalName());
>   }
> }
> {noformat}
> We will be able to freely mix fifo/fair/drf, so let's get rid of this strict 
> check and also rewrite {{FSQueueConverter.emitOrderingPolicy()}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10043) FairOrderingPolicy Improvements

2020-01-16 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017328#comment-17017328
 ] 

Wangda Tan commented on YARN-10043:
---

Thanks [~maniraj...@gmail.com]  for posting thoughts on this.

In my opinion, the importance of the mentioned behaviors is 3 > 4 > 1; #3 is 
already supported.

#4 is important, we should add it.

#1 to me only impacts performance, not correctness (an app without demand won't 
be allocated anything), but comparing one more field could also impact 
performance. So I would say it is minor.

And:

To me #5 is not a necessary behavior; why would an app whose name starts with 
"a" be more important than one starting with "z"? Since we already compare 3/4, 
I feel it is not worth adding.

#2 is not necessary to me, for two reasons: a. it is only related to queues; b. 
for queues, CS already compares relative usage. I don't really think adding one 
more resource comparison is worth it here.

I think YARN-10049 is also the same.

+ [~pbacsko], since Peter is asking similar questions.

> FairOrderingPolicy Improvements
> ---
>
> Key: YARN-10043
> URL: https://issues.apache.org/jira/browse/YARN-10043
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> FairOrderingPolicy can be improved by using some of the approaches (only 
> relevant) implemented in FairSharePolicy of FS. This improvement has 
> significance in FS to CS migration context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level

2020-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016160#comment-17016160
 ] 

Wangda Tan commented on YARN-9892:
--

I spent a bit of time looking at the code. I haven't fully dug into all the 
details, but I think there is a significant difference between how FS and CS 
handle the resource calculator:

In CS, the resource calculator is initialized and used across all the logic.

In FS, there is a separate SchedulingPolicy, which is a wrapper around the 
resource calculator. It covers areas like computing shares, sorting apps/queues, 
computing headroom, etc.

The patch uploaded to this Jira handles only a small area, which is sorting apps.

I want to hear thoughts from [~wilfreds] on why per-queue DRF is a P0 feature; 
to me, DRF is just a natural extension of Fair (which considers multiple 
resource types). I still think we should simplify the logic and the knobs we 
expose to users. Cluster admins have to re-tune lots of queue capacities and 
SLAs after conversion, and we cannot maintain all the different behaviors.

And there is still a global ResourceCalculator configuration, which is used by 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils#getNormalizedResource
 when a request comes into the scheduler. I'm a bit confused about how the 
SchedulingPolicy would interact with the global ResourceCalculator.

> Capacity scheduler: support DRF ordering policy on queue level
> --
>
> Key: YARN-9892
> URL: https://issues.apache.org/jira/browse/YARN-9892
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9892-003.patch, YARN-9892.001.patch, 
> YARN-9892.002.patch
>
>
> Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering 
> policy on queue level. Only "fifo" and "fair" are accepted for 
> {{yarn.scheduler.capacity..ordering-policy}}.
> DRF can only be used globally if 
> {{yarn.scheduler.capacity.resource-calculator}} is set to 
> DominantResourceCalculator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level

2020-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016151#comment-17016151
 ] 

Wangda Tan commented on YARN-9892:
--

[~pbacsko], 

My concern is that adding DRF only to the application ordering policy is not 
meaningful:

All the resource requests of apps are normalized differently depending on the 
resource calculator implementation (default or DRF): 
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator#normalize

When the global resource calculator is configured as "default", we simply ignore 
vcores in all calculations, including queues, nodes, and apps, so what does DRF 
mean in that scenario?

To me, we should map both FairScheduler's fair and drf to fair in 
CapacityScheduler, because CapacityScheduler does app/queue sorting based on the 
global resource calculator, which could be fair (memory only, when the default 
calculator is configured) or drf (all resource types, when the 
DominantResourceCalculator is configured).
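
To illustrate the mismatch (the queue name is made up, and "drf" as an 
ordering-policy value is what this patch proposes, not an existing option):
{code}
# Memory-only normalization globally, but DRF-based app sorting in one queue:
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.a.ordering-policy=drf
{code}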

> Capacity scheduler: support DRF ordering policy on queue level
> --
>
> Key: YARN-9892
> URL: https://issues.apache.org/jira/browse/YARN-9892
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9892-003.patch, YARN-9892.001.patch, 
> YARN-9892.002.patch
>
>
> Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering 
> policy on queue level. Only "fifo" and "fair" are accepted for 
> {{yarn.scheduler.capacity..ordering-policy}}.
> DRF can only be used globally if 
> {{yarn.scheduler.capacity.resource-calculator}} is set to 
> DominantResourceCalculator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016121#comment-17016121
 ] 

Wangda Tan commented on YARN-9879:
--

[~wilfreds], I agree with,
{quote}The behaviour inside the scheduler must all be based on the full queue 
paths anyway.
{quote}
I also agree that we need to carefully think about queue mapping and queue 
path. I would suggest moving queue-mapping-related changes to a different Jira 
to avoid putting two big patches together. (If the patch already considers both 
scenarios, we can keep it here.)

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9892) Capacity scheduler: support DRF ordering policy on queue level

2020-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016112#comment-17016112
 ] 

Wangda Tan commented on YARN-9892:
--

[~pbacsko], [~maniraj...@gmail.com] , thanks for working on this.

Sorry for chiming in late, I have several questions:

1) What is the use case for allowing DRF-based app sorting only (while other 
computations use memory only)? I saw it is under the FS -> CS conversion tool, 
but is it a feature we really want to support?

2) In CS, IIRC, when DRF is disabled all resource accounting only looks at 
memory. For example, a queue that sets a 100-vcore max limit will not have it 
respected; asking for 100 vcores is the same as asking for 0 vcores. In that 
case, I don't think looking at the DRF of apps is meaningful.

cc: [~sunilg] , [~wilfreds]  to add some thoughts here.

> Capacity scheduler: support DRF ordering policy on queue level
> --
>
> Key: YARN-9892
> URL: https://issues.apache.org/jira/browse/YARN-9892
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9892-003.patch, YARN-9892.001.patch, 
> YARN-9892.002.patch
>
>
> Capacity scheduler does not support DRF (Dominant Resource Fairness) ordering 
> policy on queue level. Only "fifo" and "fair" are accepted for 
> {{yarn.scheduler.capacity..ordering-policy}}.
> DRF can only be used globally if 
> {{yarn.scheduler.capacity.resource-calculator}} is set to 
> DominantResourceCalculator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015404#comment-17015404
 ] 

Wangda Tan edited comment on YARN-9879 at 1/14/20 10:01 PM:


[~snemeth], most of the explanation looks reasonable to me. Regarding how to 
prevent breaking the existing CS queue contract: instead of adding a flag to 
each queue, I suggest having a global config in CS that controls whether 
duplicated leaf queue names are allowed.

Why am I opposed to adding the flag to each queue?

To me, using a queue name or a queue path is an intuitive choice for a user (not 
an admin). If the queue name has duplicates, submission should fail and give you 
the right reason.

If everybody thinks we should not implicitly change the CS behavior to allow 
duplicate-named leaf queues, a top-level CS config should be sufficient (like 
...duplicated-queue-names.allowed), and we should clearly document that it may 
cause existing app failures. This won't add any burden for users to understand, 
and it is also relatively easy for admins to understand. Any config added to the 
queue hierarchy seems a bit tricky (e.g., the admin has to think about how 
queue-level overrides would look), and for the auto-created queue case it is not 
obvious how to add such configs. My big lesson learned is that we should add as 
few knobs as we can; too many knobs will increase our support burden a lot and 
make the code hard to maintain.


was (Author: leftnoteasy):
[~snemeth], most of the explanation looks reasonable to me. Regarding how to 
prevent breaking the existing CS queue contract: instead of adding a flag to 
each queue, I suggest having a global config in CS that controls whether 
duplicated leaf queue names are allowed.

Why am I opposed to adding the flag to each queue?

To me, using a queue name or a queue path is an intuitive choice for a user (not 
an admin). If the queue name has duplicates, submission should fail and give you 
the right reason.

If everybody thinks we should not implicitly change the CS behavior to allow 
duplicate-named leaf queues, a top-level CS config should be sufficient (like 
...duplicated-queue-names.allowed), and we should clearly document that it may 
cause existing app failures. This won't add any burden for users to understand, 
and it is also relatively easy for admins to understand. Any config added to the 
queue hierarchy seems a bit tricky.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015404#comment-17015404
 ] 

Wangda Tan commented on YARN-9879:
--

[~snemeth], most of the explanation looks reasonable to me. Regarding how to 
prevent breaking the existing CS queue contract: instead of adding a flag to 
each queue, I suggest having a global config in CS that controls whether 
duplicated leaf queue names are allowed.

Why am I opposed to adding the flag to each queue?

To me, using a queue name or a queue path is an intuitive choice for a user (not 
an admin). If the queue name has duplicates, submission should fail and give you 
the right reason.

If everybody thinks we should not implicitly change the CS behavior to allow 
duplicate-named leaf queues, a top-level CS config should be sufficient (like 
...duplicated-queue-names.allowed), and we should clearly document that it may 
cause existing app failures. This won't add any burden for users to understand, 
and it is also relatively easy for admins to understand. Any config added to the 
queue hierarchy seems a bit tricky.
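
For example, the top-level switch could take a shape like this (the exact 
property name is hypothetical; it was not decided in this discussion):
{code}
yarn.scheduler.capacity.duplicated-queue-names.allowed=false
{code}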

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-09 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012282#comment-17012282
 ] 

Wangda Tan commented on YARN-9879:
--

Thanks [~shuzirra]. I think adding a flag (a suggestion from [~adam.antal]) 
will prevent admins from changing it accidentally, but it is hard to understand 
(thinking about a regular Hadoop user), and we would need to maintain it in the 
long run.

So instead, I would like to allow users to make the change but fail application 
submission with a clear message (like: you cannot submit the application because 
there are multiple queues with the name XYZ; use the fully qualified queue name, 
or remove/rename the duplicated queues, etc.). If admins want to go back and 
undo the change, they can easily do that.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-07 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009975#comment-17009975
 ] 

Wangda Tan commented on YARN-9879:
--

[~pbacsko], thanks for working on the design. 

In general, I agree with what [~wilfreds] mentioned: we should try to avoid 
changing RPC protocols; instead, we should just change the internal logic to 
make sure multiple queues can be handled.

To me there are two major parts:

1) Whatever logic inside CS allows multiple queue names: either solution 
mentioned in the comment 
https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=17009845=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17009845
 should be fine. I expect the lookup by queue name (not queue path) to be 
called only at application submission.

Once an application is submitted to CS, internally we should make sure we use 
the queue path instead of the queue name in all other places; otherwise we will 
complicate other logic.

2) At app submission, the scheduler is going to accept or reject the app based 
on the uniqueness of the specified queue name or path. The core part that needs 
to be changed is inside RMAppManager:
{code:java}
 if (!isRecovery && YarnConfiguration.isAclEnabled(conf)) {
  if (scheduler instanceof CapacityScheduler) {
String queueName = submissionContext.getQueue();
String appName = submissionContext.getApplicationName();
CSQueue csqueue = ((CapacityScheduler) scheduler).getQueue(queueName);{code}
Instead of using scheduler.getQueue, we may need to consider adding a method 
like getAppSubmissionQueue() that gets a queue by path or name; after that, we 
would put the normalized queue path back into the application's submission 
context to make sure that, inside the scheduler, we always refer to the queue 
path.
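
A sketch of that flow (getAppSubmissionQueue() does not exist yet; its name and 
semantics here are only illustrative):
{code:java}
// Resolve by short name or full path at submission time, then write the
// normalized full path back so everything downstream works on queue paths.
CSQueue csqueue =
    ((CapacityScheduler) scheduler).getAppSubmissionQueue(queueName);
if (csqueue != null) {
  submissionContext.setQueue(csqueue.getQueuePath());
}
{code}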

For the comment from [~wilfreds]: 
{quote}The important part is applying a new configuration. If the configuration 
adds a leaf queue that is not unique the configuration update currently is 
rejected. With this change we would allow that config to become active. This 
*could* break existing applications when they try to submit to the leaf queue 
that is no longer unique.
{quote}
I personally think it is not a big deal if the application rejection reason 
from the RM clearly guides users to use the fully qualified queue path when 
duplicated queue names exist. It is like a team with only one Peter: we can use 
the first name only; otherwise we add the last name to avoid confusion. It isn't 
counter-intuitive to me.

Also, we need to handle queue mapping by queue path instead of queue name as 
well; I didn't see that in the design doc, or maybe I missed it.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10009) In Capacity Scheduler, DRC can treat minimum user limit percent as a max when custom resource is defined

2019-12-06 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990169#comment-16990169
 ] 

Wangda Tan commented on YARN-10009:
---

[~epayne], is the failure related?

Thanks

> In Capacity Scheduler, DRC can treat minimum user limit percent as a max when 
> custom resource is defined
> 
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.10.0, 3.3.0, 3.2.1, 3.1.3, 2.11.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-10009.001.patch, YARN-10009.002.patch, 
> YARN-10009.003.patch, YARN-10009.UT.patch
>
>
> | |Memory|Vcores|res_1|
> |Queue1 Totals|20GB|100|80|
> |Resources requested by App1 in Queue1|8GB (40% of total)|8 (8% of total)|80 
> (100% of total)|
> In the previous use case:
>  - Queue1 has a value of 25 for {{miminum-user-limit-percent}}
>  - User1 has requested 8 containers with {{}} 
> each
>  - {{res_1}} will be the dominant resource this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 2 containers are assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10009) In Capacity Scheduler, DRC can treat minimum user limit percent as a max when custom resource is defined

2019-12-04 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988283#comment-16988283
 ] 

Wangda Tan commented on YARN-10009:
---

+1 from my side, except one comment:
{quote}[^YARN-10009.001.patch]+ // allocate 5 containers for app1 with 1GB 
memory, 1 vcore, 5 res_1s
{quote}
The above comment is not right in the test case.

[~sunilg], do you want to take a look?

> In Capacity Scheduler, DRC can treat minimum user limit percent as a max when 
> custom resource is defined
> 
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.10.0, 3.3.0, 3.2.1, 3.1.3, 2.11.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-10009.001.patch, YARN-10009.UT.patch
>
>
> | |Memory|Vcores|res_1|
> |Queue1 Totals|20GB|100|80|
> |Resources requested by App1 in Queue1|8GB (40% of total)|8 (8% of total)|80 
> (100% of total)|
> In the previous use case:
>  - Queue1 has a value of 25 for {{miminum-user-limit-percent}}
>  - User1 has requested 8 containers with {{}} 
> each
>  - {{res_1}} will be the dominant resource this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 2 containers are assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10009) In Capacity Scheduler, DRC can treat minimum user limit percent as a max when custom resource is defined

2019-12-04 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10009:
--
Priority: Critical  (was: Major)

> In Capacity Scheduler, DRC can treat minimum user limit percent as a max when 
> custom resource is defined
> 
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 2.10.0, 3.3.0, 3.2.1, 3.1.3, 2.11.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-10009.001.patch, YARN-10009.UT.patch
>
>
> | |Memory|Vcores|res_1|
> |Queue1 Totals|20GB|100|80|
> |Resources requested by App1 in Queue1|8GB (40% of total)|8 (8% of total)|80 
> (100% of total)|
> In the previous use case:
>  - Queue1 has a value of 25 for {{miminum-user-limit-percent}}
>  - User1 has requested 8 containers with {{}} 
> each
>  - {{res_1}} will be the dominant resource this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 2 containers are assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978955#comment-16978955
 ] 

Wangda Tan commented on YARN-8373:
--

Thanks [~wilfreds]  for the patch and everybody for the review!

Patch looks good to me!

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Fix For: 3.3.0, 3.2.2
>
> Attachments: YARN-8373-branch-3.1.001.patch, 
> YARN-8373-branch.3.1.001.patch, YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch, YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9927) RM multi-thread event processing mechanism

2019-10-22 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957378#comment-16957378
 ] 

Wangda Tan commented on YARN-9927:
--

Thanks [~hcarrot] for working on this.

Tagging: [~prabhujoseph] , [~jhung] ,[~sunil.gov...@gmail.com] , [~epayne] for 
review.

> RM multi-thread event processing mechanism
> --
>
> Key: YARN-9927
> URL: https://issues.apache.org/jira/browse/YARN-9927
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.0, 2.9.2
>Reporter: hcarrot
>Priority: Major
> Attachments: RM multi-thread event processing mechanism.pdf
>
>
> Recently, we have observed serious event blocking in RM event dispatcher 
> queue. After analysis of RM event monitoring data and RM event processing 
> logic, we found that
> 1) environment: a cluster with thousands of nodes
> 2) RMNodeStatusEvent dominates 90% time consumption of RM event scheduler
> 3) Meanwhile, RM event processing is in a single-thread mode, and It results 
> in the low headroom of RM event scheduler, thus performance of RM.
> So we proposed a RM multi-thread event processing mechanism to improve RM 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9887) Capacity scheduler: add support for limiting maxRunningApps per user

2019-10-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956297#comment-16956297
 ] 

Wangda Tan commented on YARN-9887:
--

[~pbacsko], [~epayne], IIRC the max-apps-per-user limit in FS applies across 
queues, but the CS workaround mentioned by Eric is per queue.

For the conversion tool, I think it might be good enough to document this and 
move on; adding another global per-user app limit sounds like it would create 
more issues for troubleshooting.
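
For reference, the FS-side setting being discussed is the per-user limit in the 
allocation file, roughly like this (the user name is only an example):
{noformat}
<user name="alice">
  <maxRunningApps>10</maxRunningApps>
</user>
{noformat}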

> Capacity scheduler: add support for limiting maxRunningApps per user
> 
>
> Key: YARN-9887
> URL: https://issues.apache.org/jira/browse/YARN-9887
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Priority: Major
>
> Fair Scheduler supports limiting the number of applications that a particular 
> user can submit:
> {noformat}
> 
>   10
> 
> {noformat}
> Capacity Scheduler does not have an exact equivalent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9886) Queue mapping based on userid passed through application tag

2019-10-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956284#comment-16956284
 ] 

Wangda Tan commented on YARN-9886:
--

[~kmarton], can we also make sure that only apps from privileged users can 
perform such an operation? Maybe we can add a config to the mapping policy so 
that only users like "hive" can do it. BTW, this is also a requirement from Hive 
when doAs is set to false.

cc: [~ashutoshc], [~thejas]
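
Something along these lines is what I mean, purely as a sketch (the property 
names are hypothetical, not existing configs):
{code}
# Only applications submitted by these users may place jobs based on the
# userid carried in the application tag:
yarn.resourcemanager.application-tag-based-placement.enable=true
yarn.resourcemanager.application-tag-based-placement.username.whitelist=hive
{code}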

> Queue mapping based on userid passed through application tag
> 
>
> Key: YARN-9886
> URL: https://issues.apache.org/jira/browse/YARN-9886
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
> Attachments: YARN-9886-WIP.patch
>
>
> There are situations when the real submitting user differs from the user that 
> arrives at YARN. For example, in the case of a Hive application when Hive 
> impersonation is turned off, the Hive queries will run as the Hive user and the 
> mapping is done based on this username. Unfortunately, in this case YARN 
> doesn't have any information about the real user, and there are cases when the 
> customer may want to map these applications to the real submitting user's 
> queue instead of the Hive one.
> For these cases, if they pass the username in the application tag, we may 
> read it and use that one during queue mapping, if that user has the right to 
> run in the real user's queue.  
> [~sunilg] please correct me if I missed something.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-9889) [UI] Add Application Tag column to RM All Applications table

2019-10-15 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9889:
-
Comment: was deleted

(was: Thanks [~kmarton] , thanks for working on this. To me there's no strong 
connections between the 3 tasks, can we move this Jira to a separate ticket and 
make the 2 children JIRAs to separate tickets also?)

> [UI] Add Application Tag column to RM All Applications table
> 
>
> Key: YARN-9889
> URL: https://issues.apache.org/jira/browse/YARN-9889
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
> Attachments: AllApplications_with_ApplicationTag.png, 
> YARN-9889.001.patch
>
>
> Right now AFAIK there is no possibility to filter the applications based on 
> the application tag in the UI. Adding this new column to the app table will 
> make this filtering possible as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9886) Queue mapping based on userid passed through application tag

2019-10-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952267#comment-16952267
 ] 

Wangda Tan commented on YARN-9886:
--

[~kmarton], thanks for working on this. To me there are no strong connections 
between the 3 tasks; can we move this Jira to a separate ticket and make the 2 
child JIRAs separate tickets as well?

> Queue mapping based on userid passed through application tag
> 
>
> Key: YARN-9886
> URL: https://issues.apache.org/jira/browse/YARN-9886
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
> Attachments: YARN-9886-WIP.patch
>
>
> There are situations when the real submitting user differs from the user that 
> arrives at YARN. For example, in the case of a Hive application when Hive 
> impersonation is turned off, the Hive queries will run as the Hive user and the 
> mapping is done based on this username. Unfortunately, in this case YARN 
> doesn't have any information about the real user, and there are cases when the 
> customer may want to map these applications to the real submitting user's 
> queue instead of the Hive one.
> For these cases, if they pass the username in the application tag, we may 
> read it and use that one during queue mapping, if that user has the right to 
> run in the real user's queue.  
> [~sunilg] please correct me if I missed something.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9889) [UI] Add Application Tag column to RM All Applications table

2019-10-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952265#comment-16952265
 ] 

Wangda Tan commented on YARN-9889:
--

Thanks [~kmarton] for working on this. To me there are no strong connections 
between the 3 tasks; can we move this Jira to a separate ticket and make the 2 
child JIRAs separate tickets as well?

> [UI] Add Application Tag column to RM All Applications table
> 
>
> Key: YARN-9889
> URL: https://issues.apache.org/jira/browse/YARN-9889
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Kinga Marton
>Assignee: Kinga Marton
>Priority: Major
> Attachments: AllApplications_with_ApplicationTag.png, 
> YARN-9889.001.patch
>
>
> Right now AFAIK there is no possibility to filter the applications based on 
> the application tag in the UI. Adding this new column to the app table will 
> make this filtering possible as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9656) Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.

2019-10-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951399#comment-16951399
 ] 

Wangda Tan commented on YARN-9656:
--

[~pgolash], [~mayank_bansal], to me, if a node cannot schedule new tasks 
because of either a near-full disk or being stressed, it is in the same 
"unhealthy" state.

Is there any diagnostic we can use to record a reason why the node is unhealthy? 
If we can add an "unhealthy reason/type" to the node info, is that good enough 
to solve the problem? Putting this into a file and loading it in the RM seems 
like just a way to bypass the RPC between RM and NM, but it leaves a lot of work 
to the plugin to implement logic like collecting NM metrics, writing them to a 
file, and placing it on a filesystem accessible by the RM.

If we choose to leave the plugin in the NM, anybody can implement new logic to 
categorize issues on the NM, and admins can query it from the web UI, etc.

Thoughts?

> Plugin to avoid scheduling jobs on node which are not in "schedulable" state, 
> but are healthy otherwise.
> 
>
> Key: YARN-9656
> URL: https://issues.apache.org/jira/browse/YARN-9656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.9.1, 3.1.2
>Reporter: Prashant Golash
>Assignee: Prashant Golash
>Priority: Major
> Attachments: 2.patch
>
>
> Creating this Jira to get idea from the community if this is something 
> helpful which can be done in YARN. Some times the nodes go in a bad state for 
> e.g. (H/W problem: I/O is bad; Fan problem). In some other scenarios, if 
> CGroup is not enabled, nodes may be running very high on CPU and the jobs 
> scheduled on them will suffer.
>  
> The idea is three-fold:
>  # Gather relevant metrics from node-managers and put in some form (for e.g. 
> exclude file).
>  # RM loads the files and put the nodes as part of the blacklist.
>  # Once the node becomes good, they can again be put in the whitelist.
> Various optimizations can be done here, but I would like to understand if 
> this is something which could be helpful as an upstream feature in YARN.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Deleted] (YARN-9878) the

2019-10-08 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan deleted YARN-9878:
-


> the
> ---
>
> Key: YARN-9878
> URL: https://issues.apache.org/jira/browse/YARN-9878
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: jenny
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9656) Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.

2019-09-29 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940508#comment-16940508
 ] 

Wangda Tan commented on YARN-9656:
--

[~pgolash], how about just defining these nodes as unhealthy? The script that 
checks whether a node is healthy can be defined by the admin. If, in production, 
a node has CPU utilization high enough to cause job issues, to me it is 
reasonable to declare that node unhealthy.
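
For example, the existing health-script hook can already express this (values 
are examples; the ERROR-line convention is how the NM health checker decides a 
node is unhealthy):
{code}
yarn.nodemanager.health-checker.script.path=/usr/local/bin/yarn-node-health.sh
yarn.nodemanager.health-checker.interval-ms=60000
# The script reports the node unhealthy by printing a line starting with ERROR,
# e.g. when CPU utilization stays above an admin-chosen threshold.
{code}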

> Plugin to avoid scheduling jobs on node which are not in "schedulable" state, 
> but are healthy otherwise.
> 
>
> Key: YARN-9656
> URL: https://issues.apache.org/jira/browse/YARN-9656
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.9.1, 3.1.2
>Reporter: Prashant Golash
>Assignee: Prashant Golash
>Priority: Major
> Attachments: 2.patch
>
>
> Creating this Jira to get idea from the community if this is something 
> helpful which can be done in YARN. Some times the nodes go in a bad state for 
> e.g. (H/W problem: I/O is bad; Fan problem). In some other scenarios, if 
> CGroup is not enabled, nodes may be running very high on CPU and the jobs 
> scheduled on them will suffer.
>  
> The idea is three-fold:
>  # Gather relevant metrics from node-managers and put in some form (for e.g. 
> exclude file).
>  # RM loads the files and put the nodes as part of the blacklist.
>  # Once the node becomes good, they can again be put in the whitelist.
> Various optimizations can be done here, but I would like to understand if 
> this is something which could be helpful as an upstream feature in YARN.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-09-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934797#comment-16934797
 ] 

Wangda Tan commented on YARN-4946:
--

I would still prefer to revert the patch. But given my limited bandwidth, I hope 
to get someone to help review the details of the revert patch and the related 
fields before making a decision. 

cc: [~snemeth] , [~sunil.gov...@gmail.com] , [~Prabhu Joseph]

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9813) RM does not start on JDK11 when UIv2 is enabled

2019-09-06 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924600#comment-16924600
 ] 

Wangda Tan commented on YARN-9813:
--

Thanks [~eyang] for updating the patch. 

+1, pending Jenkins.

> RM does not start on JDK11 when UIv2 is enabled
> ---
>
> Key: YARN-9813
> URL: https://issues.apache.org/jira/browse/YARN-9813
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Critical
> Attachments: YARN-9813.001.patch, YARN-9813.002.patch, 
> YARN-9813.003.patch
>
>
> When starting a ResourceManager on JDK 11 with UIv2 enabled, RM startup fails 
> with the following message:
> {noformat}
> Error starting ResourceManager
> java.lang.ClassCastException: class 
> jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class 
> java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and 
> java.net.URLClassLoader are in module java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1190)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1333)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1531)
> {noformat}
> It is a known issue that the systemClassLoader is not URLClassLoader anymore 
> from JDK9 (see related UT failure: YARN-9512). 
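
For context, the failure pattern can be reproduced outside YARN with a minimal 
sketch (this is not the actual ResourceManager code): casting the application 
class loader to URLClassLoader works on JDK 8 but throws ClassCastException on 
JDK 9+.

{code:java}
import java.net.URLClassLoader;

public class ClassLoaderCastSketch {
  public static void main(String[] args) {
    ClassLoader cl = ClassLoader.getSystemClassLoader();
    // On JDK 8 the system class loader is a URLClassLoader, so this cast succeeds.
    // On JDK 9+ it is jdk.internal.loader.ClassLoaders$AppClassLoader, so this
    // line throws ClassCastException, matching the error above.
    URLClassLoader urlCl = (URLClassLoader) cl;
    System.out.println(urlCl.getURLs().length + " classpath URLs");
  }
}
{code}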



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler

2019-09-05 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923763#comment-16923763
 ] 

Wangda Tan commented on YARN-9698:
--

Thanks [~shuzirra] , [~Prabhu Joseph] ,[~snemeth] , [~wilfreds] , [~sunilg]  
for sorting this out! Looks great! 

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> 
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Priority: Major
>  Labels: fs2cs
> Attachments: FS-CS Migration.pdf
>
>
> We see that some users want to migrate from the Fair Scheduler to the Capacity 
> Scheduler. This Jira is created as an umbrella to track all related efforts for 
> the migration. The scope contains:
>  * Bug fixes
>  * Add missing features
>  * Migration tools that help to generate CS configs based on FS, validate 
> configs, etc.
>  * Documents
> This is part of the CS component; the purpose is to make the migration process 
> smooth.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-04 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922915#comment-16922915
 ] 

Wangda Tan commented on YARN-9795:
--

[~fengnanli], thanks for working on the Jira. I just added you to the contributor 
list so you can assign YARN JIRAs to yourself in the future. It looks like an 
important improvement.

[~Tao Yang], [~tangzhankun], can you help review the patch? Thanks

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9795) ClusterMetrics to include AM allocation delay

2019-09-04 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-9795:


Assignee: Fengnan Li

> ClusterMetrics to include AM allocation delay
> -
>
> Key: YARN-9795
> URL: https://issues.apache.org/jira/browse/YARN-9795
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Attachments: YARN-9795.001.patch
>
>
> Add AM container allocation in QueueMetrics to help diagnose performance 
> issue. This is following 
> [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-02 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921081#comment-16921081
 ] 

Wangda Tan commented on YARN-9785:
--

+1 to the latest patch.

[~sunilg] do you want to take another look?

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure the property below in resource-types.xml:
> {quote}
> <property>
>  <name>yarn.resource-types</name>
>  <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after AM limit for a queue is reached. Applications 
> get activated even after limit is reached
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919264#comment-16919264
 ] 

Wangda Tan commented on YARN-9785:
--

Can we add tests to make sure there is no regression after this patch? Apart from 
that, I think we can get rid of isInvalidDivisor entirely; maybe we can do that in 
a separate patch, as this Jira blocks two releases.

Thanks,

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.wip.patch
>
>
> Configure the property below in resource-types.xml:
> {quote}
> <property>
>  <name>yarn.resource-types</name>
>  <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after AM limit for a queue is reached. Applications 
> get activated even after limit is reached
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-08-29 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918357#comment-16918357
 ] 

Wangda Tan commented on YARN-9785:
--

Thanks [~BilwaST] for the patch and everybody for discussing. 

I doubt whether this is the right fix:
{code:java}
public static boolean lessThanOrEqual(
ResourceCalculator resourceCalculator, 
Resource clusterResource,
Resource lhs, Resource rhs) {
  return resourceCalculator.fitsIn(lhs, rhs)
  && resourceCalculator.compare(clusterResource, lhs, rhs) <= 0;
} {code}
If lhs fits in rhs, the check after && is not needed, correct? And when we check 
less-than-or-equal, we compare the dominant values of the two resources; I don't 
understand why we check fitsIn here.
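
To make the question concrete, here is a minimal sketch (using only the signatures 
shown in the snippet above) of a case where {{fitsIn}} and the dominant-share 
comparison disagree; the resource values are arbitrary examples.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

public class LessThanOrEqualSketch {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    Resource cluster = Resource.newInstance(100 * 1024, 100);
    Resource lhs = Resource.newInstance(8 * 1024, 1);    // 8 GB, 1 vcore
    Resource rhs = Resource.newInstance(4 * 1024, 100);  // 4 GB, 100 vcores

    // Per-dimension check: false, because lhs needs more memory than rhs has.
    boolean fits = rc.fitsIn(lhs, rhs);
    // Dominant-share check: lhs's dominant share is 8% (memory), rhs's is
    // 100% (vcores), so compare() reports lhs <= rhs.
    int cmp = rc.compare(cluster, lhs, rhs);

    System.out.println("fitsIn=" + fits + ", compare=" + cmp);
    // With the snippet above, lessThanOrEqual would return false here purely
    // because of the fitsIn term, even though the dominant comparison is <=.
  }
}
{code}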

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.wip.patch
>
>
> Configure the property below in resource-types.xml:
> {quote}
> <property>
>  <name>yarn.resource-types</name>
>  <value>yarn.io/gpu</value>
> </property>
> {quote}
> Submit applications even after AM limit for a queue is reached. Applications 
> get activated even after limit is reached
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9770) Create a queue ordering policy which picks child queues with equal probability

2019-08-28 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917528#comment-16917528
 ] 

Wangda Tan commented on YARN-9770:
--

[~jhung], I understand the use case; however, I think this will break 
preemption. Suppose there are two queues A and B: A uses more than its guaranteed 
capacity and has pending resources, while B uses less than its guaranteed capacity 
and also has pending resources.

Before this patch, any resources preempted from A are guaranteed to be consumed 
by B. After this patch, it is possible that A gets preferential allocation order 
and receives the preempted resources again (see the sketch below).
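
For illustration only, here is a toy sketch of the two behaviors being contrasted. 
This is not the CapacityScheduler ordering-policy API; it just contrasts a 
utilization-sorted pick with a shuffled pick over hypothetical queue usage numbers.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class QueueOrderingSketch {
  // Hypothetical queue with a used-over-guaranteed ratio; illustration only.
  static class Q {
    final String name;
    final double usedOverGuaranteed;
    Q(String name, double ratio) { this.name = name; this.usedOverGuaranteed = ratio; }
    @Override public String toString() { return name; }
  }

  public static void main(String[] args) {
    List<Q> queues = new ArrayList<>();
    queues.add(new Q("A", 1.5)); // over its guarantee, still has pending asks
    queues.add(new Q("B", 0.5)); // under its guarantee, has pending asks

    // Utilization ordering: the least-utilized queue (B) is tried first, so
    // capacity preempted from A is offered to B.
    queues.sort(Comparator.comparingDouble((Q q) -> q.usedOverGuaranteed));
    System.out.println("utilization order: " + queues);

    // Equal-probability ordering: A may come first, so A can win back the very
    // containers that were just preempted from it.
    Collections.shuffle(queues);
    System.out.println("random order:      " + queues);
  }
}
{code}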

> Create a queue ordering policy which picks child queues with equal probability
> --
>
> Key: YARN-9770
> URL: https://issues.apache.org/jira/browse/YARN-9770
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9770.001.patch, YARN-9770.002.patch, 
> YARN-9770.003.patch
>
>
> Ran some simulations with the default queue_utilization_ordering_policy:
> An underutilized queue which receives an application with many (thousands) 
> resource requests will hog scheduler allocations for a long time (on the 
> order of a minute). In the meantime apps are getting submitted to all other 
> queues, which increases activeUsers in these queues, which drops user limit 
> in these queues to small values if minimum-user-limit-percent is configured 
> to small values (e.g. 10%).
> To avoid this issue, we assign to queues with equal probability, to avoid 
> scenarios where queues don't get allocations for a long time.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2019-08-26 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8657:
-
Priority: Major  (was: Critical)

> User limit calculation should be read-lock-protected within LeafQueue
> -
>
> Key: YARN-8657
> URL: https://issues.apache.org/jira/browse/YARN-8657
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8657.001.patch, YARN-8657.002.patch
>
>
> When async scheduling is enabled, user limit calculation could be wrong: 
> It is possible that the scheduler calculated a user_limit, but inside 
> {{canAssignToUser}} it becomes stale. 
> We need to protect user limit calculation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2019-08-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915509#comment-16915509
 ] 

Wangda Tan commented on YARN-8657:
--

I'd prefer to move this to the next releases and downgrade the priority. It only 
causes some trouble in the allocation phase, and the result will be double-checked 
by {{accept}} under the write lock (see the sketch below).
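
As a rough illustration of why this is tolerable, here is a generic sketch of the 
optimistic check-then-accept pattern: the proposal is computed under a read lock 
and re-validated under the write lock. It is not the CapacityScheduler code itself; 
the counters and limit are hypothetical.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class OptimisticAllocationSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int usedByUser = 0;        // hypothetical usage counter
  private final int userLimit = 10;  // hypothetical user limit

  /** Proposal phase: may read a slightly stale user limit under the read lock. */
  boolean proposeAllocation(int ask) {
    lock.readLock().lock();
    try {
      return usedByUser + ask <= userLimit;
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Accept phase: the proposal is re-checked under the write lock before commit. */
  boolean accept(int ask) {
    lock.writeLock().lock();
    try {
      if (usedByUser + ask > userLimit) {
        return false; // a stale proposal is rejected here, so no over-allocation
      }
      usedByUser += ask;
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    OptimisticAllocationSketch s = new OptimisticAllocationSketch();
    if (s.proposeAllocation(4)) {
      System.out.println("accepted=" + s.accept(4));
    }
  }
}
{code}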

> User limit calculation should be read-lock-protected within LeafQueue
> -
>
> Key: YARN-8657
> URL: https://issues.apache.org/jira/browse/YARN-8657
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8657.001.patch, YARN-8657.002.patch
>
>
> When async scheduling is enabled, user limit calculation could be wrong: 
> It is possible that the scheduler calculated a user_limit, but inside 
> {{canAssignToUser}} it becomes stale. 
> We need to protect user limit calculation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2019-08-26 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8657:
-
Target Version/s: 3.2.2, 3.1.4  (was: 3.2.1, 3.1.3)

> User limit calculation should be read-lock-protected within LeafQueue
> -
>
> Key: YARN-8657
> URL: https://issues.apache.org/jira/browse/YARN-8657
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8657.001.patch, YARN-8657.002.patch
>
>
> When async scheduling is enabled, user limit calculation could be wrong: 
> It is possible that the scheduler calculated a user_limit, but inside 
> {{canAssignToUser}} it becomes stale. 
> We need to protect user limit calculation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9751) Separate queue and app ordering policy capacity scheduler configs

2019-08-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911891#comment-16911891
 ] 

Wangda Tan commented on YARN-9751:
--

Thanks [~jhung] for the patch. 

Are there any behavior changes after this patch? The previous behavior is that 
the admin configures one config which works for both the queue and the app 
ordering policy. 

The expected new behavior should be (see the sketch below): 
- If the admin configures ordering-policy only, it should apply to both queue and 
app.
- If the admin configures both ordering-policy and app-ordering-policy, the queue 
and app policies can differ. 
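
A minimal sketch of that fallback behavior using plain Configuration keys; the 
app-ordering-policy key name is only illustrative of the proposal, not necessarily 
the property introduced by the patch.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class OrderingPolicyConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    String prefix = "yarn.scheduler.capacity.root.a.";

    // Existing key: today this single suffix has to serve both purposes.
    conf.set(prefix + "ordering-policy", "fair");

    // Hypothetical separate key for the app ordering policy (name is illustrative),
    // falling back to the shared key when it is not set.
    String appPolicy = conf.get(prefix + "app-ordering-policy",
        conf.get(prefix + "ordering-policy"));
    String queuePolicy = conf.get(prefix + "ordering-policy");

    System.out.println("app=" + appPolicy + ", queue=" + queuePolicy);
  }
}
{code}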

> Separate queue and app ordering policy capacity scheduler configs
> -
>
> Key: YARN-9751
> URL: https://issues.apache.org/jira/browse/YARN-9751
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Attachments: YARN-9751.001.patch
>
>
> Right now it's not possible to specify distinct app and queue ordering 
> policies since they share the same {{ordering-policy}} suffix.
> There's already a TODO in CapacitySchedulerConfiguration for this. This Jira 
> intends to fix it.
> {noformat}
> // TODO (wangda): We need to better distinguish app ordering policy and queue
> // ordering policy's classname / configuration options, etc. And dedup code
> // if possible.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler

2019-08-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9698:
-
Target Version/s: 3.3.0

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> 
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Priority: Major
>  Labels: fs2cs
>
> We see that some users want to migrate from the Fair Scheduler to the Capacity 
> Scheduler. This Jira is created as an umbrella to track all related efforts for 
> the migration. The scope contains:
>  * Bug fixes
>  * Add missing features
>  * Migration tools that help to generate CS configs based on FS, validate 
> configs, etc.
>  * Documents
> This is part of the CS component; the purpose is to make the migration process 
> smooth.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover

2019-02-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765169#comment-16765169
 ] 

Wangda Tan commented on YARN-9195:
--

Thanks [~ssy],

[~sunilg], [~cheersyang], if you have bandwidth, could you help to check the fix?

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.1.0
>Reporter: Shengyang Sha
>Assignee: Shengyang Sha
>Priority: Critical
> Attachments: YARN-9195.001.patch, YARN-9195.002.patch, 
> cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi, all:
> Previously we have encountered a serious problem in ResourceManager, we found 
> that pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in hierarchical structure, the root 
> queue's pending containers became negative at last, thus the scheduling 
> process of the whole cluster became affected.
> Both our RM server and the AMRM client in our application are based on 
> YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method 
> in our application to request resources from the RM.
> After investigation, we found that the direct cause was that the numAllocations 
> of some AMs' requests became negative after the RM failed over. There are at 
> least three necessary conditions:
> (1) Use schedulingRequests in AMRM client, and the application set zero to 
> the numAllocations for a schedulingRequest. In our batch job scenario, the 
> numAllocations of a schedulingRequest could turn to zero because 
> theoretically we can run a full batch job using only one container.
> (2) RM failovers.
> (3) Before AM reregisters itself to RM after RM restarts, RM has already 
> recovered some of the application's containers assigned before.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to AM once it 
> re-register itself through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will 
> removeFromOutstandingSchedulingRequests once AM gets 
> ContainersFromPreviousAttempts without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if it may not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> For better illustrating this case, I've written a test case based on the 
> latest hadoop trunk, posted in the attachment. You may try case 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.
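
A hedged sketch of the proposed client-side filtering, using only the response API 
named in the description above; the bookkeeping set and helper name are 
hypothetical, not the actual AMRMClientImpl internals.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class RecoveredContainerFilterSketch {
  // Hypothetical bookkeeping: container IDs this AM already knows were assigned.
  private final Set<ContainerId> alreadyAssigned = new HashSet<>();

  /**
   * Only containers the AM has not seen before should reduce the outstanding
   * scheduling requests; previously known containers were already accounted for.
   */
  List<Container> newlyRecoveredContainers(RegisterApplicationMasterResponse resp) {
    List<Container> result = new ArrayList<>();
    for (Container c : resp.getContainersFromPreviousAttempts()) {
      if (alreadyAssigned.add(c.getId())) {
        result.add(c);
      }
    }
    return result;
  }
}
{code}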



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8761) Service AM support for decommissioning component instances

2019-02-08 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763837#comment-16763837
 ] 

Wangda Tan commented on YARN-8761:
--

+1 to back port to branch-3.1, branch-3.2, thanks [~billie.rinaldi], [~eyang]. 

> Service AM support for decommissioning component instances
> --
>
> Key: YARN-8761
> URL: https://issues.apache.org/jira/browse/YARN-8761
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8761-branch-3.1.01.patch, YARN-8761.01.patch, 
> YARN-8761.02.patch, YARN-8761.03.patch, YARN-8761.04.patch, YARN-8761.05.patch
>
>
> The idea behind this feature is to have a flex down where specific component 
> instances are removed. Currently on a flex down, the service AM chooses for 
> removal the component instances with the highest IDs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9209) When nodePartition is not set in Placement Constraints, containers are allocated only in default partition

2019-01-28 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754249#comment-16754249
 ] 

Wangda Tan commented on YARN-9209:
--

[~tarunparimi], [~cheersyang],  

Actually, this is by design. There are two reasons why we don't want to 
support the ANY node partition.

1) The purpose of node partitions is to separate the cluster into multiple 
"virtual" clusters which have independent queue capacities, ACLs, usages, etc. We 
want the user to explicitly ask for one partition instead of "give me an arbitrary 
partition". The typical reason to have multiple partitions is to isolate important 
resources, such as GPU resources. We definitely don't want a sleep MR test job to 
use resources on the GPU partition unless it asks for them explicitly.

2) A more technical challenge is that once multiple partitions can be asked for, 
calculating pending resources becomes a problem. We don't want to double-count 
pending resources (because there's only one ask), but it also seems incorrect to 
only increase the default partition. A miscalculated pending resource will cause 
trouble when doing preemption.

We originally wanted to support the ANY partition when we did the original node 
partition support (YARN-796), but we gave up because of these two reasons. Please 
let me know if there are any solutions that resolve #1 and #2.

> When nodePartition is not set in Placement Constraints, containers are 
> allocated only in default partition
> --
>
> Key: YARN-9209
> URL: https://issues.apache.org/jira/browse/YARN-9209
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9209.001.patch
>
>
> When an application sets a placement constraint without specifying a 
> nodePartition, the default partition is always chosen as the constraint when 
> allocating containers. This can be a problem when an application is 
> submitted to a queue which doesn't have enough capacity available on the 
> default partition.
>  This is a common scenario when node labels are configured for a particular 
> queue. The below sample sleeper service cannot get even a single container 
> allocated when it is submitted to a "labeled_queue", even though enough 
> capacity is available on the label/partition configured for the queue. Only 
> the AM container runs. 
> {code:java}
> {
> "name": "sleeper-service",
> "version": "1.0.0",
> "queue": "labeled_queue",
> "components": [
> {
> "name": "sleeper",
> "number_of_containers": 2,
> "launch_command": "sleep 9",
> "resource": {
> "cpus": 1,
> "memory": "4096"
> },
> "placement_policy": {
> "constraints": [
> {
> "type": "ANTI_AFFINITY",
> "scope": "NODE",
> "target_tags": [
> "sleeper"
> ]
> }
> ]
> }
> }
> ]
> }
> {code}
> It runs fine if I specify the node_partition explicitly in the constraints 
> like below. 
> {code:java}
> {
> "name": "sleeper-service",
> "version": "1.0.0",
> "queue": "labeled_queue",
> "components": [
> {
> "name": "sleeper",
> "number_of_containers": 2,
> "launch_command": "sleep 9",
> "resource": {
> "cpus": 1,
> "memory": "4096"
> },
> "placement_policy": {
> "constraints": [
> {
> "type": "ANTI_AFFINITY",
> "scope": "NODE",
> "target_tags": [
> "sleeper"
> ],
> "node_partitions": [
> "label"
> ]
> }
> ]
> }
> }
> ]
> }
> {code} 
> The problem seems to be because only the default partition "" is considered 
> when node_partition constraint is not specified as seen in below RM log. 
> {code:java}
> 2019-01-17 16:51:59,921 INFO placement.SingleConstraintAppPlacementAllocator 
> (SingleConstraintAppPlacementAllocator.java:validateAndSetSchedulingRequest(367))
>  - Successfully added SchedulingRequest to 
> app=appattempt_1547734161165_0010_01 targetAllocationTags=[sleeper]. 
> nodePartition= 
> {code} 
> However, I think it makes more sense to consider "*" or the 
> {{default-node-label-expression}} of the 

[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover

2019-01-24 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751887#comment-16751887
 ] 

Wangda Tan commented on YARN-9195:
--

Thanks [~ssy],  

Could you rename the patch to YARN-9195.001.patch? (According to 
[https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute].)

And once you upload the patch, you can change the Jira to "Patch Available" so 
Jenkins will run the UTs. 

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.1.0
>Reporter: Shengyang Sha
>Priority: Critical
> Attachments: 
> cases_to_recreate_negative_pending_requests_scenario.diff, 
> patch.YARN-9195.diff
>
>
> Hi, all:
> Previously we have encountered a serious problem in ResourceManager, we found 
> that pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in hierarchical structure, the root 
> queue's pending containers became negative at last, thus the scheduling 
> process of the whole cluster became affected.
> Both our RM server and the AMRM client in our application are based on 
> YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method 
> in our application to request resources from the RM.
> After investigation, we found that the direct cause was that the numAllocations 
> of some AMs' requests became negative after the RM failed over. There are at 
> least three necessary conditions:
> (1) Use schedulingRequests in AMRM client, and the application set zero to 
> the numAllocations for a schedulingRequest. In our batch job scenario, the 
> numAllocations of a schedulingRequest could turn to zero because 
> theoretically we can run a full batch job using only one container.
> (2) RM failovers.
> (3) Before AM reregisters itself to RM after RM restarts, RM has already 
> recovered some of the application's containers assigned before.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to AM once it 
> re-register itself through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will 
> removeFromOutstandingSchedulingRequests once AM gets 
> ContainersFromPreviousAttempts without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if it may not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> For better illustrating this case, I've written a test case based on the 
> latest hadoop trunk, posted in the attachment. You may try case 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover

2019-01-24 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751889#comment-16751889
 ] 

Wangda Tan commented on YARN-9195:
--

[~ssy], I added you to the contributor list so you can assign JIRAs to yourself 
in the future.

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.1.0
>Reporter: Shengyang Sha
>Assignee: Shengyang Sha
>Priority: Critical
> Attachments: 
> cases_to_recreate_negative_pending_requests_scenario.diff, 
> patch.YARN-9195.diff
>
>
> Hi, all:
> Previously we have encountered a serious problem in ResourceManager, we found 
> that pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in hierarchical structure, the root 
> queue's pending containers became negative at last, thus the scheduling 
> process of the whole cluster became affected.
> Both our RM server and the AMRM client in our application are based on 
> YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method 
> in our application to request resources from the RM.
> After investigation, we found that the direct cause was that the numAllocations 
> of some AMs' requests became negative after the RM failed over. There are at 
> least three necessary conditions:
> (1) Use schedulingRequests in AMRM client, and the application set zero to 
> the numAllocations for a schedulingRequest. In our batch job scenario, the 
> numAllocations of a schedulingRequest could turn to zero because 
> theoretically we can run a full batch job using only one container.
> (2) RM failovers.
> (3) Before AM reregisters itself to RM after RM restarts, RM has already 
> recovered some of the application's containers assigned before.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to AM once it 
> re-register itself through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will 
> removeFromOutstandingSchedulingRequests once AM gets 
> ContainersFromPreviousAttempts without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if it may not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> For better illustrating this case, I've written a test case based on the 
> latest hadoop trunk, posted in the attachment. You may try case 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover

2019-01-24 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-9195:


Assignee: Shengyang Sha

> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.1.0
>Reporter: Shengyang Sha
>Assignee: Shengyang Sha
>Priority: Critical
> Attachments: 
> cases_to_recreate_negative_pending_requests_scenario.diff, 
> patch.YARN-9195.diff
>
>
> Hi, all:
> Previously we have encountered a serious problem in ResourceManager, we found 
> that pending container number of one RM queue became negative after RM failed 
> over. Since queues in RM are managed in hierarchical structure, the root 
> queue's pending containers became negative at last, thus the scheduling 
> process of the whole cluster became affected.
> Both our RM server and the AMRM client in our application are based on 
> YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method 
> in our application to request resources from the RM.
> After investigation, we found that the direct cause was that the numAllocations 
> of some AMs' requests became negative after the RM failed over. There are at 
> least three necessary conditions:
> (1) Use schedulingRequests in AMRM client, and the application set zero to 
> the numAllocations for a schedulingRequest. In our batch job scenario, the 
> numAllocations of a schedulingRequest could turn to zero because 
> theoretically we can run a full batch job using only one container.
> (2) RM failovers.
> (3) Before AM reregisters itself to RM after RM restarts, RM has already 
> recovered some of the application's containers assigned before.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to AM once it 
> re-register itself through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will 
> removeFromOutstandingSchedulingRequests once AM gets 
> ContainersFromPreviousAttempts without checking whether these containers have 
> been assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if it may not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> For better illustrating this case, I've written a test case based on the 
> latest hadoop trunk, posted in the attachment. You may try case 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart .
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster, and some sanity checks are also needed to prevent 
> things from getting worse.
> More comments and suggestions are welcomed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9204) RM fails to start if absolute resource is specified for partition capacity in CS queues

2019-01-21 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748110#comment-16748110
 ] 

Wangda Tan commented on YARN-9204:
--

Cherry-picked to branch-3.1.2 as well, thanks [~yangjiandan]/ [~cheersyang].

>  RM fails to start if absolute resource is specified for partition capacity 
> in CS queues
> 
>
> Key: YARN-9204
> URL: https://issues.apache.org/jira/browse/YARN-9204
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Blocker
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9204.001.patch, YARN-9204.002.patch, 
> YARN-9204.003.patch, YARN-9204.004.patch, YARN-9204.005.patch, 
> YARN-9204.006.patch
>
>
> When I set *yarn.scheduler.capacity.<queue-path>.capacity* and 
> *yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity*
> to absolute resource values, starting the RM fails and throws the following 
> exception. After diving into the related code, I found the logic for checking 
> absolute resource values may be wrong.
> {code:java}
> 2019-01-17 20:25:45,716 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
> ResourceManager
> java.lang.NumberFormatException: For input string: "[memory=40960,vcore=48]"
> at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
> at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
> at java.lang.Float.parseFloat(Float.java:451)
> at 
> org.apache.hadoop.conf.Configuration.getFloat(Configuration.java:1606)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.internalGetLabeledQueue
> Capacity(CapacitySchedulerConfiguration.java:655)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getLabeledQueueCapacity
> (CapacitySchedulerConfiguration.java:670)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadCapacitiesByLabelsFromConf(CSQueueUti
> ls.java:135)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadUpdateAndCheckCapacities(CSQueueUtils
> .java:110)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupConfigurableCapacities(AbstractCS
> Queue.java:179)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java
> :356)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java
> :323)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setupQueueConfigs(ParentQueue.java:130)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.<init>(ParentQueue.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySched
> ulerQueueManager.java:275)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(Capacit
> ySchedulerQueueManager.java:158)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.j
> ava:715)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java
> :360)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:4
> 25)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:817)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1218)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:317)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1500)
> 2019-01-17 20:25:45,719 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
> {code}
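
As a small illustration of the failure mode (not the committed fix): parsing the 
bracketed absolute-resource syntax as a float necessarily throws, so such a value 
has to be routed to an absolute-resource parser first. The bracket check below is 
a simplification for illustration.

{code:java}
public class AbsoluteCapacityParseSketch {
  public static void main(String[] args) {
    String value = "[memory=40960,vcore=48]";

    // This is effectively what happens today and produces the
    // NumberFormatException in the stack trace above.
    try {
      float pct = Float.parseFloat(value);
      System.out.println("percentage capacity: " + pct);
    } catch (NumberFormatException e) {
      System.out.println("not a percentage: " + e.getMessage());
    }

    // Simplified idea: bracketed values should be treated as absolute resources
    // instead of percentages (the real parsing logic lives in the scheduler).
    if (value.startsWith("[") && value.endsWith("]")) {
      System.out.println("absolute resource spec: " + value);
    }
  }
}
{code}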



--
This message was sent by 

[jira] [Updated] (YARN-8747) [UI2] YARN UI2 page loading failed due to js error under some time zone configuration

2019-01-21 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8747:
-
Fix Version/s: (was: 3.1.3)
   3.1.2

> [UI2] YARN UI2 page loading failed due to js error under some time zone 
> configuration
> -
>
> Key: YARN-8747
> URL: https://issues.apache.org/jira/browse/YARN-8747
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.1.1
>Reporter: collinma
>Assignee: collinma
>Priority: Critical
> Fix For: 2.10.0, 3.0.4, 3.1.2, 3.3.0, 3.2.1, 2.9.3
>
> Attachments: YARN-8747.001.patch, image-2018-09-05-18-54-03-991.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We deployed Hadoop 3.1.1 on CentOS 7.2 servers whose time zone is configured 
> as GMT+8; the web browser time zone is GMT+8 too. The YARN UI page failed to 
> load due to a JS error:
>  
> !image-2018-09-05-18-54-03-991.png!
> The moment-timezone JS component raised that error. This has been fixed in 
> moment-timezone v0.5.1 (see 
> [https://github.com/moment/moment-timezone/issues/294]). We need 
> to update the moment-timezone version accordingly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9204) RM fails to start if absolute resource is specified for partition capacity in CS queues

2019-01-21 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9204:
-
Fix Version/s: (was: 3.1.3)
   3.1.2

>  RM fails to start if absolute resource is specified for partition capacity 
> in CS queues
> 
>
> Key: YARN-9204
> URL: https://issues.apache.org/jira/browse/YARN-9204
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Blocker
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9204.001.patch, YARN-9204.002.patch, 
> YARN-9204.003.patch, YARN-9204.004.patch, YARN-9204.005.patch, 
> YARN-9204.006.patch
>
>
> When I set *yarn.scheduler.capacity.<queue-path>.capacity* and 
> *yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity*
> to absolute resource values, starting the RM fails and throws the following 
> exception. After diving into the related code, I found the logic for checking 
> absolute resource values may be wrong.
> {code:java}
> 2019-01-17 20:25:45,716 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
> ResourceManager
> java.lang.NumberFormatException: For input string: "[memory=40960,vcore=48]"
> at 
> sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
> at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
> at java.lang.Float.parseFloat(Float.java:451)
> at 
> org.apache.hadoop.conf.Configuration.getFloat(Configuration.java:1606)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.internalGetLabeledQueue
> Capacity(CapacitySchedulerConfiguration.java:655)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getLabeledQueueCapacity
> (CapacitySchedulerConfiguration.java:670)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadCapacitiesByLabelsFromConf(CSQueueUti
> ls.java:135)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadUpdateAndCheckCapacities(CSQueueUtils
> .java:110)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupConfigurableCapacities(AbstractCS
> Queue.java:179)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java
> :356)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java
> :323)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setupQueueConfigs(ParentQueue.java:130)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.<init>(ParentQueue.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySched
> ulerQueueManager.java:275)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(Capacit
> ySchedulerQueueManager.java:158)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.j
> ava:715)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java
> :360)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:4
> 25)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:817)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1218)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:317)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1500)
> 2019-01-17 20:25:45,719 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM

2019-01-21 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748109#comment-16748109
 ] 

Wangda Tan commented on YARN-9194:
--

Cherry-picked to branch-3.1.2 as well.

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and 
> NullPointerException happens in RM while shutdown a NM
> -
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt fails, the REGISTERED event arrives, hence the 
> InvalidStateTransitionException happens.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> App attempt: appattempt_1547311267249_0001_02 can't handle this event at 
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> REGISTERED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM

2019-01-21 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9194:
-
Fix Version/s: (was: 3.1.3)
   3.1.2

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and 
> NullPointerException happens in RM while shutdown a NM
> -
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt fails, the REGISTERED event arrives, hence the 
> InvalidStateTransitionException happens.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> App attempt: appattempt_1547311267249_0001_02 can't handle this event at 
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> REGISTERED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8747) [UI2] YARN UI2 page loading failed due to js error under some time zone configuration

2019-01-21 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748108#comment-16748108
 ] 

Wangda Tan commented on YARN-8747:
--

Cherry-picked to branch-3.1.2 as well. Updated fix version

> [UI2] YARN UI2 page loading failed due to js error under some time zone 
> configuration
> -
>
> Key: YARN-8747
> URL: https://issues.apache.org/jira/browse/YARN-8747
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.1.1
>Reporter: collinma
>Assignee: collinma
>Priority: Critical
> Fix For: 2.10.0, 3.0.4, 3.1.2, 3.3.0, 3.2.1, 2.9.3
>
> Attachments: YARN-8747.001.patch, image-2018-09-05-18-54-03-991.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We deployed Hadoop 3.1.1 on CentOS 7.2 servers whose time zone is configured 
> as GMT+8; the web browser time zone is GMT+8 too. The YARN UI page failed to 
> load due to a JS error:
>  
> !image-2018-09-05-18-54-03-991.png!
> The moment-timezone JS component raised that error. This has been fixed in 
> moment-timezone v0.5.1 (see 
> [https://github.com/moment/moment-timezone/issues/294]). We need 
> to update the moment-timezone version accordingly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


