[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2022-02-10 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490468#comment-17490468
 ] 

Andras Gyori commented on YARN-10821:
-

[~epayne] I think we can close this for now, as we did not encounter this issue 
again. Thank you for your help!

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which are 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves the headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are capped in an unexpected 
> fashion:
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
> float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>     1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit, however, the user limit is calculated 
> directly as (currentCapacity * userLimit):
> {code:java}
> Resource userLimitResource = Resources.max(resourceCalculator,
>     partitionResource,
>     Resources.divideAndCeil(resourceCalculator, resourceUsed,
>         usersSummedByWeight),
>     Resources.divideAndCeil(resourceCalculator,
>         Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>         100));
> {code}
> The fewer users occupy the queue, the more pronounced this effect is during 
> preemption.
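
To make the inconsistency concrete, the following sketch (illustrative only, not 
the attached patch) applies the active-user floor from 
getUserAMResourceLimitPerPartition inside the user-limit computation that 
preemption relies on; all names mirror the snippets quoted above:

{code:java}
// Floor the configured user limit percent by 1 / activeUsers, exactly as
// getUserAMResourceLimitPerPartition does:
float effectiveUserLimit = Math.max(getUserLimit() / 100.0f,
    1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));

// Hypothetical adjustment: the floored fraction replaces the raw
// (currentCapacity * userLimit / 100) term of computeUserLimit.
Resource userLimitResource = Resources.max(resourceCalculator,
    partitionResource,
    Resources.divideAndCeil(resourceCalculator, resourceUsed,
        usersSummedByWeight),
    Resources.multiplyAndRoundDown(currentCapacity, effectiveUserLimit));
{code}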






[jira] [Created] (YARN-11070) Minimum resource ratio is overridden by subsequent labels

2022-01-27 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11070:
---

 Summary: Minimum resource ratio is overridden by subsequent labels
 Key: YARN-11070
 URL: https://issues.apache.org/jira/browse/YARN-11070
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Andras Gyori
Assignee: Andras Gyori


effectiveMinRatioPerResource is used in the downscaling process when absolute 
resources are used. It is correctly calculated for all labels; however, it is 
overridden in each iteration. As a result, the normalisation ratios used in 
further calculations are based only on the last label.
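
A minimal sketch of the suspected pattern (hypothetical names and simplified 
types, not the actual code), showing why only the last label's ratios survive 
and how keying the map by label would fix it:

{code:java}
// Suspected pattern: the map computed for each label replaces the previous
// one, so downstream calculations only ever see the last label's ratios.
Map<String, Float> effectiveMinRatioPerResource = null;
for (String label : configuredNodeLabels) {
  effectiveMinRatioPerResource = calculateMinRatios(label); // overwritten each time
}

// Possible fix: keep one ratio map per label.
Map<String, Map<String, Float>> minRatiosByLabel = new HashMap<>();
for (String label : configuredNodeLabels) {
  minRatiosByLabel.put(label, calculateMinRatios(label));
}
{code}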






[jira] [Created] (YARN-11067) Resource overcommitment due to incorrect resource normalisation logical order

2022-01-24 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11067:
---

 Summary: Resource overcommitment due to incorrect resource 
normalisation logical order
 Key: YARN-11067
 URL: https://issues.apache.org/jira/browse/YARN-11067
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andras Gyori
Assignee: Andras Gyori


A rather serious overcommitment issue was discovered when using ABSOLUTE 
resources as capacities. A minimal way to reproduce the issue is the following:
 # We have a cluster with 32 GB memory and 16 VCores. Create the following 
hierarchy with the corresponding capacities:

 ## root.capacity = [memory=54GiB, vcores=28]
 ## root.a.capacity = [memory=50GiB, vcores=20]
 ## root.a1.capacity = [memory=30GiB, vcores=15]
 ## root.a2.capacity = [memory=20GiB, vcores=5]
 # Remove a node from the cluster (not even an unusual event), e.g. a node 
with resource [memory=8GiB, vcores=4]
 # Because the normalised resource ratio is calculated BEFORE the effective 
resource of the queue is recalculated, a cascade is created which results in 
overcommitment in the queue hierarchy (a numeric sketch follows; see 
[https://github.com/apache/hadoop/blob/5ef335da1ed49e06cc8973412952e09ed08bb9c0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java#L1294])
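
A back-of-the-envelope illustration of the cascade, using the numbers above 
(memory only; the real code scales vcores as well):

{code:java}
double configuredRoot = 54.0;       // GiB, root absolute capacity
double clusterBefore  = 32.0;       // GiB
double clusterAfter   = 32.0 - 8.0; // GiB, after the node is removed

double staleRatio = clusterBefore / configuredRoot; // ~0.593
double freshRatio = clusterAfter  / configuredRoot; // ~0.444

// If children are downscaled with staleRatio while the parent's effective
// resource has already shrunk to clusterAfter, the children's effective
// resources sum above the parent's: that is the overcommitment.
System.out.printf("stale=%.3f fresh=%.3f%n", staleRatio, freshRatio);
{code}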

 






[jira] [Created] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths

2022-01-11 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11063:
---

 Summary: Support auto queue creation template wildcards for 
arbitrary queue depths
 Key: YARN-11063
 URL: https://issues.apache.org/jira/browse/YARN-11063
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Andras Gyori
Assignee: Andras Gyori


With the introduction of YARN-10632, we need to support more than one wildcard 
in queue templates.






[jira] [Assigned] (YARN-11017) Unify node label access in queues

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-11017:
---

Assignee: Andras Gyori

> Unify node label access in queues
> -
>
> Key: YARN-11017
> URL: https://issues.apache.org/jira/browse/YARN-11017
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Currently there are a handful of ways in which queues can access node 
> labels. A non-exhaustive list:
>  # configuredNodeLabels
>  # getNodeLabelsForQueue()
>  # QueueCapacities#getNodePartitionsSet()
>  # ResourceUsage#getNodePartitionsSet()
>  # accessibleNodeLabels
> It is worth revisiting, as there is already a bug that was implicitly 
> caused by this inconsistency (YARN-11016).
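
One possible direction for the unification (a hypothetical sketch, not a 
committed design) is to route every caller through a single accessor on the 
queue:

{code:java}
// Hypothetical single source of truth; the five access paths above would
// delegate to (or be replaced by) this method.
public interface NodeLabelAccess {
  /** All node labels this queue is configured for, resolved once. */
  Set<String> getConfiguredNodeLabels();
}
{code}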






[jira] [Assigned] (YARN-10944) AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10944:
---

Assignee: Andras Gyori

> AbstractCSQueue: Eliminate code duplication in overloaded versions of 
> setMaxCapacity
> 
>
> Key: YARN-10944
> URL: https://issues.apache.org/jira/browse/YARN-10944
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>
> Methods are:
> - AbstractCSQueue#setMaxCapacity(float)
> - AbstractCSQueue#setMaxCapacity(java.lang.String, float)
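
A minimal sketch of the deduplication, assuming the label-aware overload is the 
general one (NO_LABEL comes from CommonNodeLabelsManager; the body is condensed, 
not copied from the existing methods):

{code:java}
void setMaxCapacity(float maximumCapacity) {
  // Delegate the label-less variant to the label-aware one.
  setMaxCapacity(CommonNodeLabelsManager.NO_LABEL, maximumCapacity);
}

void setMaxCapacity(String nodeLabel, float maximumCapacity) {
  writeLock.lock();
  try {
    // Validation and capacity bookkeeping live in one place for both variants.
    CSQueueUtils.checkMaxCapacity(getQueuePath(),
        queueCapacities.getCapacity(nodeLabel), maximumCapacity);
    queueCapacities.setMaximumCapacity(nodeLabel, maximumCapacity);
  } finally {
    writeLock.unlock();
  }
}
{code}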






[jira] [Assigned] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10590:
---

Assignee: Andras Gyori  (was: Qi Zhu)

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> "Because as we discussed in YARN-10504 , the initialization of auto created 
> queues from template was changed (see comment and comment)."
> 1. As the comment we discussed, we found the effective core is different(the 
> gap), because the update effective  will override the absolute auto created 
> leaf queue.
> 2. But actually, the new logic in YARN-10504 override is right, the 
> difference is caused by test case , don't consider the calculation loss of 
> multi resource type, the cap/absolute are all calculated by one type, 
> (memory) in DefaultResourceCalculator, (dominant type) in 
> DominantResourceCalculator. As we known in the comment, the absolute auto 
> created leaf queue will merge the effective resource by cap/absolute 
> calculated result, this caused the gap.
> 2. In other case(not absolute case) in the auto created leaf queue, the merge 
> will not cause the gap, in update effective resource override will also use 
> the one type calculated result. 
> 3. So this jira just make the test right, the calculation result is already 
> right.
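
To illustrate the single-type calculation loss described above (a toy example, 
not the patch):

{code:java}
// With DefaultResourceCalculator only memory drives the percentage, so a
// capacity derived from memory and applied back to vcores cannot reproduce
// the configured vcores exactly.
Resource parent     = Resource.newInstance(32 * 1024, 16); // 32GiB, 16 vcores
Resource configured = Resource.newInstance(10 * 1024, 7);  // absolute config

float capacityByMemory = (float) configured.getMemorySize()
    / parent.getMemorySize();                               // 0.3125
int derivedVcores = (int) (parent.getVirtualCores() * capacityByMemory); // 5

// 5 != 7: merging the effective resource from the memory-derived percentage
// loses the configured vcores -- the "gap" described above.
{code}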






[jira] [Assigned] (YARN-10947) Simplify AbstractCSQueue#initializeQueueState

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10947:
---

Assignee: Andras Gyori

> Simplify AbstractCSQueue#initializeQueueState
> -
>
> Key: YARN-10947
> URL: https://issues.apache.org/jira/browse/YARN-10947
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>







[jira] [Commented] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469780#comment-17469780
 ] 

Andras Gyori commented on YARN-10590:
-

[~zhuqi] I would like to work on this if you do not mind!

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>
> "Because as we discussed in YARN-10504 , the initialization of auto created 
> queues from template was changed (see comment and comment)."
> 1. As the comment we discussed, we found the effective core is different(the 
> gap), because the update effective  will override the absolute auto created 
> leaf queue.
> 2. But actually, the new logic in YARN-10504 override is right, the 
> difference is caused by test case , don't consider the calculation loss of 
> multi resource type, the cap/absolute are all calculated by one type, 
> (memory) in DefaultResourceCalculator, (dominant type) in 
> DominantResourceCalculator. As we known in the comment, the absolute auto 
> created leaf queue will merge the effective resource by cap/absolute 
> calculated result, this caused the gap.
> 2. In other case(not absolute case) in the auto created leaf queue, the merge 
> will not cause the gap, in update effective resource override will also use 
> the one type calculated result. 
> 3. So this jira just make the test right, the calculation result is already 
> right.






[jira] [Updated] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10590:

Summary: Fix legacy auto queue creation absolute resource calculation loss  
(was: Fix TestCapacitySchedulerAutoCreatedQueueBase with related absolute 
calculation loss)

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>
> "Because as we discussed in YARN-10504 , the initialization of auto created 
> queues from template was changed (see comment and comment)."
> 1. As the comment we discussed, we found the effective core is different(the 
> gap), because the update effective  will override the absolute auto created 
> leaf queue.
> 2. But actually, the new logic in YARN-10504 override is right, the 
> difference is caused by test case , don't consider the calculation loss of 
> multi resource type, the cap/absolute are all calculated by one type, 
> (memory) in DefaultResourceCalculator, (dominant type) in 
> DominantResourceCalculator. As we known in the comment, the absolute auto 
> created leaf queue will merge the effective resource by cap/absolute 
> calculated result, this caused the gap.
> 2. In other case(not absolute case) in the auto created leaf queue, the merge 
> will not cause the gap, in update effective resource override will also use 
> the one type calculated result. 
> 3. So this jira just make the test right, the calculation result is already 
> right.






[jira] [Created] (YARN-11059) Investigate whether legacy Auto Queue Creation in absolute mode works seamlessly when calling updateClusterResource

2022-01-06 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11059:
---

 Summary: Investigate whether legacy Auto Queue Creation in 
absolute mode works seamlessly when calling updateClusterResource
 Key: YARN-11059
 URL: https://issues.apache.org/jira/browse/YARN-11059
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Andras Gyori
Assignee: Andras Gyori


Due to this check in ParentQueue#getCapacityConfigurationTypeForQueues:
{code:java}
if (queues.iterator().hasNext() &&
    !queues.iterator().next().getQueuePath().equals(
        CapacitySchedulerConfiguration.ROOT) &&
    (percentageIsSet ? 1 : 0) + (weightIsSet ? 1 : 0)
        + (absoluteMinResSet ? 1 : 0) > 1) {
  throw new IOException("Parent queue '" + getQueuePath()
      + "' have children queue used mixed of "
      + " weight mode, percentage and absolute mode, it is not allowed, please "
      + "double check, details:" + diagMsg.toString());
}
{code}
I was unable to call updateClusterResource on a ManagedParentQueue when its 
children are in absolute mode. updateClusterResource is called whenever a node 
is updated etc., therefore it could break at any time.






[jira] [Commented] (YARN-10918) Simplify method: CapacitySchedulerQueueManager#parseQueue

2022-01-05 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469139#comment-17469139
 ] 

Andras Gyori commented on YARN-10918:
-

I think the queue creation itself does not justify a separate queue factory (as 
it's literally only the construction of the queues). There are only 2 
validations here, so I thought we should keep this as simple as possible.

> Simplify method: CapacitySchedulerQueueManager#parseQueue
> -
>
> Key: YARN-10918
> URL: https://issues.apache.org/jira/browse/YARN-10918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>
> Ideas for simplifying this method:
> - Define a queue factory
> - Separate validation logic






[jira] [Commented] (YARN-10922) Investigation: Verify if legacy AQC works as documented

2022-01-04 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468496#comment-17468496
 ] 

Andras Gyori commented on YARN-10922:
-

[~tdomok] Can this be closed? Is there any follow-up item we could take away 
from here?

> Investigation: Verify if legacy AQC works as documented
> ---
>
> Key: YARN-10922
> URL: https://issues.apache.org/jira/browse/YARN-10922
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Tamas Domok
>Priority: Minor
> Attachments: capacity-scheduler.xml
>
>
> Quoting from the Capacity Scheduler documentation: 
> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> Section: "Dynamic Auto-Creation and Management of Leaf Queues"
> The task is to verify if legacy AQC works like this: 
> {quote}
> The parent queue which has been enabled for auto leaf queue creation, 
> supports the configuration of template parameters for automatic configuration 
> of the auto-created leaf queues. The auto-created queues support all of the 
> leaf queue configuration parameters except for Queue ACL, Absolute Resource 
> configurations. Queue ACLs are currently inherited from the parent queue i.e 
> they are not configurable on the leaf queue template
> {quote}






[jira] [Commented] (YARN-10943) AbstractCSQueue: Create separate class for encapsulating Min / Max Resource

2022-01-04 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468495#comment-17468495
 ] 

Andras Gyori commented on YARN-10943:
-

[~snemeth] Not sure if it's worth the effort to implement this. What is the 
advantage of this refactor? As Java 8 does not have destructuring or an easy 
way to handle POJOs, I would refrain from this change.

> AbstractCSQueue: Create separate class for encapsulating Min / Max Resource
> ---
>
> Key: YARN-10943
> URL: https://issues.apache.org/jira/browse/YARN-10943
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>
> There are certain methods where min and max Resources are used in tandem.
>  Some examples of such methods:
>  - getMinimumAbsoluteResource / getMaximumAbsoluteResource
>  - *updateConfigurableResourceLimits:*
>  - It invokes setConfiguredMinResource / setConfiguredMaxResource on 
> QueueResourceQuotas. That object could define a simple method that receives 
> the MinMaxResource alone.
>  - Validator methods are also receiving min/max resources as separate 
> parameters, which could be tied together.
>  - updateEffectiveResources: It performs operations with effective min/max 
> resources.
> Alternatively, 2 classes could be created:
>  - One for EffectiveMinMaxResource
>  - And another for AbsoluteMinMaxResource
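
A sketch of the proposed wrapper (hypothetical class, not in the codebase):

{code:java}
public final class MinMaxResource {
  private final Resource min;
  private final Resource max;

  public MinMaxResource(Resource min, Resource max) {
    this.min = min;
    this.max = max;
  }

  public Resource getMin() { return min; }
  public Resource getMax() { return max; }
}

// QueueResourceQuotas could then expose a single setter, e.g.
// setConfiguredMinMaxResource(String label, MinMaxResource quota), and the
// validator methods could take one MinMaxResource parameter instead of two.
{code}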






[jira] [Commented] (YARN-10906) AbstractCSQueue: Create QueueConfig object for generic queue-specific fields

2022-01-04 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468419#comment-17468419
 ] 

Andras Gyori commented on YARN-10906:
-

[~snemeth] I think this Jira is obsolete, as those two fixes have addressed the 
majority of the description.

> AbstractCSQueue: Create QueueConfig object for generic queue-specific fields
> 
>
> Key: YARN-10906
> URL: https://issues.apache.org/jira/browse/YARN-10906
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>
> This is about config fields in AbstractCSQueue.
> Document if a config is only coming from the Configuration object or being 
> altered or used for other purposes.
> Also, restrict the visibility and the surface of modification from subclasses 
> as much as we can.






[jira] [Commented] (YARN-10565) Refactor CS queue initialization to simplify weight mode calculation

2022-01-03 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468411#comment-17468411
 ] 

Andras Gyori commented on YARN-10565:
-

[~bteke] Is it still a viable fix? I think its description is obsolete. 

> Refactor CS queue initialization to simplify weight mode calculation
> 
>
> Key: YARN-10565
> URL: https://issues.apache.org/jira/browse/YARN-10565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10565.001.patch, YARN-10565.002.patch
>
>
> In YARN-10504 weight mode support was introduced to CS. This jira is a 
> followup to simplify and restructure the initialization, so that the weight 
> calculation/absolute/percentage mode is easier to understand and modify.
> To be refactored:
> * In ParentQueue.java#1099 the error message should be more specific, instead 
> of the {{LOG.error("Fatal issue found: e", e);}} (see the sketch after this 
> list)
> * AutoCreatedLeafQueue.clearConfigurableFields should clear NORMALIZED_WEIGHT 
> just to be on the safe side
> * Uncomment the commented assertions in 
> TestCapacitySchedulerAutoCreatedQueueBase.validateEffectiveMinResource
> * Check whether the assertion modification in TestRMWebServices is absolutely 
> necessary or could be hiding a bug.
> * Same for TestRMWebServicesForCSWithPartitions.java
> Additional information:
> The original flow was modified to allow the dynamic weight-capacity 
> calculation. 
> This resulted in a new flow, which is now harder to understand.
> With a cleanup it could be made simpler, the duplicate calculations could be 
> avoided. 
> The changed functionality should either be explained (if deemed correct) or 
> fixed (see YARN-10590).
> Investigate how the CS reinit works, it could contain some possibly redundant 
> initialization code fragments.
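
For the first bullet: the quoted statement logs the literal string "e" instead 
of the exception message. A more specific form (a sketch, assuming the SLF4J 
logger used elsewhere in the scheduler) could be:

{code:java}
LOG.error("Fatal issue during queue initialization of {}: {}",
    getQueuePath(), e.getMessage(), e); // trailing throwable keeps the stack trace
{code}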






[jira] [Commented] (YARN-10925) Simplify AbstractCSQueue#setupQueueConfigs

2022-01-03 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468409#comment-17468409
 ] 

Andras Gyori commented on YARN-10925:
-

[~bteke] [~snemeth] I find setupQueueConfigs already in a good enough state. 
What is your opinion about it?

> Simplify AbstractCSQueue#setupQueueConfigs
> --
>
> Key: YARN-10925
> URL: https://issues.apache.org/jira/browse/YARN-10925
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Benjamin Teke
>Priority: Minor
>







[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-20 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463003#comment-17463003
 ] 

Andras Gyori commented on YARN-10178:
-

Thank you [~epayne] for the thorough review, this is indeed convincing. Shall I 
ask someone to check it out and commit this?

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && 

[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2021-12-19 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462421#comment-17462421
 ] 

Andras Gyori commented on YARN-8737:


This issue will be completely fixed by YARN-10178.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API, so in the RM 
> the parent queue was refreshing its child queues via ParentQueue#reinitialize; 
> meanwhile, the async-schedule threads were sorting the child queues in 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may occur 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add a read lock to 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, since 
> the write lock is already held when updating child queues in 
> ParentQueue#reinitialize.
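
A sketch of the proposed locking (assuming ParentQueue's existing read/write 
lock pair; the method shape and the comparator accessor are illustrative):

{code:java}
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(String partition) {
  readLock.lock();
  try {
    // Holding the read lock blocks reinitialize(), which takes the write
    // lock, so the fields TimSort compares cannot change mid-sort.
    List<CSQueue> snapshot = new ArrayList<>(childQueues);
    Collections.sort(snapshot, queueOrderingPolicy.getComparator()); // illustrative accessor
    return snapshot.iterator();
  } finally {
    readLock.unlock();
  }
}
{code}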






[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-17 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461398#comment-17461398
 ] 

Andras Gyori edited comment on YARN-10178 at 12/17/21, 2:48 PM:


[~epayne] Added a modified patch for branch-2.10. IntelliJ gave a warning that 
this branch is using Java 7, thus using Stream API is forbidden.


was (Author: gandras):
[~epayne] Added a modified patch for branch-2.10, because IntelliJ warned that 
this branch is using Java 7, thus using Stream API is forbidden.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = 

[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-17 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10178:

Attachment: YARN-10178.branch-2.10.001.patch

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(cluster, request, updatePending)
>   && 

[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-17 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10178:

Attachment: (was: YARN-10178.branch-2.1.0.001.patch)

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(cluster, request, updatePending)
>   

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-17 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461398#comment-17461398
 ] 

Andras Gyori commented on YARN-10178:
-

[~epayne] Added a modified patch for branch-2.10, because IntelliJ warned that 
this branch is using Java 7, thus using Stream API is forbidden.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.1.0.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app 

[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract

2021-12-17 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10178:

Attachment: YARN-10178.branch-2.1.0.001.patch

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> --
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.1.0.001.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort 
> requires the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z implies x > z
> 3. x == y implies sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the sorted elements do not satisfy this contract, TimSort throws a 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares queues by these resource usage fields:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler's global scheduling mode, the AsyncScheduleThread 
> uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment, and adds it to the backlog via 
> submitResourceCommitRequest. ResourceCommitterService will then tryCommit this 
> CSAssignment; looking at the tryCommit function, it updates the queue resource 
> usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(cluster, request, updatePending)
>   && 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-15 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460183#comment-17460183
 ] 

Andras Gyori commented on YARN-10178:
-

[~epayne] Uploaded a patch with the simplified logic of snapshot-based sorting, 
excluding the old logic and the unnecessary parts. As no objection was raised 
regarding the Stream API, the latest patch builds on it.
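
For reference, the heart of the snapshot-based sorting reads roughly like the 
sketch below (simplified, not the patch verbatim; QueueSnapshot is assumed to 
wrap the queue and copy its float capacities up front):
{code:java}
// Simplified sketch of snapshot-based sorting with the Stream API
// (illustrative; the committed patch may differ in names and details).
List<CSQueue> sortedQueues = queues.stream()
    .map(QueueSnapshot::new)               // capture capacities once, as floats
    .sorted(new PriorityQueueComparator()) // compare the immutable snapshots
    .map(snapshot -> snapshot.queue)       // unwrap back to the live queues
    .collect(Collectors.toList());
{code}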

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8 Arrays.sort uses the TimSort algorithm by default, and TimSort has a few 
> requirements for the comparator:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z --> x > z
> 3. x == y --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler uses these queue resource usage values to 
> compare:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler AsyncThread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment struct, and uses the 
> submitResourceCommitRequest function to add the CSAssignment to the backlog.
> ResourceCommitterService will then tryCommit this CSAssignment; looking at the 
> tryCommit function, we can see that it updates the queue resource usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal 

[jira] [Updated] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-15 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10178:

Attachment: YARN-10178.006.patch

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8 Arrays.sort uses the TimSort algorithm by default, and TimSort has a few 
> requirements for the comparator:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z --> x > z
> 3. x == y --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler uses these queue resource usage values to 
> compare:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler AsyncThread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment struct, and uses the 
> submitResourceCommitRequest function to add the CSAssignment to the backlog.
> ResourceCommitterService will then tryCommit this CSAssignment; looking at the 
> tryCommit function, we can see that it updates the queue resource usage:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>   (ResourceCommitRequest) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>   if (app.accept(cluster, request, updatePending)
>   && app.apply(cluster, request, updatePending)) { // 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456178#comment-17456178
 ] 

Andras Gyori commented on YARN-10178:
-

Thanks [~epayne] for the details. I think the root cause you are describing is 
in place. It is probably transitivity that is violated (namely, if q1 > q2 and 
q2 > q3 then q1 > q3 must hold, but by the time the sort reaches the q1, q3 
comparison the queues may have already changed, breaking the TimSort 
requirements), though I am not entirely sure about that.

All in all, the snapshot idea seems to be the correct one. As for 
{noformat}
 I read online that even the stream method of List is not a deep copy. Is that 
true? If we are only making a reference of the queue list, then the resource 
usages of each queue can change and cause the sorted list to be wrong during 
sorting.{noformat}
I believe it is not a problem, as we are not making a copy of the queue list 
but creating new objects out of the queues, and we only take floats out of 
them, which are value types. However, configuredMinResource is indeed a mutable 
reference, so we might need to clone it with Resources.clone() (I think that is 
the standard convention).
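
To see how a comparator over values that change mid-sort blows up, here is a 
self-contained toy reproduction (plain Java, no scheduler code; the random 
comparator stands in for concurrently mutating queue usage):
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ContractViolationDemo {
  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      data.add(i);
    }
    // The comparator's answer changes between calls, like live queue usage
    // does, so TimSort's internal consistency checks sporadically fail.
    Comparator<Integer> unstable =
        (a, b) -> ThreadLocalRandom.current().nextInt(-1, 2);
    // Typically throws java.lang.IllegalArgumentException:
    // "Comparison method violates its general contract!"
    data.sort(unstable);
  }
}
{code}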

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8 Arrays.sort uses the TimSort algorithm by default, and TimSort has a few 
> requirements for the comparator:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z --> x > z
> 3. x == y --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler uses these queue resource usage values to 
> compare:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the global scheduler AsyncThread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, and constructs 

[jira] [Comment Edited] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343
 ] 

Andras Gyori edited comment on YARN-10965 at 12/8/21, 3:56 PM:
---

As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look at this and maybe at the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] [~adam.antal] 


was (Author: gandras):
As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look at this and maybe at the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] 

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in YARN-11000).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343
 ] 

Andras Gyori commented on YARN-10965:
-

As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look at this and maybe at the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] 

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in YARN-11000).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455323#comment-17455323
 ] 

Andras Gyori commented on YARN-10178:
-

We have faced the same issue in a production cluster recently. I agree with 
[~epayne] that this should be resolved as soon as possible. My feedback on the 
patch:
 * As this is a subtle concurrency issue, I have not been able to reproduce it 
yet, but I was wondering whether we could avoid creating the snapshot 
altogether by modifying the original comparator to acquire the necessary 
values immediately, thus hopefully eliminating the possibility of violating the 
sort's requirements. This would look like the following:

{code:java}
float q1AbsCapacity = q1.getQueueCapacities().getAbsoluteCapacity(p);
float q2AbsCapacity = q2.getQueueCapacities().getAbsoluteCapacity(p);
float q1AbsUsedCapacity = q1.getQueueCapacities().getAbsoluteUsedCapacity(p);
float q2AbsUsedCapacity = q2.getQueueCapacities().getAbsoluteUsedCapacity(p); 
float q1UsedCapacity = q1.getQueueCapacities().getUsedCapacity(p);
float q2UsedCapacity = q2.getQueueCapacities().getUsedCapacity(p); 
...
{code}
 

 * We should not use the Stream API because of older branches. I suggest 
rewriting getAssignmentIterator:
{code:java}
@Override
public Iterator<CSQueue> getAssignmentIterator(String partition) {
  // Since partitionToLookAt is a thread-local variable, and we copy and sort
  // the queues every time, this is safe in a multi-threaded environment.
  PriorityUtilizationQueueOrderingPolicy.partitionToLookAt.set(partition);

  // Sort the snapshots instead of the queues directly, due to race conditions.
  // See YARN-10178 for more information.
  List<QueueSnapshot> queueSnapshots = new ArrayList<>();
  for (CSQueue queue : queues) {
    queueSnapshots.add(new QueueSnapshot(queue));
  }
  queueSnapshots.sort(new PriorityQueueComparator());

  List<CSQueue> sortedQueues = new ArrayList<>();
  for (QueueSnapshot queueSnapshot : queueSnapshots) {
    sortedQueues.add(queueSnapshot.queue);
  }

  return sortedQueues.iterator();
}
{code}

 * We do not need to keep the old logic.
 * Measuring performance is a delicate procedure. Including it in a unit test 
is incredibly volatile (on my local machine I have not been able to pass the 
test, for example), especially when naive time measurement is involved. Not 
sure if we can easily reproduce it, but I think in this case no test is better 
than a potentially intermittent test.
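
For completeness, the QueueSnapshot referenced above could be as simple as the 
following (field and accessor names are a guess at the patch's shape, not the 
committed code):
{code:java}
// Sketch only: snapshots the float capacities by value and defensively
// clones the one mutable Resource reference, as discussed in this thread.
private static class QueueSnapshot {
  final CSQueue queue;
  final float absoluteUsedCapacity;
  final float usedCapacity;
  final float absoluteCapacity;
  final Resource configuredMinResource;

  QueueSnapshot(CSQueue queue) {
    String partition = partitionToLookAt.get();
    this.queue = queue;
    // floats are copied by value, so later mutations cannot affect the sort
    this.absoluteUsedCapacity =
        queue.getQueueCapacities().getAbsoluteUsedCapacity(partition);
    this.usedCapacity = queue.getQueueCapacities().getUsedCapacity(partition);
    this.absoluteCapacity =
        queue.getQueueCapacities().getAbsoluteCapacity(partition);
    // Resource is a mutable reference type, so clone it defensively
    this.configuredMinResource = Resources.clone(
        queue.getQueueResourceQuotas().getConfiguredMinResource(partition));
  }
}
{code}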

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> 

[jira] [Reopened] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reopened YARN-11020:
-

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Attachment: (was: YARN-11020-branch-3.3.001.patch)

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455174#comment-17455174
 ] 

Andras Gyori commented on YARN-11020:
-

The container log fetching is missing from branch-3.2, so only backported it to 
branch-3.3.

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Attachment: YARN-11020-branch-3.3.001.patch

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10888:

Attachment: (was: capacity_scheduler_queue_capacity.html)

> [Umbrella] New capacity modes for CS
> 
>
> Key: YARN-10888
> URL: https://issues.apache.org/jira/browse/YARN-10888
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: capacity_scheduler_queue_capacity.pdf
>
>
> *Investigate how resource allocation configuration could be more consistent 
> in CapacityScheduler*
> It would be nice if, everywhere a capacity can be defined, it could be defined 
> the same way:
>  * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU)
>  * With percentages
>  ** Percentage of all resources (e.g. 10% of all memory, vcores, GPU)
>  ** Percentage per resource type (e.g. 10% memory, 25% vcores, 50% GPU)
>  * Allow mixing different modes under one hierarchy, but not under the same 
> parent queue.
> We need to determine all configuration options where capacities can be 
> defined, and see whether it is possible to extend the configuration, and 
> whether it makes sense in each case.
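
To make the three modes concrete, they might be expressed along these lines 
(the capacity vector syntax shown is illustrative only, pending the attached 
design doc):
{noformat}
yarn.scheduler.capacity.root.a.capacity = [memory=4096,vcores=8]    (fixed amounts)
yarn.scheduler.capacity.root.b.capacity = 25                        (percentage of all resources)
yarn.scheduler.capacity.root.c.capacity = [memory=10%,vcores=25%]   (percentage per resource type)
{noformat}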



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2021-12-07 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455008#comment-17455008
 ] 

Andras Gyori commented on YARN-11016:
-

[~snemeth] Weight mode was introduced in 3.4, therefore we do not need this fix 
on earlier branches.

> Queue weight is incorrectly reset to zero
> -
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which 
> could cause problems as in the following scenario:
> 1. Queues are initialized.
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node 
> labels are inherited, its children, for example 'child', have the 'test' label 
> as an accessible node label.
> 3. In LeafQueue#updateClusterResource, we call 
> LeafQueue#activateApplications, which then calls 
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see 
> getNodeLabelsForQueue). 
> In this case, the labels are the accessible node labels (the inherited 
> 'test'). 
> During this event, the ResourceUsage object is updated for the label 'test', 
> thus extending its nodeLabelsSet with 'test'.
> 4. In a following updateClusterResource call, for example an addNode event, 
> the 'test' label is now present in ResourceUsage even though it was never 
> explicitly configured, and we call CSQueueUtils#updateQueueStatistics, which 
> takes the union of the node labels from QueueCapacities and ResourceUsage 
> (this union is now the empty default label AND 'test') and updates 
> QueueCapacities with the label 'test'. 
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (like an update from the mutation API), 
> CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
> QueueCapacities values to zero (even weight, which is wrong in my opinion) 
> and loads the values again from the config. 
> The problem here is that the values are reset for all node labels in 
> QueueCapacities (even for 'test'), but we only load the values for the 
> configured node labels (which we did not set, so it defaults to the empty 
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, 
> and that is why the update fails. 
> This even explains why validation passes: the validation endpoint 
> instantiates a brand new CapacityScheduler for which this cascade of effects 
> cannot accumulate (as there are no multiple updateClusterResource calls).
> This scenario manifests as an error when updating via mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used 
> mixed of weight mode, percentage and absolute mode, it is not allowed, please 
> double check, details:{noformat}
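
A rough sketch of the union described in step 4 (simplified; not the exact 
CSQueueUtils#updateQueueStatistics code, and the parameter names are 
assumptions):
{code:java}
import java.util.HashSet;
import java.util.Set;

class LabelUnionSketch {
  static Set<String> labelsToUpdate(Set<String> capacityLabels,
      Set<String> usageLabels) {
    // capacityLabels initially holds only the default label; usageLabels
    // already contains 'test' after activateApplications touched it.
    Set<String> union = new HashSet<>(capacityLabels);
    union.addAll(usageLabels);
    // Statistics are then updated for every label in this union, so 'test'
    // silently becomes part of QueueCapacities' nodeLabelsSet as well.
    return union;
  }
}
{code}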



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-11-30 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451027#comment-17451027
 ] 

Andras Gyori commented on YARN-11020:
-

Thanks [~adam.antal] for chiming in. I agree with you that it is a bug on the 
YARN RM services side, as making a distinction between multiple and single 
entries in responses is a really bad practice. However, there is a slight but 
non-zero possibility that someone is already using this endpoint aside from UI2.
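
For anyone consuming the endpoint directly, the defensive handling is 
straightforward; a sketch in plain Jackson (illustrative client code, not part 
of UI2 itself, which is JavaScript):
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

class ContainerLogsInfoClient {
  // Normalises both response shapes into a list of containerLogsInfo entries.
  static List<JsonNode> containerLogsInfo(String json) throws Exception {
    JsonNode info = new ObjectMapper().readTree(json).path("containerLogsInfo");
    List<JsonNode> entries = new ArrayList<>();
    if (info.isArray()) {
      info.forEach(entries::add); // multiple containers: already a list
    } else if (info.isObject()) {
      entries.add(info);          // single (AM) container: a bare object
    }
    return entries;
  }
}
{code}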

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-11-29 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Labels:   (was: ui2)

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-11-29 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Component/s: yarn-ui-v2
 (was: resourcemanager)

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-11-29 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11020:
---

 Summary: [UI2] No container is found for an application attempt 
with a single AM container
 Key: YARN-11020
 URL: https://issues.apache.org/jira/browse/YARN-11020
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Andras Gyori
Assignee: Andras Gyori


In UI2, for an application under the Logs tab, a 'No container data available' 
message is shown if the application attempt only submitted a single container 
(which is the AM container). 

The culprit of the issue is that the response from YARN is not consistent, 
because for a single container it looks like:
{noformat}
{
    "containerLogsInfo": {
        "containerLogInfo": [
            {
                "fileName": "prelaunch.out",
                "fileSize": "100",
                "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
            },
            {
                "fileName": "directory.info",
                "fileSize": "2296",
                "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
            },
            {
                "fileName": "stderr",
                "fileSize": "1722",
                "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
            },
            {
                "fileName": "prelaunch.err",
                "fileSize": "0",
                "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
            },
            {
                "fileName": "stdout",
                "fileSize": "0",
                "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
            },
            {
                "fileName": "syslog",
                "fileSize": "38551",
                "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
            },
            {
                "fileName": "launch_container.sh",
                "fileSize": "5013",
                "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
            }
        ],
        "logAggregationType": "AGGREGATED",
        "containerId": "container_1638174027957_0008_01_01",
        "nodeId": "da175178c179:43977"
    }
}{noformat}
As for applications with multiple containers it looks like:
{noformat}
{
    "containerLogsInfo": [{
        
    }, {  }]
}{noformat}
We cannot change the response of the endpoint due to backward compatibility; 
therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-11-29 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Labels: ui2  (was: )

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: ui2
>
> In UI2, for an application under the Logs tab, a 'No container data available' 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We cannot change the response of the endpoint due to backward compatibility; 
> therefore, we need to make UI2 able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11017) Unify node label access in queues

2021-11-25 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11017:
---

 Summary: Unify node label access in queues
 Key: YARN-11017
 URL: https://issues.apache.org/jira/browse/YARN-11017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Andras Gyori


Currently there are a handful of ways in which queues are able to access node 
labels. A non-exhaustive list of these is:
 # configuredNodeLabels
 # getNodeLabelsForQueue()
 # QueueCapacities#getNodePartitionsSet()
 # ResourceUsage#getNodePartitionsSet()
 # accessibleNodeLabels

It is worth revisiting this, as there is already a bug that was implicitly 
caused by this inconsistency (YARN-11016).
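
One possible direction, purely as an illustration (an assumed interface, not a 
committed design): a single access point that the paths above could delegate to.
{code:java}
import java.util.Set;

// Hypothetical unified accessor; the name and shape are assumptions.
interface QueueNodeLabelAccessor {
  /** The one authoritative set of node labels configured for a queue. */
  Set<String> getConfiguredNodeLabels(String queuePath);
}
{code}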



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11016) Queue weight is incorrectly reset to zero

2021-11-25 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11016:

Description: 
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which 
could cause problems as in the following scenario:
1. Queues are initialized.
2. Parent 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its children, for example 'child', have the 'test' label 
as an accessible node label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). In this case, the labels are the accessible 
node labels (the inherited 'test'). During this event the ResourceUsage object 
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a following updateClusterResource call, for example an addNode event, the 
'test' label is now present in ResourceUsage even though it was never 
explicitly configured, and we call CSQueueUtils#updateQueueStatistics, which 
takes the union of the node labels from QueueCapacities and ResourceUsage (this 
union is now the empty default label AND 'test') and updates QueueCapacities 
with the label 'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as 
well!
5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. The problem here is that the values are 
reset for all node labels in QueueCapacities (even for 'test'), but we only 
load the values for the configured node labels (which we did not set, so it 
defaults to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, 
and that is why the update fails. This even explains why validation passes: 
the validation endpoint instantiates a brand new CapacityScheduler for which 
this cascade of effects cannot accumulate (as there are no multiple 
updateClusterResource calls).

This scenario manifests as an error when updating via mutation API:
{noformat}
Failed to re-init queues : Parent queue 'parent' have children queue used mixed 
of weight mode, percentage and absolute mode, it is not allowed, please double 
check, details:{noformat}

  was:
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which could 
cause problems like in the following scenario:
1. Initializing queues
2. Parent queue 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its child, for example 'child', has the 'test' label as its 
accessible-node-label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). In this case, the labels are the accessible 
node labels (the inherited 'test'). During this event the ResourceUsage object 
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a subsequent updateClusterResource call, for example an addNode event, we 
now have the 'test' label in ResourceUsage even though it was never explicitly 
configured, and we call CSQueueUtils#updateQueueStatistics, which takes the union 
of the node labels from QueueCapacities and ResourceUsage (this union is now the 
empty default label AND 'test') and updates QueueCapacities with the label 
'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. The problem here is that values are 
reset for all node labels in QueueCapacities (even for 'test'), but we only load 
the values for the configured node labels (which we did not set, so it defaults 
to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and 
that is why the update fails. It even explains why validation passes: the 
validation endpoint instantiates a brand new CapacityScheduler, for which this 
cascade of effects cannot accumulate (as there are no repeated 
updateClusterResource calls).

This scenario manifests as an error when updating via the mutation API:
Failed to re-init queues : Parent queue 'parent' have children queue used mixed 
of  weight mode, percentage and absolute mode, it is not allowed, please double 
check, details:


> Queue weight is incorrectly reset to zero
> -
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop 

[jira] [Updated] (YARN-11016) Queue weight is incorrectly reset to zero

2021-11-25 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11016:

Description: 
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which could 
cause problems like in the following scenario:
1. Initializing queues
2. Parent queue 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its child, for example 'child', has the 'test' label as its 
accessible-node-label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). In this case, the labels are the accessible 
node labels (the inherited 'test'). During this event the ResourceUsage object 
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a subsequent updateClusterResource call, for example an addNode event, we 
now have the 'test' label in ResourceUsage even though it was never explicitly 
configured, and we call CSQueueUtils#updateQueueStatistics, which takes the union 
of the node labels from QueueCapacities and ResourceUsage (this union is now the 
empty default label AND 'test') and updates QueueCapacities with the label 
'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. The problem here is that values are 
reset for all node labels in QueueCapacities (even for 'test'), but we only load 
the values for the configured node labels (which we did not set, so it defaults 
to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and 
that is why the update fails. It even explains why validation passes: the 
validation endpoint instantiates a brand new CapacityScheduler, for which this 
cascade of effects cannot accumulate (as there are no repeated 
updateClusterResource calls).

This scenario manifests as an error when updating via the mutation API:
Failed to re-init queues : Parent queue 'parent' have children queue used mixed 
of  weight mode, percentage and absolute mode, it is not allowed, please double 
check, details:

  was:
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which could 
cause problems like in the following scenario:
1. Initializing queues
2. Parent queue 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its child, for example 'child', has the 'test' label as its 
accessible-node-label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). In this case, the labels are the accessible 
node labels (the inherited 'test'). During this event the ResourceUsage object 
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a subsequent updateClusterResource call, for example an addNode event, we 
now have the 'test' label in ResourceUsage even though it was never explicitly 
configured, and we call CSQueueUtils#updateQueueStatistics, which takes the union 
of the node labels from QueueCapacities and ResourceUsage (this union is now the 
empty default label AND 'test') and updates QueueCapacities with the label 
'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. The problem here is that values are 
reset for all node labels in QueueCapacities (even for 'test'), but we only load 
the values for the configured node labels (which we did not set, so it defaults 
to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and 
that is why the update fails. It even explains why validation passes: the 
validation endpoint instantiates a brand new CapacityScheduler, for which this 
cascade of effects cannot accumulate (as there are no repeated 
updateClusterResource calls).


> Queue weight is incorrectly reset to zero
> -
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which could 
> cause 

[jira] [Created] (YARN-11016) Queue weight is incorrectly reset to zero

2021-11-25 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11016:
---

 Summary: Queue weight is incorrectly reset to zero
 Key: YARN-11016
 URL: https://issues.apache.org/jira/browse/YARN-11016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which could 
cause problems like in the following scenario:
1. Initializing queues
2. Parent queue 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its child, for example 'child', has the 'test' label as its 
accessible-node-label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). In this case, the labels are the accessible 
node labels (the inherited 'test'). During this event the ResourceUsage object 
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a subsequent updateClusterResource call, for example an addNode event, we 
now have the 'test' label in ResourceUsage even though it was never explicitly 
configured, and we call CSQueueUtils#updateQueueStatistics, which takes the union 
of the node labels from QueueCapacities and ResourceUsage (this union is now the 
empty default label AND 'test') and updates QueueCapacities with the label 
'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromConf is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. The problem here is that values are 
reset for all node labels in QueueCapacities (even for 'test'), but we only load 
the values for the configured node labels (which we did not set, so it defaults 
to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and 
that is why the update fails. It even explains why validation passes: the 
validation endpoint instantiates a brand new CapacityScheduler, for which this 
cascade of effects cannot accumulate (as there are no repeated 
updateClusterResource calls).
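
The reset/reload asymmetry in step 5 can be reproduced with a self-contained 
sketch; the class below is a simplified stand-in for QueueCapacities, not the 
actual API:
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Simplified model of QueueCapacities: a map of node label -> weight.
class CapacitiesSketch {
  private final Map<String, Float> weight = new HashMap<>();

  void setWeight(String label, float w) { weight.put(label, w); }
  float getWeight(String label) { return weight.getOrDefault(label, 0f); }

  // step 5a: the reset clears values for ALL known labels, including 'test'
  void clearConfigurableFields() {
    for (String label : new HashSet<>(weight.keySet())) {
      weight.put(label, 0f);
    }
  }

  public static void main(String[] args) {
    CapacitiesSketch caps = new CapacitiesSketch();
    caps.setWeight("", 1f);       // configured empty default label
    caps.setWeight("test", 1f);   // sneaked in via the ResourceUsage union (step 4)

    caps.clearConfigurableFields();  // step 5a: everything zeroed
    caps.setWeight("", 1f);          // step 5b: reload only the configured labels

    // 'test' stays at weight=0, which later surfaces as the mixed-modes error
    System.out.println("weight('') = " + caps.getWeight(""));
    System.out.println("weight('test') = " + caps.getWeight("test"));
  }
}
{code}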



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11013) Provide a public wrapper of Configuration#substituteVars

2021-11-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11013:

Summary: Provide a public wrapper of Configuration#substituteVars  (was: 
Avoid breaking changes in Configuration)

> Provide a public wrapper of Configuration#substituteVars
> 
>
> Key: YARN-11013
> URL: https://issues.apache.org/jira/browse/YARN-11013
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> YARN-10838 and YARN-10911 introduced an unfortunate change in Configuration 
> that could potentially be backward incompatible (visibility of 
> substituteVars). Since it is easily circumvented, my proposal is to avoid 
> breaking changes if possible.
> One issue found so far is that Oozie defines substituteVars as a private method.
> https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/XConfiguration.java#L186



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11013) Avoid breaking changes in Configuration

2021-11-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11013:

Description: 
YARN-10838 and YARN-10911 introduced an unfortunate change in Configuration 
that could potentially be backward incompatible (visibility of substituteVars). 
Since it is easily circumvented, my proposal is to avoid breaking changes if 
possible.

One issue found so far is that Oozie defines substituteVars as a private method.
https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/XConfiguration.java#L186

  was:YARN-10838 and YARN-10911 introduced an unfortunate change in 
Configuration that could potentially be backward incompatible (visibility of 
substituteVars). Since it is easily circumvented, my proposal is to avoid 
breaking changes if possible.


> Avoid breaking changes in Configuration
> ---
>
> Key: YARN-11013
> URL: https://issues.apache.org/jira/browse/YARN-11013
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> YARN-10838 and YARN-10911 introduced an unfortunate change in Configuration 
> that could potentially be backward incompatible (visibility of 
> substituteVars). Since it is easily circumvented, my proposal is to avoid 
> breaking changes if possible.
> One issue found so far is that Oozie defines substituteVars as a private method.
> https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/XConfiguration.java#L186



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11013) Avoid breaking changes in Configuration

2021-11-22 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11013:
---

 Summary: Avoid breaking changes in Configuration
 Key: YARN-11013
 URL: https://issues.apache.org/jira/browse/YARN-11013
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andras Gyori
Assignee: Andras Gyori


YARN-10838 and YARN-10911 introduced an unfortunate change in Configuration 
that could potentially be backward incompatible (visibility of substituteVars). 
Since it is easily circumvented, my proposal is to avoid breaking changes if 
possible.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-11-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10965:

Description: With the introduction of YARN-10930 it is possible to unify 
queue resource calculation. In order to narrow down the scope of this patch, 
the base system is implemented here, without refactoring the existing resource 
calculation in updateClusterResource (which will be done in YARN-11000).  (was: 
With the introduction of YARN-10930 it is possible to unify queue resource 
calculation. In order to narrow down the scope of this patch, the base system 
is implemented here, without refactoring the existing resource calculation in 
updateClusterResource (which will be done in a different jira).)

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in YARN-11000).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-11-21 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10965:

Description: With the introduction of YARN-10930 it is possible to unify 
queue resource calculation. In order to narrow down the scope of this patch, 
the base system is implemented here, without refactoring the existing resource 
calculation in updateClusterResource (which will be done in a different jira).  
(was: With the introduction of YARN-10930 it is possible to unify queue 
resource calculation. In order to narrow down the scope of this patch, only the 
PERCENTAGE capacity type and the base system is implemented here, without 
refactoring the existing resource calculation in updateClusterResource (which 
will be done in a different jira).)

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in a different jira).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-11-16 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10965:

Summary: Centralize queue resource calculation based on CapacityVectors  
(was: Introduce enhanced queue calculation)

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, only the 
> PERCENTAGE capacity type and the base system is implemented here, without 
> refactoring the existing resource calculation in updateClusterResource (which 
> will be done in a different jira).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS

2021-11-16 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10888:

Attachment: capacity_scheduler_queue_capacity.pdf

> [Umbrella] New capacity modes for CS
> 
>
> Key: YARN-10888
> URL: https://issues.apache.org/jira/browse/YARN-10888
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: capacity_scheduler_queue_capacity.html, 
> capacity_scheduler_queue_capacity.pdf
>
>
> *Investigate how resource allocation configuration could be more consistent 
> in CapacityScheduler*
> It would be nice if, everywhere a capacity can be defined, it could be 
> defined the same way:
>  * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU)
>  * With percentages
>  ** Percentage of all resources (e.g. 10% of all memory, vcores, GPU)
>  ** Percentage per resource type (e.g. 10% memory, 25% vcores, 50% GPU)
>  * Allow mixing different modes under one hierarchy but not under the same 
> parent queue.
> We need to determine all configuration options where capacities can be 
> defined, and see whether it is possible to extend the configuration and 
> whether it makes sense in each case.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11000) Replace queue resource calculation logic in updateClusterResource

2021-11-04 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11000:

Summary: Replace queue resource calculation logic in updateClusterResource  
(was: Replace queue calculation logic in updateClusterResource)

> Replace queue resource calculation logic in updateClusterResource
> -
>
> Key: YARN-11000
> URL: https://issues.apache.org/jira/browse/YARN-11000
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> YARN-10965 introduces a brand new queue calculation system. In order to 
> simplify the review process, this issue replaces the current logic with the 
> newly introduced one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11000) Replace queue calculation logic in updateClusterResource

2021-11-04 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11000:
---

 Summary: Replace queue calculation logic in updateClusterResource
 Key: YARN-11000
 URL: https://issues.apache.org/jira/browse/YARN-11000
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


YARN-10965 introduces a brand new queue calculation system. In order to 
simplify the review process, this issue replaces the current logic with the 
newly introduced one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10996) Fix race condition of User object acquisitions

2021-11-02 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437181#comment-17437181
 ] 

Andras Gyori commented on YARN-10996:
-

It is important to check each location individually, as laboriously replacing 
getUser with getUserAndAddIfAbsent could introduce a memory leak, because:
 * users are only removed when an application attempt belonging to the user is 
removed and the user has no more applications left
 * if we reinsert a user without an application, that user will never be 
removed again, thus introducing a leak (see the sketch below)
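
A self-contained sketch of the leak risk; User, getUserAndAddIfAbsent and 
removeIfNoApps below are simplified stand-ins, not the actual YARN API:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class UsersSketch {
  static final class User { int activeApps; }

  private final Map<String, User> users = new ConcurrentHashMap<>();

  // safe for code paths that must not resurrect already-removed users
  User getUser(String name) {
    return users.get(name);   // may return null; callers must handle it
  }

  // only valid on submission paths, where an application will follow
  User getUserAndAddIfAbsent(String name) {
    return users.computeIfAbsent(name, k -> new User());
  }

  // a user with zero applications is evicted; if a lookup path re-inserted
  // it via getUserAndAddIfAbsent, the entry would live forever (the leak)
  void removeIfNoApps(String name) {
    users.computeIfPresent(name, (k, u) -> u.activeApps == 0 ? null : u);
  }
}
{code}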

> Fix race condition of User object acquisitions
> --
>
> Key: YARN-10996
> URL: https://issues.apache.org/jira/browse/YARN-10996
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> UsersManager is highly susceptible to removing users that are queried later. 
> This race condition produces an NPE similar to YARN-10934. A non-exhaustive 
> list of these locations is:
> - LeafQueue.getTotalPendingResourcesConsideringUserLimit
> - UsersManager.computeUserLimit
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10997) Revisit allocation and reservation logging

2021-11-02 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10997:
---

 Summary: Revisit allocation and reservation logging
 Key: YARN-10997
 URL: https://issues.apache.org/jira/browse/YARN-10997
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Andras Gyori
Assignee: Andras Gyori


Accepted allocation proposal and reserved container logs are two exceedingly 
frequent events. Numerous users reported that these log entries quickly filled 
the logs on a busy cluster and were perceived only as noise.

It would be worthwhile to reduce the log level of these entries to DEBUG.

Examples:
{noformat}
2021-10-30 02:28:57,409 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Allocation proposal accepted
2021-10-30 02:28:57,439 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 Reserved container=container_1635478503131_0069_01_78, on node=host: 
node:8041 #containers=1 available= used= with resource=
{noformat}
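
A minimal sketch of the proposed demotion, assuming the SLF4J-style logger used 
in the scheduler classes above:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AllocationLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(AllocationLoggingSketch.class);

  void onProposalAccepted() {
    // before: LOG.info("Allocation proposal accepted");
    // after: demoted to DEBUG so busy clusters are not flooded at INFO
    LOG.debug("Allocation proposal accepted");
  }
}
{code}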



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10996) Fix race condition of User object acquisitions

2021-10-29 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10996:
---

 Summary: Fix race condition of User object acquisitions
 Key: YARN-10996
 URL: https://issues.apache.org/jira/browse/YARN-10996
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andras Gyori
Assignee: Andras Gyori


UsersManager is highly susceptible to removing users that are queried later. 
This race condition produces an NPE similar to YARN-10934. A non-exhaustive list 
of these locations is:

- LeafQueue.getTotalPendingResourcesConsideringUserLimit

- UsersManager.computeUserLimit

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10632) Make auto queue creation maximum allowed depth configurable

2021-10-29 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10632:

Summary: Make auto queue creation maximum allowed depth configurable  (was: 
Make auto queue creation depth configurable)

> Make auto queue creation maximum allowed depth configurable
> ---
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the max depth allowed is fixed to 2, but I think this should be 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10993) Move domain specific logic out of CapacitySchedulerConfig

2021-10-29 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10993:
---

 Summary: Move domain specific logic out of CapacitySchedulerConfig
 Key: YARN-10993
 URL: https://issues.apache.org/jira/browse/YARN-10993
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Andras Gyori


CapacitySchedulerConfig should contain only getters/setters and parsing logic. 
Everything else should be moved outside of the class to its appropriate 
location.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10984) Add tests to CapacitySchedulerConfiguration

2021-10-28 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435450#comment-17435450
 ] 

Andras Gyori commented on YARN-10984:
-

YARN-10985 added a lot of tests for ACLs. These scenarios could be extended 
with a test case using an ACL that contains a double space, e.g. a _ _ b _.

> Add tests to CapacitySchedulerConfiguration
> ---
>
> Key: YARN-10984
> URL: https://issues.apache.org/jira/browse/YARN-10984
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10632) Make auto queue creation depth configurable

2021-10-28 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10632:

Summary: Make auto queue creation depth configurable  (was: Make maximum 
depth allowed to be configurable)

> Make auto queue creation depth configurable
> ---
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the max depth allowed is fixed to 2, but I think this should be 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10632) Make maximum depth allowed to be configurable

2021-10-25 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433789#comment-17433789
 ] 

Andras Gyori commented on YARN-10632:
-

Thanks [~zhuqi]!

> Make maximum depth allowed to be configurable
> -
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the max depth allowed is fixed to 2, but I think this should be 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10632) Make maximum depth allowed to be configurable

2021-10-25 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10632:
---

Assignee: Andras Gyori  (was: Qi Zhu)

> Make maximum depth allowed to be configurable
> -
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the max depth allowed is fixed to 2, but I think this should be 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10632) Make maximum depth allowed to be configurable

2021-10-22 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432958#comment-17432958
 ] 

Andras Gyori commented on YARN-10632:
-

[~zhuqi] may I assign this issue to myself?

> Make maximum depth allowed to be configurable
> -
>
> Key: YARN-10632
> URL: https://issues.apache.org/jira/browse/YARN-10632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10632.001.patch, YARN-10632.002.patch, 
> YARN-10632.003.patch, YARN-10632.004.patch
>
>
> Now the max depth allowed is fixed to 2, but I think this should be 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10982) Replace all occurrences of queuePath with the new QueuePath class

2021-10-19 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430539#comment-17430539
 ] 

Andras Gyori commented on YARN-10982:
-

My suggestion for this patch:
- Please add the isRoot method to the QueuePath class as well and try to 
change all queuePath.equals("root") and queuePath.equals(ROOT) occurrences to 
this method.
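
A minimal sketch of the suggested helper, assuming QueuePath stores the full 
path in a field (fullPath is a hypothetical name; the real internals may differ):
{code:java}
public class QueuePathSketch {
  private static final String ROOT = "root";

  private final String fullPath;

  public QueuePathSketch(String fullPath) { this.fullPath = fullPath; }

  // replaces scattered queuePath.equals("root") / queuePath.equals(ROOT) checks
  public boolean isRoot() {
    return ROOT.equals(fullPath);
  }
}
{code}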

> Replace all occurrences of queuePath with the new QueuePath class
> 
>
> Key: YARN-10982
> URL: https://issues.apache.org/jira/browse/YARN-10982
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Tibor Kovács
>Priority: Major
>
> The QueuePath class was introduced in YARN-10897; however, it has so far been 
> adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10982) Replace all occurrences of queuePath with the new QueuePath class

2021-10-19 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10982:

Parent: YARN-10889
Issue Type: Sub-task  (was: Improvement)

> Replace all occurrences of queuePath with the new QueuePath class
> 
>
> Key: YARN-10982
> URL: https://issues.apache.org/jira/browse/YARN-10982
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Tibor Kovács
>Priority: Major
>
> The QueuePath class was introduced in YARN-10897; however, it has so far been 
> adopted only in code changes made after that JIRA. We need to adopt it 
> retrospectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10982) Replace all occurrences of queuePath with the new QueuePath class

2021-10-19 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10982:
---

 Summary: Replace all occurrences of queuePath with the new 
QueuePath class
 Key: YARN-10982
 URL: https://issues.apache.org/jira/browse/YARN-10982
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Tibor Kovács


The QueuePath class was introduced in YARN-10897; however, it has so far been 
adopted only in code changes made after that JIRA. We need to adopt it 
retrospectively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10795) Improve Capacity Scheduler reinitialisation performance

2021-10-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori resolved YARN-10795.
-
Resolution: Fixed

> Improve Capacity Scheduler reinitialisation performance
> ---
>
> Key: YARN-10795
> URL: https://issues.apache.org/jira/browse/YARN-10795
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Priority: Major
>
> Mostly due to CapacitySchedulerConfiguration#getPropsWithPrefix or similar 
> methods, the CapacityScheduler#reinit method has a quadratic-complexity part 
> with respect to the number of queues. With 1000+ queues, reinitialisation 
> takes minutes, which is too slow to be viable when used by the mutation API.
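
A self-contained sketch of the quadratic pattern described above; the 
prefix-scan shape mimics Configuration#getPropsWithPrefix but is an assumption, 
not the actual code:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PrefixScanSketch {
  // every call scans ALL properties, like a prefix filter over the whole config
  static Map<String, String> getPropsWithPrefix(Map<String, String> conf,
                                                String prefix) {
    Map<String, String> out = new HashMap<>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        out.put(e.getKey().substring(prefix.length()), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    int queues = 5000;
    for (int i = 0; i < queues; i++) {
      conf.put("yarn.scheduler.capacity.root.q" + i + ".capacity", "1");
    }
    long t0 = System.nanoTime();
    // one scan per queue, each over all P entries -> O(Q * P) ~ O(Q^2),
    // since the property count P itself grows with the queue count Q
    for (int i = 0; i < queues; i++) {
      getPropsWithPrefix(conf, "yarn.scheduler.capacity.root.q" + i + ".");
    }
    System.out.printf("%d quadratic prefix scans took %d ms%n",
        queues, (System.nanoTime() - t0) / 1_000_000);
  }
}
{code}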



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10795) Improve Capacity Scheduler reinitialisation performance

2021-10-06 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424902#comment-17424902
 ] 

Andras Gyori commented on YARN-10795:
-

Closing this epic, as all of its subtasks are done. With these improvements, I 
have been able to reduce a reinit with 5000+ queues from 4 minutes to a few seconds.

> Improve Capacity Scheduler reinitialisation performance
> ---
>
> Key: YARN-10795
> URL: https://issues.apache.org/jira/browse/YARN-10795
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Priority: Major
>
> Mostly due to CapacitySchedulerConfiguration#getPropsWithPrefix or similar 
> methods, the CapacityScheduler#reinit method has a quadratic-complexity part 
> with respect to the number of queues. With 1000+ queues, reinitialisation 
> takes minutes, which is too slow to be viable when used by the mutation API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-30 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422799#comment-17422799
 ] 

Andras Gyori commented on YARN-1115:


[~epayne] Thank you for your explanation! I agree with the UGI approach and 
generally with your arguments! Documentation regarding this feature would 
eliminate the confusion, in my opinion.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.
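
For reference, a minimal sketch of the doAs flow described above, using the 
public UserGroupInformation API (submitJob is a hypothetical stand-in for the 
actual submission call):
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxySubmitSketch {
  public static void main(String[] args) throws Exception {
    // "super" is the logged-in (real) user
    UserGroupInformation realUser = UserGroupInformation.getLoginUser();
    // "joe" becomes the effective user; the real user stays attached to the UGI
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser("joe", realUser);

    proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // today the scheduler checks queue ACLs for "joe" only; this JIRA
      // proposes an optional check against the real user "super" as well
      submitJob("ops");
      return null;
    });
  }

  private static void submitJob(String queue) {
    // hypothetical placeholder for an actual job submission against the queue
  }
}
{code}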



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10954) Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf

2021-09-29 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422159#comment-17422159
 ] 

Andras Gyori commented on YARN-10954:
-

I think this simple fix could be handled in a general cleanup jira.

> Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf
> 
>
> Key: YARN-10954
> URL: https://issues.apache.org/jira/browse/YARN-10954
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10954) Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf

2021-09-29 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10954:
---

Assignee: (was: Andras Gyori)

> Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf
> 
>
> Key: YARN-10954
> URL: https://issues.apache.org/jira/browse/YARN-10954
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10965) Introduce enhanced queue calculation

2021-09-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10965:

Description: With the introduction of YARN-10930 it is possible to unify 
queue resource calculation. In order to narrow down the scope of this patch, 
only the PERCENTAGE capacity type and the base system is implemented here, 
without refactoring the existing resource calculation in updateClusterResource 
(which will be done in a different jira).  (was: With the introduction of 
YARN-10930 it is possible to unify queue resource calculation. In order to 
narrow down the scope of this patch, only the PERCENTAGE capacity type and the 
base system is implemented here.)

> Introduce enhanced queue calculation
> 
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, only the 
> PERCENTAGE capacity type and the base system is implemented here, without 
> refactoring the existing resource calculation in updateClusterResource (which 
> will be done in a different jira).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10965) Introduce enhanced queue calculation

2021-09-23 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10965:
---

 Summary: Introduce enhanced queue calculation
 Key: YARN-10965
 URL: https://issues.apache.org/jira/browse/YARN-10965
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


With the introduction of YARN-10930 it is possible to unify queue resource 
calculation. In order to narrow down the scope of this patch, only the 
PERCENTAGE capacity type and the base system is implemented here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS

2021-09-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10888:

Attachment: capacity_scheduler_queue_capacity.html

> [Umbrella] New capacity modes for CS
> 
>
> Key: YARN-10888
> URL: https://issues.apache.org/jira/browse/YARN-10888
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: capacity_scheduler_queue_capacity.html
>
>
> *Investigate how resource allocation configuration could be more consistent 
> in CapacityScheduler*
> It would be nice if, everywhere a capacity can be defined, it could be 
> defined the same way:
>  * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU)
>  * With percentages
>  ** Percentage of all resources (e.g. 10% of all memory, vcores, GPU)
>  ** Percentage per resource type (e.g. 10% memory, 25% vcores, 50% GPU)
>  * Allow mixing different modes under one hierarchy but not under the same 
> parent queue.
> We need to determine all configuration options where capacities can be 
> defined, and see whether it is possible to extend the configuration and 
> whether it makes sense in each case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10930) Introduce universal configured capacity vector

2021-09-21 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10930:

Summary: Introduce universal configured capacity vector  (was: Introduce 
universal capacity resource vector)

> Introduce universal configured capacity vector
> --
>
> Key: YARN-10930
> URL: https://issues.apache.org/jira/browse/YARN-10930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Attachments: capacity_scheduler_queue_capacity.html
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The proposal is to introduce a capacity resource vector that is universally 
> parsed for every queue. CapacityResourceVector is a way to unite the current 
> capacity modes (weight, percentage, absolute), while maintaining flexibility 
> and extendability.
> CapacityResourceVector is a good fit for the existing capacity configs, for 
> example:
> * percentage mode: root.example.capacity 50 is syntactic sugar for 
> [memory=50%, vcores=50%, ]
> * absolute mode: root.example.capacity [memory=1024, vcores=2] is a natural 
> fit for the vector; there is no need for additional settings
> CapacityResourceVector will be used in a future refactor, to unify the 
> resource calculation and lift the limitations imposed on the queue hierarchy 
> capacity settings (e.g. absolute resources and percentages cannot be used in 
> the same hierarchy, etc.)
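
As a hedged illustration of the two forms above in capacity-scheduler.xml (the 
queue path root.example comes from the examples; the values are arbitrary, and 
the two properties are alternatives, not meant to coexist):
{code:xml}
<!-- percentage mode: "50" is shorthand for the vector [memory=50%, vcores=50%, ...] -->
<property>
  <name>yarn.scheduler.capacity.root.example.capacity</name>
  <value>50</value>
</property>

<!-- absolute mode: the value is already a natural capacity vector -->
<property>
  <name>yarn.scheduler.capacity.root.example.capacity</name>
  <value>[memory=1024, vcores=2]</value>
</property>
{code}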



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10949) Simplify AbstractCSQueue#updateMaxAppRelatedField and find a more meaningful name for this method

2021-09-20 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10949:
---

Assignee: Andras Gyori

> Simplify AbstractCSQueue#updateMaxAppRelatedField and find a more meaningful 
> name for this method
> -
>
> Key: YARN-10949
> URL: https://issues.apache.org/jira/browse/YARN-10949
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10954) Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf

2021-09-20 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10954:
---

Assignee: Andras Gyori

> Remove commented code block from CSQueueUtils#loadCapacitiesByLabelsFromConf
> 
>
> Key: YARN-10954
> URL: https://issues.apache.org/jira/browse/YARN-10954
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10953) Make CapacityScheduler#getOrCreateQueueFromPlacementContext easier to comprehend

2021-09-20 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10953:
---

Assignee: Andras Gyori

> Make CapacityScheduler#getOrCreateQueueFromPlacementContext easier to 
> comprehend
> 
>
> Key: YARN-10953
> URL: https://issues.apache.org/jira/browse/YARN-10953
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>
> 1. Most of the method body is wrapped in an if-statement that checks if the 
> queue is null. We could negate this and return immediately if the queue != 
> null, so we don't need a large if statement.
> 2. Similarly, in that large if body, there's a check for 
> fallbackContext.hasParentQueue(). If it's true, we have yet another 
> large if-body. We should also negate this condition and return immediately if 
> it's false.
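
A hedged before/after sketch of the suggested guard-clause refactor; CSQueue 
and PlacementContext below are simplified stand-ins for the real types:
{code:java}
interface CSQueue {}
interface PlacementContext { boolean hasParentQueue(); }

class GuardClauseSketch {
  // before: the whole body is nested inside two if statements
  CSQueue before(CSQueue queue, PlacementContext ctx) {
    if (queue == null) {
      if (ctx.hasParentQueue()) {
        queue = create(ctx);   // ... large body ...
      }
    }
    return queue;
  }

  // after: negated conditions return early, flattening the nesting
  CSQueue after(CSQueue queue, PlacementContext ctx) {
    if (queue != null) {
      return queue;
    }
    if (!ctx.hasParentQueue()) {
      return null;
    }
    return create(ctx);        // ... large body, now at the top level ...
  }

  private CSQueue create(PlacementContext ctx) {
    return new CSQueue() {};   // placeholder for the actual queue creation
  }
}
{code}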



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-10 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412659#comment-17412659
 ] 

Andras Gyori edited comment on YARN-1115 at 9/10/21, 7:31 AM:
--

Thank you [~epayne] for working on this one. I have checked your latest patch a 
couple of times and I think I understand the flow of the logic. However, it might 
only be me, but I have found the following slightly confusing at first sight:
 * Submitting an app without a proxy user:
 ** user is the real user
 ** realUser is null
 * Submitting an app with a proxy user:
 ** user is the proxy user
 ** realUser is the real user

I might find it a bit more intuitive to define proxyUser instead, thereby 
reverting this part of the logic a bit. Again, this is only a subjective 
preference, so others could find it less intuitive.


was (Author: gandras):
Thank you [~epayne] for working on this one. I have checked you latest patch a 
couple times and I think I understand the flow of the logic. However, it might 
only be me, but I have found the following a slightly confusing at first sight:
 * Submitting an app without a proxy user:
 ** user is the real user
 ** realUser is null
 * Submitting an app with a proxy user:
 ** user is the proxy user
 ** realUser is the real user

I might find it a bit more intuitive to define proxyUser instead, thereby 
reverting this part of the logic a bit. Again, this is only a subjective 
preference, so others could find it less intuitive.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2021-09-09 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412659#comment-17412659
 ] 

Andras Gyori commented on YARN-1115:


Thank you [~epayne] for working on this one. I have checked your latest patch a 
couple of times and I think I understand the flow of the logic. However, it might 
only be me, but I have found the following slightly confusing at first sight:
 * Submitting an app without a proxy user:
 ** user is the real user
 ** realUser is null
 * Submitting an app with a proxy user:
 ** user is the proxy user
 ** realUser is the real user

I might find it a bit more intuitive to define proxyUser instead, thereby 
reverting this part of the logic a bit. Again, this is only a subjective 
preference, so others could find it less intuitive.

> Provide optional means for a scheduler to check real user ACLs
> --
>
> Key: YARN-1115
> URL: https://issues.apache.org/jira/browse/YARN-1115
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler
>Affects Versions: 2.8.5
>Reporter: Eric Payne
>Priority: Major
> Attachments: YARN-1115.001.patch
>
>
> In the framework for secure implementation using UserGroupInformation.doAs 
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html),
>  a trusted superuser can submit jobs on behalf of another user in a secure 
> way. In this framework, the superuser is referred to as the real user and the 
> proxied user is referred to as the effective user.
> Currently when a job is submitted as an effective user, the ACLs for the 
> effective user are checked against the queue on which the job is to be run. 
> Depending on an optional configuration, the scheduler should also check the 
> ACLs of the real user if the configuration to do so is set.
> For example, suppose my superuser name is super, and super is configured to 
> securely proxy as joe. Also suppose there is a Hadoop queue named ops which 
> only allows ACLs for super, not for joe.
> When super proxies to joe in order to submit a job to the ops queue, it will 
> fail because joe, as the effective user, does not have ACLs on the ops queue.
> In many cases this is what you want, in order to protect queues that joe 
> should not be using.
> However, there are times when super may need to proxy to many users, and the 
> client running as super just wants to use the ops queue because the ops queue 
> is already dedicated to the client's purpose, and, to keep the ops queue 
> dedicated to that purpose, super doesn't want to open up ACLs to joe in 
> general on the ops queue. Without this functionality, in this case, the 
> client running as super needs to figure out which queue each user has ACLs 
> opened up for, and then coordinate with other tasks using those queues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10930) Introduce universal capacity resource vector

2021-08-31 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10930:

Attachment: capacity_scheduler_queue_capacity.html

> Introduce universal capacity resource vector
> 
>
> Key: YARN-10930
> URL: https://issues.apache.org/jira/browse/YARN-10930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: capacity_scheduler_queue_capacity.html
>
>
> The proposal is to introduce a capacity resource vector that is universally 
> parsed for every queue. CapacityResourceVector is a way to unite the current 
> capacity modes (weight, percentage, absolute) while maintaining flexibility 
> and extensibility.
> CapacityResourceVector is a good fit for the existing capacity configs, for 
> example:
> * percentage mode: root.example.capacity 50 is syntactic sugar for 
> [memory=50%, vcores=50%, ...]
> * absolute mode: root.example.capacity [memory=1024, vcores=2] is a natural 
> fit for the vector; there is no need for additional settings
> CapacityResourceVector will be used in a future refactor to unify the 
> resource calculation and to lift the limitations imposed on queue hierarchy 
> capacity settings (e.g. absolute resources and percentages currently cannot 
> be mixed in the same hierarchy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10930) Introduce universal capacity resource vector

2021-08-31 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10930:
---

 Summary: Introduce universal capacity resource vector
 Key: YARN-10930
 URL: https://issues.apache.org/jira/browse/YARN-10930
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


The proposal is to introduce a capacity resource vector that is universally 
parsed for every queue. CapacityResourceVector is a way to unite the current 
capacity modes (weight, percentage, absolute) while maintaining flexibility 
and extensibility.
CapacityResourceVector is a good fit for the existing capacity configs, for 
example:
* percentage mode: root.example.capacity 50 is syntactic sugar for 
[memory=50%, vcores=50%, ...]
* absolute mode: root.example.capacity [memory=1024, vcores=2] is a natural fit 
for the vector; there is no need for additional settings
CapacityResourceVector will be used in a future refactor to unify the resource 
calculation and to lift the limitations imposed on queue hierarchy capacity 
settings (e.g. absolute resources and percentages currently cannot be mixed in 
the same hierarchy).
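
To make the proposed syntax concrete, here is a toy parser sketch; the class 
and the weight suffix are assumptions for illustration, not the eventual 
implementation:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class CapacityVectorSketch {

  enum CapacityType { PERCENTAGE, ABSOLUTE, WEIGHT }

  static final class Entry {
    final double value;
    final CapacityType type;

    Entry(double value, CapacityType type) {
      this.value = value;
      this.type = type;
    }

    @Override
    public String toString() {
      return value + "/" + type;
    }
  }

  /** Parses e.g. "[memory=50%, vcores=2]" into per-resource entries. */
  static Map<String, Entry> parse(String vector) {
    Map<String, Entry> result = new LinkedHashMap<>();
    String body = vector.trim();
    body = body.substring(1, body.length() - 1); // strip '[' and ']'
    for (String part : body.split(",")) {
      part = part.trim();
      if (part.isEmpty()) {
        continue; // tolerate a trailing comma
      }
      String[] kv = part.split("=");
      String name = kv[0].trim();
      String raw = kv[1].trim();
      if (raw.endsWith("%")) {
        result.put(name, new Entry(
            Double.parseDouble(raw.substring(0, raw.length() - 1)),
            CapacityType.PERCENTAGE));
      } else if (raw.endsWith("w")) { // assumed suffix for weight mode
        result.put(name, new Entry(
            Double.parseDouble(raw.substring(0, raw.length() - 1)),
            CapacityType.WEIGHT));
      } else {
        result.put(name, new Entry(
            Double.parseDouble(raw), CapacityType.ABSOLUTE));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints: {memory=50.0/PERCENTAGE, vcores=2.0/ABSOLUTE}
    System.out.println(parse("[memory=50%, vcores=2]"));
  }
}
{code}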



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10897) Introduce QueuePath class

2021-08-25 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10897:

Description: 
The same idioms regarding queue path strings are repeated over and over in the 
codebase. Including but not limited to:
* Get parent queue of a queue path
* Split queue path and iterate through it
* Traverse a queue path all the way to root

It also inherently provides some kind of type safety and documentation 
extension to the code (e.g. instead of Map<String, Object>, a 
Map<QueuePath, Object> communicates more clearly what we group items by).

  was:
The same idioms regarding queue path strings are repeated over and over in the 
codebase. Including but not limited to:
* Get parent queue of a queue path
* Split queue path and iterate through it
* Traverse a queue path all the way to root
It also inherently provides some kind of type safety and documentation 
extension to the code (e.g. instead of Map<String, Object>, a 
Map<QueuePath, Object> communicates more clearly what we group items by).


> Introduce QueuePath class
> -
>
> Key: YARN-10897
> URL: https://issues.apache.org/jira/browse/YARN-10897
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> The same idioms regarding queue path strings are repeated over and over in 
> the codebase. Including but not limited to:
> * Get parent queue of a queue path
> * Split queue path and iterate through it
> * Traverse a queue path all the way to root
> It also inherently provides some kind of type safety and documentation 
> extension to the code (e.g. instead of Map<String, Object>, a 
> Map<QueuePath, Object> communicates more clearly what we group items by).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10897) Introduce QueuePath class

2021-08-25 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10897:
---

 Summary: Introduce QueuePath class
 Key: YARN-10897
 URL: https://issues.apache.org/jira/browse/YARN-10897
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, yarn
Reporter: Andras Gyori
Assignee: Andras Gyori


The same idioms regarding queue path strings are repeated over and over in the 
codebase. Including but not limited to:
* Get parent queue of a queue path
* Split queue path and iterate through it
* Traverse a queue path all the way to root
It also inherently provides some kind of type safety and documentation 
extension to the code (e.g. instead of Map<String, Object>, a 
Map<QueuePath, Object> communicates more clearly what we group items by).
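
A minimal sketch of what such a class could look like (illustrative; the 
eventual class in the patch may differ):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public final class QueuePath implements Iterable<String> {
  private static final String DELIMITER = ".";

  private final String path; // e.g. "root.a.b"

  public QueuePath(String path) {
    this.path = path;
  }

  /** Get parent queue of a queue path: "root.a.b" -> "root.a". */
  public QueuePath getParent() {
    int idx = path.lastIndexOf(DELIMITER);
    return idx < 0 ? null : new QueuePath(path.substring(0, idx));
  }

  /** Split queue path and iterate through it. */
  @Override
  public Iterator<String> iterator() {
    return Arrays.asList(path.split("\\" + DELIMITER)).iterator();
  }

  /** Traverse a queue path all the way to root. */
  public List<QueuePath> pathToRoot() {
    List<QueuePath> result = new ArrayList<>();
    for (QueuePath q = this; q != null; q = q.getParent()) {
      result.add(q);
    }
    return result;
  }

  @Override
  public String toString() {
    return path;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof QueuePath && path.equals(((QueuePath) o).path);
  }

  @Override
  public int hashCode() {
    return path.hashCode();
  }
}
{code}

With equals and hashCode in place, it can also serve as the 
Map<QueuePath, Object> key mentioned in the description.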



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10522) Document for Flexible Auto Queue Creation in Capacity Scheduler.

2021-08-23 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403215#comment-17403215
 ] 

Andras Gyori commented on YARN-10522:
-

Thank you [~bteke] for pursuing this issue! I have the following additions to 
the latest patch:
 * Nit: This line is somewhat convoluted. I think it will be easier to 
understand if we just make a note in a new sentence, like: It is important to 
emphasize that dynamic queues created in a flexible fashion only work with 
weights as their capacity.
{noformat}
but the created queues will be and can only be configured with weights as 
capacity.{noformat}

 * Nit: I think "configured" here is redundant; the feature itself is a 
configuration value.
{noformat}
 The Flexible Dynamic Queue Auto-Creation and Management feature allows a 
ParentQueue to be configured
{noformat}

 * Nit: "pre-configured queues under the parent must be configured in the same 
way" sounds better in my opinion.
{noformat}
The auto-created queues will have weights as capacity so the pre-configured 
queues under the parent must be configured to use the same
{noformat}

 * Nit: A parent queue with flexible auto queue creation enabled supports 
configuration of dynamically created leaf and parent queues through template 
parameters.
{noformat}
The parent queue which has the flexible auto queue creation enabled supports 
the configuration dynamically created leaf and parent queues through template 
parameters
{noformat}

* Missing "by":
{noformat}
Specifies a queue property inherited auto-created leaf queues. Specifies a 
queue property inherited auto-created parent queues.
{noformat}

* Maybe an example for templates would be good, because it is not a 
straightforward feature at first glance; something like the sketch below.
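
For instance, something along these lines (the property names reflect my 
reading of the flexible auto queue creation feature and should be 
double-checked against the final documentation):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class TemplateSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    String prefix = "yarn.scheduler.capacity.root.parent";

    // Turn on flexible (v2) auto queue creation under root.parent.
    conf.set(prefix + ".auto-queue-creation-v2.enabled", "true");

    // Every dynamically created queue under root.parent inherits these.
    conf.set(prefix + ".auto-queue-creation-v2.template.capacity", "1w");
    conf.set(prefix + ".auto-queue-creation-v2.template.maximum-applications",
        "100");

    // Variant applied only to dynamically created leaf queues.
    conf.set(prefix + ".auto-queue-creation-v2.leaf-template.ordering-policy",
        "fair");
  }
}
{code}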

> Document for Flexible Auto Queue Creation in Capacity Scheduler.
> 
>
> Key: YARN-10522
> URL: https://issues.apache.org/jira/browse/YARN-10522
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10522.001.patch
>
>
> We should update document to support this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-08-23 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403014#comment-17403014
 ] 

Andras Gyori commented on YARN-10838:
-

Thank you [~bteke] for taking care of this issue! As for the last point, I 
wanted to separate this introduction from the refactor part. Check YARN-10795 
for the related jiras.

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch, 
> YARN-10838.003.patch, YARN-10838.004.patch, YARN-10838.005.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. They must be eliminated in order to 
> improve the performance of reinitialisation. 
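
To illustrate the direction (method and class names here are made up for the 
sketch; only Configuration's iterator and getPropsWithPrefix are real APIs):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

public class PrefixScanSketch {

  /**
   * Collects the properties for several prefixes in a single pass, instead
   * of one full scan per Configuration#getPropsWithPrefix call. Like
   * getPropsWithPrefix, the prefix is stripped from the returned keys.
   */
  static Map<String, Map<String, String>> scanOnce(
      Configuration conf, String... prefixes) {
    Map<String, Map<String, String>> buckets = new HashMap<>();
    for (String prefix : prefixes) {
      buckets.put(prefix, new HashMap<>());
    }
    // Configuration implements Iterable<Map.Entry<String, String>>.
    for (Map.Entry<String, String> entry : conf) {
      for (String prefix : prefixes) {
        if (entry.getKey().startsWith(prefix)) {
          buckets.get(prefix).put(
              entry.getKey().substring(prefix.length()), entry.getValue());
        }
      }
    }
    return buckets;
  }
}
{code}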



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log

2021-07-26 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387376#comment-17387376
 ] 

Andras Gyori commented on YARN-6221:


Thank you [~ximz], I have no more comments on this one. +1 (non-binding).

> Entities missing from ATS when summary log file info got returned to the ATS 
> before the domain log
> --
>
> Key: YARN-6221
> URL: https://issues.apache.org/jira/browse/YARN-6221
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sushmitha Sreenivasan
>Assignee: Xiaomin Zhang
>Priority: Critical
> Attachments: YARN-6221.02.patch, YARN-6221.02.patch, YARN-6221.patch, 
> YARN-6221.patch
>
>
> Events data missing for the following entities:
> curl -k --negotiate -u: 
> http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01
> {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}}
> {code:title=Timeline Server log entry}
> WARN  timeline.TimelineDataManager 
> (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { 
> id: tez_application_1487706062210_0012, type: TEZ_APPLICATION }
> org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the 
> timeline entity { id: tez_application_1487706062210_0012, type: 
> TEZ_APPLICATION } doesn't exist.
> at 
> org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log

2021-07-26 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387330#comment-17387330
 ] 

Andras Gyori commented on YARN-6221:


Thank you [~ximz] for working on this issue! The change looks good to me, with 
the following minor notes on the TestCase:
 # Please use log entries instead of System.out.println
 # Please use fs.delete(path, false), because fs.delete(path) is deprecated; 
see the sketch below
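
A sketch of the suggested replacement (the helper name is mine):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteSketch {
  static void cleanup(FileSystem fs, Path path) throws IOException {
    // The boolean selects recursive deletion; false fails on a non-empty
    // directory instead of silently wiping it.
    fs.delete(path, false);
  }
}
{code}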

> Entities missing from ATS when summary log file info got returned to the ATS 
> before the domain log
> --
>
> Key: YARN-6221
> URL: https://issues.apache.org/jira/browse/YARN-6221
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sushmitha Sreenivasan
>Assignee: Xiaomin Zhang
>Priority: Critical
> Attachments: YARN-6221.patch, YARN-6221.patch
>
>
> Events data missing for the following entities:
> curl -k --negotiate -u: 
> http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01
> {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}}
> {code:title=Timeline Server log entry}
> WARN  timeline.TimelineDataManager 
> (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { 
> id: tez_application_1487706062210_0012, type: TEZ_APPLICATION }
> org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the 
> timeline entity { id: tez_application_1487706062210_0012, type: 
> TEZ_APPLICATION } doesn't exist.
> at 
> org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-07-26 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10838:

Attachment: YARN-10838.004.patch

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch, 
> YARN-10838.003.patch, YARN-10838.004.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. They must be eliminated in order to 
> improve the performance of reinitialisation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-07-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10838:

Attachment: YARN-10838.003.patch

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch, 
> YARN-10838.003.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. They must be eliminated in order to 
> improve the performance of reinitialisation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-07-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10838:

Attachment: (was: YARN-10838.003.patch)

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. They must be eliminated in order to 
> improve the performance of reinitialisation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-07-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10838:

Attachment: YARN-10838.003.patch

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. They must be eliminated in order to 
> improve the performance of reinitialisation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10872) Replace getPropsWithPrefix calls in AutoCreatedQueueTemplate

2021-07-23 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10872:

Parent: YARN-10795
Issue Type: Sub-task  (was: Improvement)

> Replace getPropsWithPrefix calls in AutoCreatedQueueTemplate
> 
>
> Key: YARN-10872
> URL: https://issues.apache.org/jira/browse/YARN-10872
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> With the introduction of YARN-10838, it is now possible to optimise 
> AutoCreatedQueueTemplate and replace its calls to getPropsWithPrefix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10872) Replace getPropsWithPrefix calls in AutoCreatedQueueTemplate

2021-07-23 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10872:
---

 Summary: Replace getPropsWithPrefix calls in 
AutoCreatedQueueTemplate
 Key: YARN-10872
 URL: https://issues.apache.org/jira/browse/YARN-10872
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


With the introduction of YARN-10838, it is now possible to optimise 
AutoCreatedQueueTemplate and replace its calls to getPropsWithPrefix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10657) We should make max application per queue to support node label.

2021-07-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10657:

Attachment: (was: YARN-10657.005.patch)

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch, 
> YARN-10657.003.patch, YARN-10657.004.patch, YARN-10657.005.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should dig into the label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.
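
A sketch of the idea as I read it (all names are assumptions, not the 
eventual patch):

{code:java}
import java.util.Map;

public class MaxAppsSketch {
  /**
   * Illustrative only: derive a queue's max applications from the maximum
   * absolute capacity across all of its accessible node labels, instead of
   * from the default partition alone.
   */
  static int computeMaxApplications(int systemMaxApplications,
      Map<String, Float> absoluteCapacityByLabel) {
    float maxCapacity = 0f;
    for (float capacity : absoluteCapacityByLabel.values()) {
      maxCapacity = Math.max(maxCapacity, capacity);
    }
    return (int) (systemMaxApplications * maxCapacity);
  }
}
{code}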



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10657) We should make max application per queue to support node label.

2021-07-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10657:

Attachment: YARN-10657.005.patch

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch, 
> YARN-10657.003.patch, YARN-10657.004.patch, YARN-10657.005.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should dig into the label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9551) TestTimelineClientV2Impl.testSyncCall fails intermittently

2021-07-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-9551:
--

Assignee: Andras Gyori  (was: Prabhu Joseph)

> TestTimelineClientV2Impl.testSyncCall fails intermittently
> --
>
> Key: YARN-9551
> URL: https://issues.apache.org/jira/browse/YARN-9551
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, test
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Andras Gyori
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> TestTimelineClientV2Impl.testSyncCall fails intermittently
> {code:java}
> Failed
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall
> Failing for the past 1 build (Since #24083 )
> Took 1.5 sec.
> Error Message
> TimelineEntities not published as desired expected:<3> but was:<4>
> Stacktrace
> java.lang.AssertionError: TimelineEntities not published as desired 
> expected:<3> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall(TestTimelineClientV2Impl.java:251)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Standard Output
> 2019-05-13 15:33:46,596 WARN  [main] util.NativeCodeLoader 
> (NativeCodeLoader.java:<clinit>(60)) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 2019-05-13 15:33:47,763 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 0 : 1,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 1 : 2,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 2 : 3,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 3 : 4,
> 2019-05-13 15:33:47,765 INFO  [main] 

[jira] [Assigned] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently

2021-07-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-6272:
--

Assignee: Andras Gyori  (was: Prabhu Joseph)

> TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
> -
>
> Key: YARN-6272
> URL: https://issues.apache.org/jira/browse/YARN-6272
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.0.0-alpha4
>Reporter: Ray Chiang
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-6272-001.patch, YARN-6272.001.patch
>
>
> I'm seeing this unit test fail fairly often in trunk:
> testAMRMClientWithContainerResourceChange(org.apache.hadoop.yarn.client.api.impl.TestAMRMClient)
>   Time elapsed: 5.113 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<0>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:1087)
> at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:963)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10657) We should make max application per queue to support node label.

2021-07-21 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10657:

Attachment: YARN-10657.005.patch

> We should make max application per queue to support node label.
> ---
>
> Key: YARN-10657
> URL: https://issues.apache.org/jira/browse/YARN-10657
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10657.001.patch, YARN-10657.002.patch, 
> YARN-10657.003.patch, YARN-10657.004.patch, YARN-10657.005.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10641?focusedCommentId=17291708=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17291708
> As we discussed in the above comment:
> We should dig into the label-related max applications per queue.
> I think when node labels are enabled on a queue, max applications should 
> consider the max capacity of all labels.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


