[jira] [Updated] (YARN-11103) SLS cleanup after previously merged SLS refactor jiras

2022-03-28 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11103:
--
Description: 
There have been some jiras that moved SLS code around in order to make 
SLSRunner more readable.
Mostly, the code fragments were just moved to separate classes.
Most of the reported issues surfaced only because our build system flagged 
them as failures; they were part of the original code, not newly introduced.
There were some comments about fixing these; here are all of them I found: 
* https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336
* https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17513012&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17513012
* https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17511762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17511762
* https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17390981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17390981
* https://issues.apache.org/jira/browse/YARN-10547?focusedCommentId=17510839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17510839
* https://issues.apache.org/jira/browse/YARN-11094?focusedCommentId=17512324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512324


> SLS cleanup after previously merged SLS refactor jiras
> --
>
> Key: YARN-11103
> URL: https://issues.apache.org/jira/browse/YARN-11103
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler-load-simulator
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>
> There have been some jiras that moved SLS code around in order to make 
> SLSRunner more readable.
> Mostly, the code fragments were just moved to separate classes.
> Most of the reported issues surfaced only because our build system flagged 
> them as failures; they were part of the original code, not newly introduced.
> There were some comments about fixing these; here are all of them I found: 
> * https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336
> * https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17513012&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17513012
> * https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17511762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17511762
> * https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17390981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17390981
> * https://issues.apache.org/jira/browse/YARN-10547?focusedCommentId=17510839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17510839
> * https://issues.apache.org/jira/browse/YARN-11094?focusedCommentId=17512324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512324



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11103) SLS cleanup after previously merged SLS refactor jiras

2022-03-28 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11103:
--
Description: 
There have been some jiras that moved SLS code around in order to make 
SLSRunner more readable.
Mostly, the code fragments were just moved to separate classes.
Most of the reported issues surfaced only because our build system flagged 
them as failures; they were part of the original code, not newly introduced.
There were some comments about fixing these; here are all of them I found, 
and we need to fix them (if they are not fixed yet):
* https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336
* https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17513012&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17513012
* https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17511762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17511762
* https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17390981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17390981
* https://issues.apache.org/jira/browse/YARN-10547?focusedCommentId=17510839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17510839
* https://issues.apache.org/jira/browse/YARN-11094?focusedCommentId=17512324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512324


  was:
There have been some jiras that moved around SLS code in order to have a more 
readable SLSRunner.
Mostly, the code fragments were just moved to separate classes.
Most of the issues came up were just because our build system detected them as 
failures but they were part of the original code so they were not newly 
introduced issues.
There were some comments about fixing these, here are all of them I found: 
* 
https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336
https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17513012&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17513012
https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17511762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17511762
https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17390981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17390981
https://issues.apache.org/jira/browse/YARN-10547?focusedCommentId=17510839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17510839
https://issues.apache.org/jira/browse/YARN-11094?focusedCommentId=17512324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512324



> SLS cleanup after previously merged SLS refactor jiras
> --
>
> Key: YARN-11103
> URL: https://issues.apache.org/jira/browse/YARN-11103
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler-load-simulator
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>
> There have been some jiras that moved SLS code around in order to make 
> SLSRunner more readable.
> Mostly, the code fragments were just moved to separate classes.
> Most of the reported issues surfaced only because our build system flagged 
> them as failures; they were part of the original code, not newly introduced.
> There were some comments about fixing these; here are all of them I found, 
> and we need to fix them (if they are not fixed yet):
> * https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336
> * https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17513012&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17513012
> * https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17511762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17511762
> * https://issues.apache.org/jira/browse/YARN-10552?focusedCommentId=17390981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17390981
> * https://issues.apache.org/jira/browse/YARN-10547?focusedCommentId=17510839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17510839
> * https://issues.apache.org/jira/browse/YARN-11094?focusedCommentId=17512324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512324




[jira] [Created] (YARN-11103) SLS cleanup after previously merged SLS refactor jiras

2022-03-28 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-11103:
-

 Summary: SLS cleanup after previously merged SLS refactor jiras
 Key: YARN-11103
 URL: https://issues.apache.org/jira/browse/YARN-11103
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler-load-simulator
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth









[jira] [Assigned] (YARN-11102) Fix spotbugs error in hadoop-sls module

2022-03-28 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-11102:
-

Assignee: Szilard Nemeth

> Fix spotbugs error in hadoop-sls module
> ---
>
> Key: YARN-11102
> URL: https://issues.apache.org/jira/browse/YARN-11102
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akira Ajisaka
>Assignee: Szilard Nemeth
>Priority: Major
>
> Fix the following Spotbugs error:
> - org.apache.hadoop.yarn.sls.AMRunner.setInputTraces(String[]) may expose 
> internal representation by storing an externally mutable object into 
> AMRunner.inputTraces At AMRunner.java:by storing an externally mutable object 
> into AMRunner.inputTraces At AMRunner.java:[line 267]
> - Write to static field org.apache.hadoop.yarn.sls.AMRunner.REMAINING_APPS 
> from instance method org.apache.hadoop.yarn.sls.AMRunner.startAM() At 
> AMRunner.java:from instance method 
> org.apache.hadoop.yarn.sls.AMRunner.startAM() At AMRunner.java:[line 116]
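
For context, a minimal sketch of the usual fix for the first warning: clone the array instead of storing the caller's reference. The class below is a stand-in for illustration, not the real AMRunner:

```java
// Sketch of the defensive-copy fix SpotBugs suggests for
// "may expose internal representation by storing an externally
// mutable object". Only the field/method names come from the
// warning text; the rest is assumed.
public class DefensiveCopySketch {
    private String[] inputTraces;

    public void setInputTraces(String[] inputTraces) {
        // Before (flagged): this.inputTraces = inputTraces;
        // After: store a copy, so later mutations of the caller's
        // array cannot change our internal state.
        this.inputTraces = (inputTraces == null) ? null : inputTraces.clone();
    }

    public String[] getInputTraces() {
        // Return a copy for the same reason.
        return (inputTraces == null) ? null : inputTraces.clone();
    }

    public static void main(String[] args) {
        DefensiveCopySketch runner = new DefensiveCopySketch();
        String[] traces = {"job-trace.json"};
        runner.setInputTraces(traces);
        traces[0] = "mutated";
        // Internal state is unaffected by the caller's mutation.
        System.out.println(runner.getInputTraces()[0]); // prints "job-trace.json"
    }
}
```

The second warning (write to a static field from an instance method) is typically resolved by making the field non-static or by confining writes to static methods; which fix fits here depends on the surrounding code.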






[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2022-03-28 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513547#comment-17513547
 ] 

Benjamin Teke commented on YARN-10559:
--

[~ananyo_rao] since this has been pending for a while now and the patch no 
longer applies to trunk, do you plan to update it? If not, would you mind if 
I took over managing/updating it?

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch, 
> YARN-10559.0005.patch, YARN-10559.0006.patch, YARN-10559.0007.patch, 
> YARN-10559.0008.patch, YARN-10559.0009.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, if a single user submits 
> a large application to a queue (using 100% of its resources), that job will 
> not be preempted by future applications from the same user within the same 
> queue. This implies that the later applications will be forced to wait for 
> the long-running application to complete. This prevents multiple 
> long-running, large applications from running concurrently.
> Support fair sharing among apps while preempting applications from the same queue.






[jira] [Comment Edited] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2022-03-28 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513547#comment-17513547
 ] 

Benjamin Teke edited comment on YARN-10559 at 3/28/22, 6:10 PM:


[~ananyo_rao] since this has been pending for a while now and the patch no 
longer applies to trunk, do you plan to update it? If not, would you mind if 
I took over the task of managing/updating it?


was (Author: bteke):
[~ananyo_rao] since this is pending for a while now and the patch doesn't apply 
to trunk anymore, do you plan to update it? If not, would you mind if I take 
over managing/updating it?

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch, 
> YARN-10559.0005.patch, YARN-10559.0006.patch, YARN-10559.0007.patch, 
> YARN-10559.0008.patch, YARN-10559.0009.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, if a single user submits 
> a large application to a queue (using 100% of its resources), that job will 
> not be preempted by future applications from the same user within the same 
> queue. This implies that the later applications will be forced to wait for 
> the long-running application to complete. This prevents multiple 
> long-running, large applications from running concurrently.
> Support fair sharing among apps while preempting applications from the same queue.






[jira] [Commented] (YARN-10548) Decouple AM runner logic from SLSRunner

2022-03-28 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513456#comment-17513456
 ] 

Szilard Nemeth commented on YARN-10548:
---

Hi [~aajisaka],

Thanks for raising YARN-11102.
We agreed to only move the code in this jira and to fix the issues triggered 
by the original code later.
Since the code quality of SLS is in pretty bad shape, there's simply no way 
to address all issues in one patch.
With YARN-10548, I only moved code around.
The old code was modifying the REMAINING_APPS static field from SLSRunner, too.
None of these warnings (spotbugs, javac) are triggered by our code changes; 
they come from the original code.

Also, [~quapaw] had a related comment 
[here|https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512336], 
and 
[here|https://issues.apache.org/jira/browse/YARN-10548?focusedCommentId=17512463&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17512463]
 we agreed on a follow-up jira as well.

I will fix the Spotbugs and javac issues in YARN-11102.
I hope this explanation makes the picture clearer.

Thanks.

> Decouple AM runner logic from SLSRunner
> ---
>
> Key: YARN-10548
> URL: https://issues.apache.org/jira/browse/YARN-10548
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10548.001.patch, YARN-10548.002.patch, 
> YARN-10548.003.patch
>
>
> SLSRunner has too many responsibilities.
>  One of them is to parse the job details from the SLS input formats and 
> launch the AMs and task containers.
>  The AM runner logic could be decoupled.






[jira] [Updated] (YARN-11052) Improve code quality in TestRMWebServicesNodeLabels

2022-03-28 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11052:
--
Fix Version/s: 3.4.0

> Improve code quality in TestRMWebServicesNodeLabels
> ---
>
> Key: YARN-11052
> URL: https://issues.apache.org/jira/browse/YARN-11052
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Possible code improvements I've identified in this class:
> 1. In TestRMWebServicesNodeLabels#testNodeLabels: Missing HTTP response 
> status code checks for successful operations
> 2. Some methods are throwing too many types of Exceptions, e.g. "throws 
> JSONException, Exception" can be simplified to "throws Exception"
> 3. Repeated code fragments can be replaced. E.g.: 
> {code}
> assertEquals(MediaType.APPLICATION_JSON_TYPE + "; " + JettyUtils.UTF_8,
> response.getType().toString());
> {code}
> 4. Some variable names are confusing, e.g. "NodeLabelsInfo nlsifo"
> 5. There are many node label related endpoint calls, all copy-pasted, for 
> example: 
> {code}
> response =
> r.path("ws").path("v1").path("cluster")
> .path("add-node-labels").queryParam("user.name", userName)
> .accept(MediaType.APPLICATION_JSON)
> .entity(toJson(nodeLabelsInfo, NodeLabelsInfo.class),
> MediaType.APPLICATION_JSON)
> .post(ClientResponse.class);
> {code}
> This is just an example; many endpoint calls are duplicated like this, and 
> they could be extracted to methods.
> 6. Duplicated validation code of all REST endpoints can be simplified
> 7. There are weird, repeated log strings that could be removed, like: 
> {code}
> LOG.info("posted node nodelabel")
> {code}
> 8. Constants could be added for labels, node ids, etc.
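
As an illustration of point 5, a hedged sketch of how the copy-pasted endpoint calls could collapse into one parameterized helper. The method below is a stand-in that only builds a description of the request; the real helper would wrap the Jersey WebResource fluent calls quoted above:

```java
// Sketch of extracting the duplicated request-building chain into
// one helper. callClusterEndpoint is a hypothetical name; in the
// real test it would return a Jersey ClientResponse.
public class EndpointHelperSketch {
    static String callClusterEndpoint(String endpoint, String userName,
            String jsonBody) {
        // Stand-in for:
        //   r.path("ws").path("v1").path("cluster").path(endpoint)
        //    .queryParam("user.name", userName)...post(ClientResponse.class)
        return "POST /ws/v1/cluster/" + endpoint
                + "?user.name=" + userName + " body=" + jsonBody;
    }

    public static void main(String[] args) {
        // Every former copy-paste site becomes a one-liner:
        System.out.println(
            callClusterEndpoint("add-node-labels", "admin", "{\"nodeLabels\":[]}"));
    }
}
```

The same extraction pattern applies to the repeated response-validation asserts in point 3.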






[jira] [Commented] (YARN-10549) Decouple RM runner logic from SLSRunner

2022-03-28 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513269#comment-17513269
 ] 

Szilard Nemeth commented on YARN-10549:
---

Created a PR

> Decouple RM runner logic from SLSRunner
> ---
>
> Key: YARN-10549
> URL: https://issues.apache.org/jira/browse/YARN-10549
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>  Labels: pull-request-available
> Attachments: YARN-10549.001.patch, YARN-10549.002.patch, 
> YARN-10549.003.patch, YARN-10549.004.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SLSRunner has too many responsibilities.
>  One of them is to parse the job details from the SLS input formats and 
> launch the AMs and task containers.
>  The RM runner logic could be decoupled.






[jira] [Updated] (YARN-10549) Decouple RM runner logic from SLSRunner

2022-03-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10549:
--
Labels: pull-request-available  (was: )

> Decouple RM runner logic from SLSRunner
> ---
>
> Key: YARN-10549
> URL: https://issues.apache.org/jira/browse/YARN-10549
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>  Labels: pull-request-available
> Attachments: YARN-10549.001.patch, YARN-10549.002.patch, 
> YARN-10549.003.patch, YARN-10549.004.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SLSRunner has too many responsibilities.
>  One of them is to parse the job details from the SLS input formats and 
> launch the AMs and task containers.
>  The RM runner logic could be decoupled.






[jira] [Assigned] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-28 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11073:


Assignee: Jian Chen

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in and preempted the job's containers even though no other 
> job was running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
> * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Double values get cast to Long, and then the final result 
> becomes 0
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple of new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
> * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got cast to 0, we should normalize 
> evenly as well, as in the first scenario (if they are all tiny, it really 
> makes no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue; it might need someone more 
> familiar with this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, rewrite the Resource code 
> to remove the casting, or always round up when calculating a queue's 
> guaranteed capacity, etc.
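
A sketch, with hypothetical numbers (the jira's concrete cluster values are elided above), of the Long-cast truncation and of the weight-based normalization from the third bullet. This is my own reconstruction for illustration, not the actual patch:

```java
import java.util.Arrays;

// Sketch of the weight-based normalization described in the bullets:
// normalize active queues by their configured minCapacity weights,
// falling back to an even split when all weights are zero.
public class ResetCapacitySketch {
    static double[] normalizeByMinCapacity(double[] minCapacity) {
        double sum = 0;
        for (double c : minCapacity) {
            sum += c;
        }
        double[] normalized = new double[minCapacity.length];
        if (sum == 0) {
            // First scenario: all weights zero -> share evenly.
            Arrays.fill(normalized, 1.0 / minCapacity.length);
        } else {
            // Third scenario: minCapacity / sum_of_min_capacity_of_active_queues
            for (int i = 0; i < minCapacity.length; i++) {
                normalized[i] = minCapacity[i] / sum;
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Hypothetical cluster with 80 total vcores: queue_low's 1%
        // guaranteed share (0.8 vcores) truncates to 0 on the Long cast.
        long totalVCores = 80;
        long guaranteedLow = (long) (totalVCores * 0.01);
        System.out.println("queue_low guaranteed vcores after cast: " + guaranteedLow);

        // Weight-based normalization keeps a small but non-zero share for
        // queue_low (1%) when queue_mid (19%) is also active.
        double[] shares = normalizeByMinCapacity(new double[]{0.01, 0.19});
        System.out.println(Arrays.toString(shares));
    }
}
```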






[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-28 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513261#comment-17513261
 ] 

Akira Ajisaka commented on YARN-11073:
--

Thank you [~jchenjc22]. Yes, I agree that testing the preemption behavior is 
difficult. For unit testing, I think it's sufficient to verify the behavior 
of the resetCapacity method that you have changed.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in and preempted the job's containers even though no other 
> job was running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
> * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Double values get cast to Long, and then the final result 
> becomes 0
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple of new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
> * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got cast to 0, we should normalize 
> evenly as well, as in the first scenario (if they are all tiny, it really 
> makes no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue; it might need someone more 
> familiar with this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, rewrite the Resource code 
> to remove the casting, or always round up when calculating a queue's 
> guaranteed capacity, etc.


