[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Chen updated YARN-11073:
-----------------------------
    Description: 
When running a Hive job in a low-capacity queue on an idle cluster, preemption 
kicked in and preempted the job's containers even though no other job was 
running or competing for resources. 

Let's take this scenario as an example:
 * cluster resource: <Memory:168GB, VCores:48>
 ** {_}*queue_low*{_}: min_capacity 1%
 ** queue_mid: min_capacity 19%
 ** queue_high: min_capacity 80%
 * CapacityScheduler with DRF

During the FIFO preemption candidate selection process, the 
_preemptableAmountCalculator_ first needs to "{_}computeIdealAllocation{_}", 
which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
capacity is currently calculated as "Resources.multiply(totalPartitionResource, 
absCapacity)", so the guaranteed capacity of queue_low is:
 * {_}*queue_low*{_}: <Memory: (168*0.01)GB, VCores: (48*0.01)> = 
<Memory:1.68GB, VCores:0.48>. But since the Resource object holds only long 
values, these double values get cast to long, and the final result becomes 
*<Memory:1GB, VCores:0>* (see the sketch below).
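
To make the truncation concrete, here is a minimal, self-contained Java sketch that mirrors the effect described above. It uses the GB-based numbers from this description purely for illustration; it is not the actual Resources.multiply code.
{code:java}
public class GuaranteedCapacityTruncation {
  public static void main(String[] args) {
    // Cluster resources from the example above, with memory expressed in GB
    // to match the description (the scheduler internally tracks MB).
    long clusterMemGb = 168;
    long clusterVcores = 48;
    double absCapacity = 0.01; // queue_low min_capacity = 1%

    // The double products are cast to long, i.e. truncated toward zero,
    // which is what the Resource object ends up storing.
    long guaranteedMemGb = (long) (clusterMemGb * absCapacity);   // 1.68 -> 1
    long guaranteedVcores = (long) (clusterVcores * absCapacity); // 0.48 -> 0

    System.out.println("queue_low guaranteed = <Memory:" + guaranteedMemGb
        + "GB, VCores:" + guaranteedVcores + ">"); // <Memory:1GB, VCores:0>
  }
}
{code}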

Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
capacity across the active queues is also 0 under the current algorithm in 
"{_}resetCapacity{_}". This eventually leads to continuous preemption of the 
job containers running in {_}*queue_low*{_}. 

In order to work around this corner case, I made a small patch (for my own use 
case) around "{_}resetCapacity{_}" to handle a couple of new scenarios (a 
combined sketch of the logic follows this list): 
 # if the sum of absoluteCapacity/minCapacity of all active queues is zero, we 
should normalize their guaranteed capacity evenly:
{code:java}
1.0f / num_of_queues{code}

 # if the sum of the pre-normalized guaranteed capacity values ({_}MB or 
VCores{_}) of all active queues is zero, meaning several queues like queue_low 
had their capacity values cast down to 0, we should normalize evenly as well, 
just like the first scenario (if they are all tiny, the difference hardly 
matters, e.g. 1% vs 1.2%).
 # if one of the active queues has a zero pre-normalized guaranteed capacity 
value but its absoluteCapacity/minCapacity is *not* zero, we should normalize 
based on the weight of the configured queue absoluteCapacity/minCapacity. This 
makes sure _*queue_low*_ gets a small but fair normalized value when 
_*queue_mid*_ is also active:
{code:java}
minCapacity / (sum_of_min_capacity_of_active_queues)
{code}
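
Below is a minimal, self-contained Java sketch of how the three scenarios above could fit together in a resetCapacity-style normalization. The class and method names (ResetCapacitySketch, normalizeGuaranteed, minCapacities, guaranteedValues) are hypothetical; this is an illustration of the intent, not the attached patch.
{code:java}
import java.util.Arrays;

public class ResetCapacitySketch {
  /**
   * Hypothetical normalization over the active queues.
   * minCapacities    : configured absoluteCapacity/minCapacity per active queue
   * guaranteedValues : pre-normalized guaranteed capacity per active queue
   *                    (e.g. VCores after the long cast, so possibly 0)
   */
  static float[] normalizeGuaranteed(float[] minCapacities, long[] guaranteedValues) {
    int n = minCapacities.length;
    float minCapSum = 0f;
    long guaranteedSum = 0L;
    boolean zeroGuaranteedWithNonZeroMinCap = false;
    for (int i = 0; i < n; i++) {
      minCapSum += minCapacities[i];
      guaranteedSum += guaranteedValues[i];
      if (guaranteedValues[i] == 0 && minCapacities[i] > 0f) {
        zeroGuaranteedWithNonZeroMinCap = true;
      }
    }

    float[] normalized = new float[n];
    if (minCapSum == 0f || guaranteedSum == 0L) {
      // Scenarios 1 and 2: nothing meaningful to weight by, so split evenly.
      Arrays.fill(normalized, 1.0f / n);
    } else if (zeroGuaranteedWithNonZeroMinCap) {
      // Scenario 3: a queue like queue_low got truncated to 0, so weight by
      // the configured min capacities instead of the truncated guaranteed values.
      for (int i = 0; i < n; i++) {
        normalized[i] = minCapacities[i] / minCapSum;
      }
    } else {
      // Otherwise, keep the existing behaviour: weight by guaranteed values.
      for (int i = 0; i < n; i++) {
        normalized[i] = (float) guaranteedValues[i] / guaranteedSum;
      }
    }
    return normalized;
  }

  public static void main(String[] args) {
    // queue_low (1%) and queue_mid (19%) are active; their guaranteed VCores
    // after the long cast are 0 and 9 (48 * 0.19 = 9.12 -> 9).
    float[] norm = normalizeGuaranteed(new float[] {0.01f, 0.19f}, new long[] {0L, 9L});
    System.out.println(Arrays.toString(norm)); // approximately [0.05, 0.95]
  }
}
{code}
With queue_low and queue_mid both active, this yields 0.01 / 0.20 = 0.05 for queue_low instead of 0, which is the "small but fair" value described in the third scenario.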
 

This is how I currently work around the issue; it likely needs someone more 
familiar with this component to do a systematic review of the entire 
preemption process and fix it properly. Maybe we can always apply the 
weight-based approach using absoluteCapacity, or maybe we can rewrite the 
Resource code to remove the casting, and so on.



> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11073
>                 URL: https://issues.apache.org/jira/browse/YARN-11073
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, scheduler preemption
>    Affects Versions: 2.10.1
>            Reporter: Jian Chen
>            Priority: Major
>         Attachments: YARN-11073.tmp-1.patch
>



