[jira] [Updated] (YUNIKORN-3137) Fails to preempt more than 2 victims for a larger ask.

Sudipto Batal (Jira) Fri, 17 Oct 2025 08:52:45 -0700


     [ 
https://issues.apache.org/jira/browse/YUNIKORN-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sudipto Batal updated YUNIKORN-3137:
------------------------------------
    Description: 
h3. Problem Description

If a large pod ({{ask}}) can only fit after multiple smaller pods are evicted, 
the scheduler preempts at most two of them, so the {{ask}} is never scheduled 
even when the total ask is under the guaranteed limit.

Reference code: 
[preemption.go#L629-L642|https://github.com/apache/yunikorn-core/blob/7511f30539c781b30568047df20a8127b0278260/pkg/scheduler/objects/preemption.go#L629-L642]

For example, if the ask is {*}{vcore: 300, memory: 300, pod: 1}{*} and each 
victim is of size {*}{vcore: 100, memory: 100, pod: 1}{*}, then after two 
iterations victimsTotalResource becomes {*}{vcore: 200, memory: 200, pod: 2}{*}. 
At this point, no additional victims are added to the finalVictims list because 
of the following condition:
{code:java}
if p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting(victimsTotalResource)
{code}
As a result, two pods are evicted needlessly: the freed resources are still 
insufficient for the ask, and the large pod stays unscheduled.
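
The behaviour described above can be reproduced with a small standalone sketch. It does not use the yunikorn-core types: {{Resource}}, {{strictlyGreaterOnlyExisting}} and {{selectVictims}} are hypothetical stand-ins, and the comparison is assumed to mean "no quantity in the ask is smaller than in the victim total, and at least one is larger". The loop is likewise assumed to keep a victim only while that comparison is true. The real method may differ in detail.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for the scheduler's
// resource type: a map from resource name to quantity.
type Resource map[string]int64

// strictlyGreaterOnlyExisting is an ASSUMED model of the comparison in the
// report: true when no type present in r is smaller than in other and at
// least one type is larger. Types missing from other count as zero.
func (r Resource) strictlyGreaterOnlyExisting(other Resource) bool {
	greater := false
	for name, qty := range r {
		o := other[name]
		if qty < o {
			return false
		}
		if qty > o {
			greater = true
		}
	}
	return greater
}

// selectVictims models the loop described in the report: a victim is kept
// only while the ask still compares as greater than the running total.
func selectVictims(ask Resource, victims []Resource) (int, Resource) {
	total := Resource{}
	count := 0
	for _, v := range victims {
		if ask.strictlyGreaterOnlyExisting(total) {
			count++
			for name, qty := range v {
				total[name] += qty
			}
		}
	}
	return count, total
}

func main() {
	ask := Resource{"vcore": 300, "memory": 300, "pod": 1}
	victim := Resource{"vcore": 100, "memory": 100, "pod": 1}
	n, total := selectVictims(ask, []Resource{victim, victim, victim, victim})
	fmt.Printf("victims selected: %d, freed vcore: %d, freed pod: %d\n",
		n, total["vcore"], total["pod"])
}
```

In this model the check turns false as soon as the victims' pod count (2) exceeds the ask's pod count (1), although vcore and memory are still 100 short, which matches the two-victim limit reported here.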
h3. Reproduce

See the attachments (job-child1.yaml, job-child2.yaml, queues.yaml) for the job 
and queue configurations.
h4. Phase 1: Initial Allocation
 # {*}job-child1 → child1{*}: Request 10 pods × 100m CPU, 100Mi Memory each
 ** {*}Gets{*}: 6 pods × 100m CPU, 100Mi Memory = 600m CPU, 600Mi Memory 
(cluster max)
 ** {*}Remaining{*}: 4 pods pending (400m CPU, 400Mi Memory needed)
 # {*}job-child2 → child2{*}: Request 10 pods × 300m CPU, 300Mi Memory each
 ** {*}Gets{*}: 0 pods initially (no resources available)
 ** {*}Needs{*}: 300m CPU, 300Mi Memory to meet guarantee

h4. Phase 2: Preemption Attempt for Guarantee
 # {*}Preemption for child2 guarantee{*}: Try to free 300m CPU, 300Mi Memory
 ** {*}Victims{*}: should preempt 3 pods from child1 (3 × 100m CPU, 100Mi 
Memory = 300m CPU, 300Mi Memory)
 ** {color:#de350b}Only 2 pods are actually preempted due to the condition in 
preemption.go{color}
 ** {*}Freed resources{*}: 200m CPU, 200Mi Memory (insufficient for child2 
guarantee)
 ** {*}Result{*}: child2 gets 0 pods, guarantee not met
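
The expected victim count in Phase 2 follows from a ceiling division per resource type. The sketch below is illustrative only: {{victimsNeeded}} and the resource keys are hypothetical names, and it assumes all victims are the same size.

```go
package main

import "fmt"

// victimsNeeded returns how many equally sized victims must be evicted so
// that the freed amount covers the shortfall in every resource type.
// Hypothetical helper, not part of yunikorn-core.
func victimsNeeded(shortfall, victimSize map[string]int64) int64 {
	var needed int64
	for name, want := range shortfall {
		size := victimSize[name]
		if size <= 0 {
			continue // victims free nothing of this type; ignored in this sketch
		}
		n := (want + size - 1) / size // ceiling division
		if n > needed {
			needed = n
		}
	}
	return needed
}

func main() {
	// Values from the reproduction: child2 needs 300m CPU and 300Mi memory,
	// each child1 pod frees 100m CPU and 100Mi memory.
	shortfall := map[string]int64{"cpu_m": 300, "memory_Mi": 300}
	victim := map[string]int64{"cpu_m": 100, "memory_Mi": 100}
	fmt.Println("victims needed:", victimsNeeded(shortfall, victim))
}
```

With the values from the reproduction this yields 3, one more than the 2 pods that are actually preempted.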


> Fails to preempt more than 2 victims for a larger ask.
> ------------------------------------------------------
>
>                 Key: YUNIKORN-3137
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3137
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>         Environment: Kind
>            Reporter: Sudipto Batal
>            Priority: Major
>             Fix For: 1.7.0
>
>         Attachments: job-child1.yaml, job-child2.yaml, queues.yaml
>


