Sudipto Batal created YUNIKORN-3137:
---------------------------------------
Summary: Fails to preempt more than 2 victims for a larger ask.
Key: YUNIKORN-3137
URL: https://issues.apache.org/jira/browse/YUNIKORN-3137
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Environment: Kind
Reporter: Sudipto Batal
Fix For: 1.7.0
Attachments: job-child1.yaml, job-child2.yaml, queues.yaml
h3. Problem Description
If a large pod ({{ask}}) requires evicting multiple smaller pods to fit,
the scheduler preempts at most two pods, so the {{ask}} is never
scheduled even when the total ask is under the guaranteed limit.
Reference code:
[preemption.go#L629-L642|https://github.com/apache/yunikorn-core/blob/7511f30539c781b30568047df20a8127b0278260/pkg/scheduler/objects/preemption.go#L629-L642]
For example, if the ask is {{{}{vcore: 300, memory: 300, pod: 1}{}}} and each
victim is of size {{{}{vcore: 100, memory: 100, pod: 1}{}}}, then after two
iterations {{victimsTotalResource}} becomes
{{{}{vcore: 200, memory: 200, pod: 2}{}}}. At this point, no additional victims
are added to the {{finalVictims}} list due to the following condition:
{code:go}
if p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting(victimsTotalResource)
{code}
As a result, two pods are evicted to no effect: the freed resources are still
insufficient for the ask, and the large pod remains unscheduled.
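The arithmetic above can be replayed with a small standalone Go sketch. It is not the yunikorn-core code: {{Resource}}, {{add}}, {{strictlyGreaterOnlyExisting}} and {{selectVictims}} are hypothetical stand-ins. The comparison semantics are an assumption about how {{StrictlyGreaterThanOnlyExisting}} behaves: it is taken to be true only if, for every type present in the ask, the ask is not smaller and at least one type is larger. Under that assumption it is the {{pod}} count that stops the loop, since after two victims the total holds 2 pods against the ask's 1.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for the scheduler's resource type.
type Resource map[string]int64

// add returns a new Resource holding the per-type sum of a and b.
func add(a, b Resource) Resource {
	out := Resource{}
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

// strictlyGreaterOnlyExisting models the ASSUMED semantics of the check in the
// issue: only types present in ask are compared; the result is false as soon as
// other exceeds ask for any type, and true only if ask is larger for at least one.
func strictlyGreaterOnlyExisting(ask, other Resource) bool {
	larger := false
	for k, v := range ask {
		o := other[k] // a missing type counts as 0
		if o > v {
			return false
		}
		if v > o {
			larger = true
		}
	}
	return larger
}

// selectVictims adds candidates to the running total only while the ask is
// still "strictly greater" than that total, mirroring the described loop.
func selectVictims(ask Resource, candidates []Resource) (int, Resource) {
	total := Resource{}
	count := 0
	for _, c := range candidates {
		if !strictlyGreaterOnlyExisting(ask, total) {
			break
		}
		total = add(total, c)
		count++
	}
	return count, total
}

func main() {
	ask := Resource{"vcore": 300, "memory": 300, "pod": 1}
	victim := Resource{"vcore": 100, "memory": 100, "pod": 1}
	candidates := []Resource{victim, victim, victim, victim, victim, victim}

	count, total := selectVictims(ask, candidates)
	fmt.Printf("victims selected: %d, freed: %v\n", count, total)
	fmt.Println("vcore still missing:", ask["vcore"]-total["vcore"])
}
```

In this model the selection stops at two victims with 200 vcore and 200 memory freed, 100 short of the ask in both dimensions. Dropping the {{pod}} entry from the ask in the sketch lets a third victim through, which suggests the pod dimension is worth examining in the real check.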
h3. Reproduce
See the attachments (job-child1.yaml, job-child2.yaml, queues.yaml) for the job
and queue configurations.
h4. Phase 1: Initial Allocation
# {*}job-child1 → child1{*}: Request 10 pods × 100m CPU, 100Mi Memory each
** {*}Gets{*}: 6 pods × 100m CPU, 100Mi Memory = 600m CPU, 600Mi Memory
(cluster max)
** {*}Remaining{*}: 4 pods pending (400m CPU, 400Mi Memory needed)
# {*}job-child2 → child2{*}: Request 10 pods × 300m CPU, 300Mi Memory each
** {*}Gets{*}: 0 pods initially (no resources available)
** {*}Needs{*}: 300m CPU, 300Mi Memory to meet guarantee
h4. Phase 2: Preemption Attempt for Guarantee
# {*}Preemption for child2 guarantee{*}: Try to free 300m CPU, 300Mi Memory
** {*}Victims{*}: should preempt 3 pods from child1 (3 × 100m CPU, 100Mi
Memory = 300m CPU, 300Mi Memory)
** {color:#de350b}Only 2 pods are actually preempted due to the condition in
preemption.go{color}
** {*}Freed resources{*}: 200m CPU, 200Mi Memory (insufficient for child2
guarantee)
** {*}Result{*}: child2 gets 0 pods, guarantee not met
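The expected victim count in Phase 2 is a ceiling division of the shortfall by the victim size. The sketch below only checks that arithmetic with the numbers from the reproduction; {{victimsNeeded}} is a hypothetical helper, not a yunikorn-core function, and it assumes all victims are the same size.

```go
package main

import "fmt"

// victimsNeeded returns how many equally sized victims must be evicted to free
// at least shortfall units of a single resource type (ceiling division).
// Hypothetical helper for illustration only.
func victimsNeeded(shortfall, perVictim int64) int64 {
	if perVictim <= 0 {
		return 0 // no meaningful answer for a non-positive victim size
	}
	return (shortfall + perVictim - 1) / perVictim
}

func main() {
	const askCPU, victimCPU = 300, 100 // millicores, from the reproduction

	fmt.Println("victims needed:", victimsNeeded(askCPU, victimCPU))
	fmt.Println("CPU still missing after 2 evictions (m):", askCPU-2*victimCPU)
}
```

The same calculation applies to memory (300Mi needed, 100Mi per victim), so three victims are required in both dimensions while only two are evicted.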