Paul Santa Clara created YUNIKORN-2678:
------------------------------------------
             Summary: Yunikorn does not appear to be considering Guaranteed resources when allocating Pending Pods.
                 Key: YUNIKORN-2678
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2678
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
    Affects Versions: 1.5.1
         Environment: EKS 1.29
            Reporter: Paul Santa Clara
         Attachments: jira-queues.yaml, jira-tier0-screenshot.png, jira-tier1-screenshot.png, jira-tier2-screenshot.png, jira-tier3-screenshot.png

Please see the attached queue configuration (jira-queues.yaml). I will create 100 pods in each of Tier0, Tier1, Tier2, and Tier3. Each Pod will require 1 VCore. Initially there will be 0 suitable nodes to run the Pods, so all will be Pending. Karpenter will soon provision nodes and Yunikorn will react by binding the Pods.

Given this [code|https://github.com/apache/yunikorn-core/blob/a786feb5761be28e802d08976d224c40639cd86b/pkg/scheduler/objects/sorters.go#L81C74-L81C95], I would expect Yunikorn to distribute the allocations such that each of the tiered queues reaches its Guarantees. Instead, I observed a roughly even distribution of allocations across all of the queues: Tier0 fails to meet its Guarantees while Tier3, for instance, dramatically overshoots them.

{code:java}
> kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
86
> kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
83
> kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
78
> kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
77
{code}

Please see the attached screenshots for queue usage.

Note: this situation can also be reproduced without Karpenter by simply setting Yunikorn's `service.schedulingInterval` to a high duration, say 1m. Doing so forces Yunikorn to react to 400 Pods -across 4 queues- at roughly the same time, forcing prioritization of queue allocations.
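For clarity, the expected behavior can be sketched as follows. This is an illustrative simplification, not YuniKorn's actual sorter code: queues further below their Guarantee should be served first, so a queue under its guarantee (Tier0) should win allocations over one already above it (Tier3). The `used`/`guaranteed` numbers below are made up to match the shape of the observation, not taken from jira-queues.yaml.

{code:java}
def usage_ratio(used_vcores, guaranteed_vcores):
    """Fraction of the guarantee currently consumed; > 1.0 means overshooting."""
    if guaranteed_vcores <= 0:
        return float("inf")  # no guarantee: always considered 'over'
    return used_vcores / guaranteed_vcores

# Illustrative numbers only (86/100 and 77/100 Pods Pending as observed).
queues = {
    "root.tiers.0": {"used": 14, "guaranteed": 100},  # far below its guarantee
    "root.tiers.3": {"used": 23, "guaranteed": 5},    # far above its guarantee
}

# Lower ratio sorts first, i.e. should receive the next allocation.
order = sorted(queues, key=lambda q: usage_ratio(queues[q]["used"],
                                                 queues[q]["guaranteed"]))
print(order[0])  # root.tiers.0 -- but YuniKorn allocates roughly evenly instead
{code}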
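For the Karpenter-free reproduction mentioned above, the interval can be raised in the scheduler configuration. A sketch, assuming the standard `yunikorn-configs` ConfigMap layout (name and namespace may differ per deployment):

{code:java}
apiVersion: v1
kind: ConfigMap
metadata:
  name: yunikorn-configs
  namespace: yunikorn
data:
  # Raised from the default so all 400 Pods are Pending before a scheduling pass
  service.schedulingInterval: "1m"
{code}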
Test code to generate Pods:

{code:java}
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()


def create_pod_manifest(tier, exec_id):
    pod_manifest = {
        'apiVersion': 'v1',
        'kind': 'Pod',
        'metadata': {
            'name': f"rolling-test-tier-{tier}-exec-{exec_id}",
            'namespace': 'finance',
            'labels': {
                'applicationId': f"MyOwnApplicationId-tier-{tier}",
                'queue': f"root.tiers.{tier}"
            },
            # user.info must live under annotations, not directly under metadata
            'annotations': {
                "yunikorn.apache.org/user.info": '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
            }
        },
        'spec': {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "di.rbx.com/dedicated",
                                        "operator": "In",
                                        "values": ["spark"]
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            "tolerations": [
                {
                    "effect": "NoSchedule",
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "spark"
                }
            ],
            "schedulerName": "yunikorn",
            'restartPolicy': 'Always',
            'containers': [{
                "name": "ubuntu",
                'image': 'ubuntu',
                "command": ["sleep", "604800"],
                "imagePullPolicy": "IfNotPresent",
                "resources": {
                    "limits": {'cpu': "1"},
                    "requests": {'cpu': "1"}
                }
            }]
        }
    }
    return pod_manifest


# 4 tiers x 100 executors = 400 Pods, each requesting 1 VCore
for i in range(0, 4):
    tier = str(i)
    for j in range(0, 100):
        exec_id = str(j)
        pod_manifest = create_pod_manifest(tier, exec_id)
        print(pod_manifest)
        api_response = v1.create_namespaced_pod(body=pod_manifest, namespace="finance")
        print(f"creating tier( {tier} ) exec( {exec_id} )")
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org