Adam Novak created YUNIKORN-1185:
------------------------------------

             Summary: Small applications starve large ones in the same FIFO queue
                 Key: YUNIKORN-1185
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Adam Novak

Even when I set my queue to use a {{fifo}} application sort policy, applications that enter the queue later are able to run before applications that were submitted earlier; the queue does not behave like a first-in, first-out queue.

Specifically, this happens when the later applications are smaller than the earlier ones. If enough small applications are available in the queue to immediately fill any space that opens up, they will schedule as soon as space is available. YuniKorn doesn't wait for enough space to become free to schedule waiting large applications, no matter how much older they are than the applications that are passing them in the queue.

The result is that a steady supply of small applications can keep a larger application waiting indefinitely, causing starvation.

The relevant code seems to be [in Queue's tryAllocate method|https://github.com/apache/yunikorn-core/blob/73d55282f052f53852cc156d626c155ca5dddca2/pkg/scheduler/objects/queue.go#L1069-L1070]. YuniKorn goes through all the applications in the queue in order, and greedily schedules work items until no more fit. If no space large enough to fit any work from the first application currently exists, it will always fill what space there is with work from applications later in the queue. It will never wait to drain out space on a node to fit work from that first application.

How can I configure or modify YuniKorn to prevent starvation, and make the applications in a queue execute in order, or at least not arbitrarily far out of order?

(I already tried the {{stateaware}} queue sort, but it doesn't seem to work well with applications as small as mine. It appeared to run only one application at a time, because my applications finish so fast.)

h4. Replication

First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.

Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:

{code:java}
embedAdmissionController: false
configuration: |
  partitions:
    - name: default
      placementrules:
        - name: tag
          value: namespace
          create: true
      queues:
        - name: root
          submitacl: '*'
          childtemplate:
            properties:
              application.sort.policy: fifo
{code}

Then, run this script:

{code:java}
#!/usr/bin/env bash
# test-yunikorn.sh: Make sure YuniKorn prevents starvation
set -e

# Set this to annotate jobs other than the middle job
OTHER_JOB_ANNOTATIONS=''
# And similarly for the middle job
MIDDLE_JOB_ANNOTATIONS=''

# Where should we run?
#NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'

# How many 10-core jobs do we need to fill everywhere we will run?
SCALE="30"

# Clean up
kubectl delete job -l app=yunikorntest || true

# Make 10 core jobs that will block out our test job for at least 2 minutes
# Make sure they don't all finish at once.
rm -f jobs_before.yml
for NUM in $(seq 1 ${SCALE}) ; do
cat >>jobs_before.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: presleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: before-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "$(( $RANDOM % 20 + 120 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# How many jobs do we need to fill the cluster to compete against?
COMPETING_JOBS="$((SCALE*20))"

# And 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
# We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
rm -f jobs_after.yml
for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
cat >>jobs_after.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: postsleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: after-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "$(( $RANDOM % 20 + 60 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# And the test job itself between them.
rm -f job_middle.yml
cat >job_middle.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: middle
  labels:
    app: yunikorntest
  ${MIDDLE_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: middle
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "1"]
        resources:
          limits:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
EOF

kubectl apply -f jobs_before.yml
sleep 10
kubectl apply -f job_middle.yml
sleep 10
CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
kubectl apply -f jobs_after.yml

# Wait for it to finish
echo "Waiting for middle job to finish..."
COMPLETION_TIME=""
while [[ -z "${COMPLETION_TIME}" ]] ; do
    sleep 10
    JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
    if [[ "${JOB_STATE}" == "1" ]] ; then
        COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
    fi
done
echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
{code}

You will see that YuniKorn runs the vast majority of the "postsleep" jobs before allowing the "middle" job to schedule and run, even though the "middle" job was submitted to the queue first. By increasing the number of "postsleep" jobs submitted, you can starve the "middle" job for an arbitrarily long amount of time.
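To make the mechanism described above easier to reason about without a cluster, here is a toy model of a single 96-core node, written as a shell function. It is only a sketch and is not YuniKorn code: the function name {{middle_start_tick}}, the one-tick time step, the nine-tick duration of the later jobs, and the {{strict}} mode are all assumptions made up for illustration. The {{greedy}} mode mimics the behavior reported here (fill any free space from later applications); the {{strict}} mode is a hypothetical head-of-line-blocking alternative in which nothing may pass the first application.

```shell
# Toy model, NOT YuniKorn code. One 96-core node starts with nine running
# 10-core jobs that finish one per tick (ticks 1..9), leaving 6 cores free.
# A 50-core "middle" application is first in the queue, followed by
# NUM_AFTER 10-core applications that each run for 9 ticks.
#
# Usage: middle_start_tick MODE NUM_AFTER
#   MODE=greedy : later applications may use any space the middle one cannot
#   MODE=strict : (hypothetical) nothing may pass the first application
# Prints the tick at which the 50-core application starts.
middle_start_tick() {
  mode="$1"
  pending_after="$2"
  free=6
  ends="1 2 3 4 5 6 7 8 9"   # finish tick of every running 10-core job
  tick=0
  while :; do
    tick=$((tick + 1))
    # Release the cores of every job that finishes at this tick.
    still=""
    for e in $ends; do
      if [ "$e" -le "$tick" ]; then
        free=$((free + 10))
      else
        still="$still $e"
      fi
    done
    ends="$still"
    # The first application in the queue gets the first chance to schedule.
    if [ "$free" -ge 50 ]; then
      echo "$tick"
      return 0
    fi
    # Greedy mode: fill the remaining space from later applications.
    if [ "$mode" = "greedy" ]; then
      while [ "$pending_after" -gt 0 ] && [ "$free" -ge 10 ]; do
        free=$((free - 10))
        ends="$ends $((tick + 9))"
        pending_after=$((pending_after - 1))
      done
    fi
  done
}

for n in 0 3 600; do
  echo "later apps: $n greedy: $(middle_start_tick greedy "$n") strict: $(middle_start_tick strict "$n")"
done
```

In this model the greedy pass starts the 50-core application at tick n + 5 when n later applications are queued behind it, so the delay grows without bound as n grows. The strict variant starts it at tick 5 regardless of n, at the cost of leaving cores idle while the node drains.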