[ https://issues.apache.org/jira/browse/YUNIKORN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527813#comment-17527813 ]
Craig Condit commented on YUNIKORN-1185:
----------------------------------------

[~adamnovak] could you recreate this and submit scheduler logs when running in DEBUG mode? That would be helpful to see what's really going on here.

> Small applications starve large ones in the same FIFO queue
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-1185
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Adam Novak
>            Priority: Major
>
> Even when I set my queue to use a {{fifo}} application sort policy, applications that enter the queue later are able to run before applications that were submitted earlier; the queue does not behave like a first-in, first-out queue.
>
> Specifically, this happens when the later applications are smaller than the earlier ones. If enough small applications are available in the queue to immediately fill any space that opens up, they are scheduled as soon as the space becomes available. YuniKorn never waits for enough space to become free for a waiting large application, no matter how much older it is than the applications passing it in the queue.
>
> The result is that a steady supply of small applications can keep a larger application waiting indefinitely, causing starvation.
>
> The relevant code seems to be [in Queue's tryAllocate method|#L1069-L1070]. YuniKorn goes through all the applications in the queue in order and greedily schedules work items until no more fit. If no space large enough to fit any work from the first application currently exists, it will always fill what space there is with work from applications later in the queue. It will never wait to drain out space on a node to fit work from that first application. (A simplified sketch of this greedy loop is included after the replication steps below.)
>
> How can I configure or modify YuniKorn to prevent starvation, and make the applications in a queue execute in order, or at least not arbitrarily far out of order?
>
> (I already tried the {{stateaware}} queue sort, but it doesn't seem to work well with applications as small as mine. It appeared to run only one application at a time, because my applications finish so fast.)
>
> h4. Replication
>
> First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.
>
> Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:
> {code:yaml}
> embedAdmissionController: false
> configuration: |
>   partitions:
>     - name: default
>       placementrules:
>         - name: tag
>           value: namespace
>           create: true
>       queues:
>         - name: root
>           submitacl: '*'
>           childtemplate:
>             properties:
>               application.sort.policy: fifo
> {code}
>
> Then, run this script:
> {code:bash}
> #!/usr/bin/env bash
> # test-yunikorn.sh: Make sure YuniKorn prevents starvation
> set -e
>
> # Set this to annotate jobs other than the middle job
> OTHER_JOB_ANNOTATIONS=''
> # And similarly for the middle job
> MIDDLE_JOB_ANNOTATIONS=''
>
> # Where should we run?
> #NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
> NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'
>
> # How many 10-core jobs do we need to fill everywhere we will run?
> SCALE="30"
>
> # Clean up
> kubectl delete job -l app=yunikorntest || true
>
> # Make 10-core jobs that will block out our test job for at least 2 minutes.
> # Make sure they don't all finish at once.
> rm -f jobs_before.yml
> for NUM in $(seq 1 ${SCALE}) ; do
> cat >>jobs_before.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: presleep${NUM}
>   labels:
>     app: yunikorntest
>   ${OTHER_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: before-${NUM}
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep", "$(( $RANDOM % 20 + 120 ))"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> ---
> EOF
> done
>
> # How many jobs do we need to fill the cluster to compete against?
> COMPETING_JOBS="$((SCALE*20))"
>
> # And 10-core jobs that, if they all pass it, will keep it blocked out for 20 minutes.
> # We expect it really to be blocked for more like 5 to 10 minutes if the SLA plugin is working.
> rm -f jobs_after.yml
> for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
> cat >>jobs_after.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: postsleep${NUM}
>   labels:
>     app: yunikorntest
>   ${OTHER_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: after-${NUM}
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep", "$(( $RANDOM % 20 + 60 ))"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> ---
> EOF
> done
>
> # And the test job itself between them.
> rm -f job_middle.yml
> cat >job_middle.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: middle
>   labels:
>     app: yunikorntest
>   ${MIDDLE_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: middle
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep", "1"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 50000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 50000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> EOF
>
> kubectl apply -f jobs_before.yml
> sleep 10
> kubectl apply -f job_middle.yml
> sleep 10
> CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
> kubectl apply -f jobs_after.yml
>
> # Wait for it to finish
> echo "Waiting for middle job to finish..."
> COMPLETION_TIME=""
> while [[ -z "${COMPLETION_TIME}" ]] ; do
>     sleep 10
>     JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
>     if [[ "${JOB_STATE}" == "1" ]] ; then
>         COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
>     fi
> done
> echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
> {code}
>
> You will see that YuniKorn runs the vast majority of the "postsleep" jobs before allowing the "middle" job to schedule and run, even though the "middle" job was submitted to the queue first. By increasing the number of "postsleep" jobs submitted, you can starve the "middle" job for an arbitrarily long time.
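>
> h4. Sketch of the greedy loop
>
> To make the failure mode concrete, here is a minimal, self-contained sketch of the greedy behavior described above. This is *not* the actual yunikorn-core code; the {{Ask}}, {{Node}}, and {{tryAllocate}} names and the numbers are simplified stand-ins for illustration only.
> {code:go}
> // Simplified model of a FIFO queue whose allocator greedily skips
> // asks that do not currently fit, instead of reserving space for them.
> package main
>
> import "fmt"
>
> // Ask is a stand-in for one application's pending resource request.
> type Ask struct {
> 	App   string
> 	Cores int
> }
>
> // Node is a stand-in for a worker node with some free CPU capacity.
> type Node struct {
> 	Name string
> 	Free int
> }
>
> // tryAllocate walks the queue in FIFO order, but greedily: an ask that
> // does not fit on any node is simply skipped, and younger (smaller)
> // asks are still allowed to claim whatever space is free. Nothing is
> // reserved for the skipped ask, so it can be starved indefinitely.
> func tryAllocate(queue []Ask, nodes []*Node) {
> 	for _, ask := range queue {
> 		placed := false
> 		for _, node := range nodes {
> 			if node.Free >= ask.Cores {
> 				node.Free -= ask.Cores
> 				fmt.Printf("%s (%d cores) -> %s\n", ask.App, ask.Cores, node.Name)
> 				placed = true
> 				break
> 			}
> 		}
> 		if !placed {
> 			// A starvation-free scheduler would reserve space here
> 			// instead of falling through to younger applications.
> 			fmt.Printf("%s (%d cores) does not fit; trying younger apps\n", ask.App, ask.Cores)
> 		}
> 	}
> }
>
> func main() {
> 	// One node with 40 free cores; the oldest ask needs 50.
> 	nodes := []*Node{{Name: "k1.kube", Free: 40}}
> 	queue := []Ask{
> 		{App: "middle", Cores: 50}, // oldest, too big for current free space
> 		{App: "after-1", Cores: 10},
> 		{App: "after-2", Cores: 10},
> 		{App: "after-3", Cores: 10},
> 		{App: "after-4", Cores: 10},
> 	}
> 	tryAllocate(queue, nodes)
> }
> {code}
> Every scheduling pass behaves like this run: "middle" is skipped while all four younger 10-core asks are placed. As long as fresh 10-core asks keep arriving, the free space never reaches 50 cores and "middle" never runs.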