[ https://issues.apache.org/jira/browse/YUNIKORN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527813#comment-17527813 ]

Craig Condit commented on YUNIKORN-1185:
----------------------------------------

[~adamnovak] could you recreate this and submit scheduler logs from a run with 
DEBUG logging enabled? That would help show what's really going on here.
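
For reference, one way to capture those logs (assuming the Helm chart's default namespace and deployment names, which may differ in this install) would be roughly:

{code:bash}
# Assumptions: namespace "yunikorn", deployment "yunikorn-scheduler" (Helm chart defaults).
# Enable DEBUG logging for the scheduler first (see the YuniKorn docs for your version),
# then stream the logs to a file while re-running the reproduction script below.
kubectl logs -n yunikorn deployment/yunikorn-scheduler -f > yunikorn-scheduler-debug.log
{code}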



> Small applications starve large ones in the same FIFO queue
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-1185
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Adam Novak
>            Priority: Major
>
> Even when I set my queue to use a {{fifo}} application sort policy, 
> applications that enter the queue later are able to run before applications 
> that are submitted earlier; the queue does not behave like a first-in, 
> first-out queue.
> Specifically, this happens when the later applications are smaller than the 
> earlier ones. If enough small applications are available in the queue to 
> immediately fill any space that opens up, they are scheduled as soon as space 
> is available. YuniKorn doesn't wait for enough space to become free to 
> schedule the waiting large applications, no matter how much older they are 
> than the applications passing them in the queue.
> The result of this is that a steady supply of small applications can keep a 
> larger application waiting indefinitely, causing starvation.
> The relevant code seems to be [in Queue's tryAllocate method|#L1069-L1070]. 
> YuniKorn goes through all the applications in the queue in order and greedily 
> schedules work items until no more fit. If no space large enough to fit any 
> work from the first application currently exists, it will always fill what 
> space there is with work from applications later in the queue. It will never 
> wait for space on a node to drain so that work from that first application 
> can fit.
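> To make the mechanism concrete, here is a toy model of that greedy pass (plain bash, not YuniKorn code; the numbers mirror the reproduction below: a 96-core node, 10-core filler jobs, and one queued 50-core application):
> {code:bash}
> #!/usr/bin/env bash
> # Toy model of a greedy FIFO pass, NOT YuniKorn code.
> # A 96-core node already runs nine 10-core jobs (90 cores used); a 50-core
> # application waits at the head of the queue, with an endless supply of
> # 10-core applications queued behind it.
> set -e
>
> SMALL=10        # cores per small application
> LARGE=50        # cores needed by the old application at the head of the queue
> FREE=6          # 96 cores minus nine running 10-core jobs
> LARGE_PLACED=no
>
> for CYCLE in $(seq 1 10) ; do
>     # One running small job finishes and frees its cores.
>     FREE=$((FREE + SMALL))
>     # Greedy pass, oldest application first: the large ask is considered first,
>     # but if it does not fit the scheduler just moves on...
>     if [[ "${FREE}" -ge "${LARGE}" ]] ; then
>         LARGE_PLACED=yes
>         FREE=$((FREE - LARGE))
>     fi
>     # ...and immediately backfills the freed cores with later, smaller asks.
>     while [[ "${FREE}" -ge "${SMALL}" ]] ; do
>         FREE=$((FREE - SMALL))
>     done
>     echo "cycle ${CYCLE}: free cores=${FREE}, large application placed=${LARGE_PLACED}"
> done
> # Free cores never reach 50, so the large application is never placed as long
> # as small applications keep arriving.
> {code}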
> How can I configure or modify YuniKorn to prevent starvation, and make the 
> applications in a queue execute in order, or at least not arbitrarily far out 
> of order?
> (I already tried the {{stateaware}} queue sort, but it doesn't seem to work 
> well with applications as small as mine. It appeared to run only one 
> application at a time, because my applications finish so fast.)
> h4. Replication
> First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.
> Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:
>  
> {code:yaml}
> embedAdmissionController: false
> configuration: |
>   partitions:
>     - name: default
>       placementrules:
>         - name: tag
>           value: namespace
>           create: true
>       queues:
>         - name: root
>           submitacl: '*'
>           childtemplate:
>             properties:
>               application.sort.policy: fifo
> {code}
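> For completeness, a values file like this would normally be applied through the standard YuniKorn Helm chart, roughly as follows (the chart repo URL, release name, and namespace here are assumptions based on the usual install instructions, not part of this report):
> {code:bash}
> # Install YuniKorn 0.12.2 with the values.yml above (names/URL assumed; adjust as needed).
> helm repo add yunikorn https://apache.github.io/yunikorn-release
> helm repo update
> helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace \
>   --version 0.12.2 -f values.yml
> {code}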
> Then, run this script:
> {code:bash}
> #!/usr/bin/env bash
> # test-yunikorn.sh: Make sure YuniKorn prevents starvation
> set -e
>
> # Set this to annotate jobs other than the middle job
> OTHER_JOB_ANNOTATIONS=''
> # And similarly for the middle job
> MIDDLE_JOB_ANNOTATIONS=''
>
> # Where should we run?
> #NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
> NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'
>
> # How many 10-core jobs do we need to fill everywhere we will run?
> SCALE="30"
>
> # Clean up
> kubectl delete job -l app=yunikorntest || true
>
> # Make 10 core jobs that will block out our test job for at least 2 minutes
> # Make sure they don't all finish at once.
> rm -f jobs_before.yml
> for NUM in $(seq 1 ${SCALE}) ; do
> cat >>jobs_before.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: presleep${NUM}
>   labels:
>     app: yunikorntest
>   ${OTHER_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: before-${NUM}
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> ---
> EOF
> done
>
> # How many jobs do we need to fill the cluster to compete against?
> COMPETING_JOBS="$((SCALE*20))"
>
> # And 10 core jobs that, if they all pass it, will keep it blocked out for 20 minutes
> # We expect it really to be blocked like 5-7-10 minutes if the SLA plugin is working.
> rm -f jobs_after.yml
> for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
> cat >>jobs_after.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: postsleep${NUM}
>   labels:
>     app: yunikorntest
>   ${OTHER_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: after-${NUM}
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 10000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> ---
> EOF
> done
>
> # And the test job itself between them.
> rm -f job_middle.yml
> cat >job_middle.yml <<EOF
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: middle
>   labels:
>     app: yunikorntest
>   ${MIDDLE_JOB_ANNOTATIONS}
> spec:
>   template:
>     metadata:
>       labels:
>         app: yunikorntest
>         applicationId: middle
>     spec:
>       schedulerName: yunikorn
>       ${NODE_SELECTOR}
>       containers:
>       - name: main
>         image: ubuntu:20.04
>         command: ["sleep", "1"]
>         resources:
>           limits:
>             memory: 300M
>             cpu: 50000m
>             ephemeral-storage: 1G
>           requests:
>             memory: 300M
>             cpu: 50000m
>             ephemeral-storage: 1G
>       restartPolicy: Never
>   ttlSecondsAfterFinished: 1000
> EOF
>
> kubectl apply -f jobs_before.yml
> sleep 10
> kubectl apply -f job_middle.yml
> sleep 10
> CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
> kubectl apply -f jobs_after.yml
> # Wait for it to finish
> echo "Waiting for middle job to finish..."
> COMPLETION_TIME=""
> while [[ -z "${COMPLETION_TIME}" ]] ; do
>     sleep 10
>     JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
>     if [[ "${JOB_STATE}" == "1" ]] ; then
>         COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
>     fi
> done
> echo "Test large job was created at ${CREATION_TIME} and completed at 
> ${COMPLETION_TIME}"
> {code}
> You will see that YuniKorn runs the vast majority of the "postsleep" jobs 
> before allowing the "middle" job to schedule and run, even though the 
> "middle" job was submitted to the queue first. By increasing the number of 
> "postsleep" jobs submitted, you can starve the "middle" job for an 
> arbitrarily long time.
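> One way to see the ordering directly after the script finishes is to list each job's creation and completion times (a plain kubectl query; the column names are arbitrary):
> {code:bash}
> # The "middle" job's completion time will trail most of the later-submitted "postsleep" jobs.
> kubectl get jobs -l app=yunikorntest \
>   -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,COMPLETED:.status.completionTime
> {code}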



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
