[jira] [Created] (YUNIKORN-1185) Small applications starve large ones in the same FIFO queue

Adam Novak (Jira) Mon, 25 Apr 2022 14:18:05 -0700

Adam Novak created YUNIKORN-1185:
------------------------------------

             Summary: Small applications starve large ones in the same FIFO 
queue
                 Key: YUNIKORN-1185
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1185
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Adam Novak



Even when I set my queue to use a {{fifo}} application sort policy, 
applications that enter the queue later are able to run before applications 
that are submitted earlier; the queue does not behave like a first-in, 
first-out queue.

Specifically, this happens when the later applications are smaller than the 
earlier ones. If enough small applications are available in the queue to 
immediately fill any space that opens up, they will be scheduled as soon as 
space is available. YuniKorn doesn't wait for enough space to become free to 
schedule waiting large applications, no matter how much older they are than 
the applications that are passing them in the queue.

The result of this is that a steady supply of small applications can keep a 
larger application waiting indefinitely, causing starvation.

The relevant code seems to be 
[in Queue's tryAllocate method|https://github.com/apache/yunikorn-core/blob/73d55282f052f53852cc156d626c155ca5dddca2/pkg/scheduler/objects/queue.go#L1069-L1070].
YuniKorn goes through all the applications in the queue in order, and greedily 
schedules work items until no more fit. If no space large enough to fit any 
work from the first application currently exists, it will always fill what 
space there is with work from applications later in the queue. It will never 
wait to drain out space on a node to fit work from that first application.
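
The behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not YuniKorn's actual code: it assumes a single 96-core node, 10-core small applications that each run for three ticks, and one 50-core application at the head of the queue. The function name {{simulate}}, the tick-based clock, and the {{reserve}} mode (start nothing behind the head of the queue until it fits) are all hypothetical and exist only for comparison.

```shell
#!/usr/bin/env bash
# Toy model of one scheduling pass per tick on a single node.
# All names and numbers here are illustrative, not taken from YuniKorn.

CAPACITY=96   # cores on the hypothetical node
SMALL=10      # cores requested by each small application
LARGE=50      # cores requested by the large application at the queue head
RUNTIME=3     # ticks a small application runs before freeing its cores

# simulate SUPPLY MAX_TICKS [greedy|reserve]
#   SUPPLY:  number of small applications queued behind the large one.
#   greedy:  fill any free space with later small applications.
#   reserve: start nothing behind the large application until it fits.
simulate() {
    local supply="$1" max_ticks="$2" mode="${3:-greedy}"
    local used=0 tick i end
    local -a ends=() running=()

    # Pre-fill the node with small applications that finish at staggered ticks.
    for (( i = 0; used + SMALL <= CAPACITY; i++ )); do
        ends+=( $(( i % RUNTIME + 1 )) )
        used=$(( used + SMALL ))
    done

    for (( tick = 1; tick <= max_ticks; tick++ )); do
        # Free the cores of small applications that have finished.
        running=()
        for end in "${ends[@]}"; do
            if (( end > tick )); then
                running+=( "$end" )
            else
                used=$(( used - SMALL ))
            fi
        done
        ends=( "${running[@]}" )

        # The oldest application is tried first, as in a FIFO sort.
        if (( used + LARGE <= CAPACITY )); then
            echo "started at tick ${tick}"
            return 0
        fi

        # Greedy mode hands the leftover space to younger applications.
        if [[ "${mode}" == "greedy" ]]; then
            while (( supply > 0 && used + SMALL <= CAPACITY )); do
                ends+=( $(( tick + RUNTIME )) )
                used=$(( used + SMALL ))
                supply=$(( supply - 1 ))
            done
        fi
    done
    echo "still waiting after ${max_ticks} ticks"
}

simulate 30 100 greedy
simulate 30 100 reserve
simulate 600 100 greedy
```

In this model the greedy pass delays the large application in proportion to the number of small applications behind it (tick 12 with 30 of them, and no start within 100 ticks with 600), while the hypothetical reserve mode starts it at tick 2 regardless of the backlog.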

How can I configure or modify YuniKorn to prevent starvation, and make the 
applications in a queue execute in order, or at least not arbitrarily far out 
of order?

(I already tried the {{stateaware}} queue sort, but it doesn't seem to work 
well with applications as small as mine. It appeared to run only one 
application at a time, because my applications finish so fast.)
h4. Replication

First, have a Kubernetes cluster with a node {{k1.kube}} with 96 cores.

Next, set up YuniKorn 0.12.2 with this {{values.yml}} for the Helm chart:

 
{code:yaml}
embedAdmissionController: false
configuration: |
  partitions:
    -
      name: default
      placementrules:
        - name: tag
          value: namespace
          create: true
      queues:
        - name: root
          submitacl: '*'
          childtemplate:
            properties:
              application.sort.policy: fifo
{code}
Then, run this script:
{code:bash}
#!/usr/bin/env bash
# test-yunikorn.sh: Make sure YuniKorn prevents starvation
set -e

# Set this to annotate jobs other than the middle job
OTHER_JOB_ANNOTATIONS=''
# And similarly for the middle job
MIDDLE_JOB_ANNOTATIONS=''

# Where should we run?
#NODE_SELECTOR='nodeSelector: {"kubernetes.io/hostname": "k1.kube"}'
NODE_SELECTOR='affinity: {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [{"matchExpressions": [{"key": "kubernetes.io/hostname", "operator": "In", "values": ["k1.kube", "k2.kube", "k3.kube"]}]}]}}}'

# How many 10-core jobs do we need to fill everywhere we will run?
SCALE="30"

# Clean up
kubectl delete job -l app=yunikorntest || true

# Make 10 core jobs that will block out our test job for at least 2 minutes
# Make sure they don't all finish at once.
rm -f jobs_before.yml
for NUM in $(seq 1 ${SCALE}) ; do
cat >>jobs_before.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: presleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: before-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 120 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# How many jobs do we need to fill the cluster to compete against?
COMPETING_JOBS="$((SCALE*20))"

# And 10 core jobs that, if they all pass it, will keep it blocked out for
# 20 minutes. We expect it really to be blocked like 5-7-10 minutes if the
# SLA plugin is working.
rm -f jobs_after.yml
for NUM in $(seq 1 ${COMPETING_JOBS}) ; do
cat >>jobs_after.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: postsleep${NUM}
  labels:
    app: yunikorntest
  ${OTHER_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: after-${NUM}
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep",  "$(( $RANDOM % 20 + 60 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
---
EOF
done

# And the test job itself between them.
rm -f job_middle.yml
cat >job_middle.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: middle
  labels:
    app: yunikorntest
  ${MIDDLE_JOB_ANNOTATIONS}
spec:
  template:
    metadata:
      labels:
        app: yunikorntest
        applicationId: middle
    spec:
      schedulerName: yunikorn
      ${NODE_SELECTOR}
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "1"]
        resources:
          limits:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
      restartPolicy: Never
  ttlSecondsAfterFinished: 1000
EOF

kubectl apply -f jobs_before.yml
sleep 10
kubectl apply -f job_middle.yml
sleep 10
CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
kubectl apply -f jobs_after.yml
# Wait for it to finish
echo "Waiting for middle job to finish..."
COMPLETION_TIME=""
while [[ -z "${COMPLETION_TIME}" ]] ; do
    sleep 10
    JOB_STATE="$(kubectl get job middle -o jsonpath='{.status.succeeded}' || true)"
    if [[ "${JOB_STATE}" == "1" ]] ; then
        COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}' || true)"
    fi
done
echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
{code}
You will see that YuniKorn will run the vast majority of the "postsleep" jobs 
before allowing the "middle" job to schedule and run, even though the "middle" 
job was submitted to the queue first. By increasing the number of "postsleep" 
jobs submitted, you can starve the "middle" job for an arbitrarily long amount 
of time.
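
To put a number on the delay, the two timestamps printed by the script can be subtracted. The helper below is a small sketch that assumes GNU {{date}} (for the {{-d}} option) and RFC 3339 timestamps in the form Kubernetes reports them. The function name {{elapsed_seconds}} and the example timestamps are made up for illustration.

```shell
#!/usr/bin/env bash
# elapsed_seconds START END
# Prints the number of seconds between two RFC 3339 timestamps.
# Assumes GNU date; BSD/macOS date does not accept -d in this form.
elapsed_seconds() {
    local start end
    start="$(date -u -d "$1" +%s)"
    end="$(date -u -d "$2" +%s)"
    echo $(( end - start ))
}

# Example with made-up timestamps: 12 minutes and 30 seconds apart.
elapsed_seconds "2022-04-25T21:00:00Z" "2022-04-25T21:12:30Z"
```

In the test script this could be called as {{elapsed_seconds "${CREATION_TIME}" "${COMPLETION_TIME}"}} after the wait loop finishes.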

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
