Ayub Pathan created YUNIKORN-516:
------------------------------------

             Summary: Yunikorn scheduler seems to be in deadlock state
                 Key: YUNIKORN-516
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-516
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Ayub Pathan

Apply the job templates below to reproduce the issue.

1. First application with gang scheduling annotations
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-1
spec:
  completions: 2
  parallelism: 2
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-1"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}

2. Second application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-2
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-2"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}

3. Third application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-3
spec:
  completions: 10
  parallelism: 10
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-3"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 3,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}

The third application now stays in the Pending state even though the placeholders were created and terminated.
{noformat}
NAME↑                    READY  STATUS     RS  CPU  MEM  %CPU/R  %MEM/R  %CPU/L  %MEM/L  IP               NODE                                             QOS  AGE
batch-sleep-job-1-7lrd5  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.208  ip-10-192-143-108.ca-central-1.compute.internal  BU   18m
batch-sleep-job-1-lw4t9  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.213  ip-10-192-136-201.ca-central-1.compute.internal  BU   18m
batch-sleep-job-2-c95dg  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.210  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-vnfjb  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.211  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-x4mcz  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.216  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-ztnfq  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.217  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-3-7tp5t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-59mnj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-bm4fd  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-c4mxg  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-cljfj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gcvnp  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gwgnn  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-kj88t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-p8c7w  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-td575  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
{noformat}

Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for reference.

This is observed with the v0.10 build.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org