Ayub Pathan created YUNIKORN-516:
------------------------------------

             Summary: Yunikorn scheduler seems to be in deadlock state
                 Key: YUNIKORN-516
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-516
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Ayub Pathan


Apply the job templates below to reproduce the issue.

1. First application with gang scheduling annotations
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-1
spec:
  completions: 2
  parallelism: 2
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-1"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
 

2. Second application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-2
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-2"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
 

3. Third application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-3
spec:
  completions: 10
  parallelism: 10
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-3"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 3,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
The third application stays in Pending state even though the placeholders were
created and terminated.
{noformat}
NAME↑                    READY  STATUS     RS  CPU  MEM  %CPU/R  %MEM/R  %CPU/L  %MEM/L  IP               NODE                                             QOS  AGE
batch-sleep-job-1-7lrd5  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.208  ip-10-192-143-108.ca-central-1.compute.internal  BU   18m
batch-sleep-job-1-lw4t9  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.213  ip-10-192-136-201.ca-central-1.compute.internal  BU   18m
batch-sleep-job-2-c95dg  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.210  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-vnfjb  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.211  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-x4mcz  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.216  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-ztnfq  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.217  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-3-7tp5t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-59mnj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-bm4fd  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-c4mxg  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-cljfj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gcvnp  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gwgnn  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-kj88t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-p8c7w  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-td575  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
{noformat}
Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for
reference. This was observed with the v0.10 build.
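
As a rough aid for reading the three templates above, the sketch below tallies the resources that each application's task-group annotation asks placeholders to reserve. It assumes one placeholder per minMember, each sized at minResource. The helper names (cpu_millis, memory_megabytes, placeholder_demand) are hypothetical and not part of YuniKorn, and only the "m" CPU suffix and "M" memory suffix that appear in this report are handled.

```python
import json

# Task-group annotation values copied from the three job templates above
# (the empty nodeSelector/tolerations fields are omitted for brevity).
TASK_GROUPS = {
    "batch-sleep-job-1": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-2": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-3": '[{"name": "tg1", "minMember": 3, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
}


def cpu_millis(quantity):
    """Convert a CPU quantity such as "100m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_megabytes(quantity):
    """Convert a memory quantity with the decimal "M" suffix to megabytes."""
    if not quantity.endswith("M"):
        raise ValueError("only the 'M' suffix is handled in this sketch")
    return int(quantity[:-1])


def placeholder_demand(annotation):
    """Sum minMember * minResource over all task groups in one annotation.

    Assumption: one placeholder per minMember, each sized at minResource.
    """
    total_cpu = 0
    total_memory = 0
    for group in json.loads(annotation):
        members = group["minMember"]
        resource = group["minResource"]
        total_cpu += members * cpu_millis(resource["cpu"])
        total_memory += members * memory_megabytes(resource["memory"])
    return {"cpu_millis": total_cpu, "memory_M": total_memory}


if __name__ == "__main__":
    for job, annotation in TASK_GROUPS.items():
        print(job, placeholder_demand(annotation))
```

Under that assumption the third application asks for 3 x 100m CPU and 3 x 500M memory in placeholders, while its job spec requests 10 pods in parallel.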



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
