Noah Yoshida created YUNIKORN-2521:
--------------------------------------

             Summary: Scheduler deadlock on EKS
                 Key: YUNIKORN-2521
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
             Project: Apache YuniKorn
          Issue Type: Bug
    Affects Versions: 1.5.0
         Environment: YuniKorn: 1.5
AWS EKS: v1.28.6-eks-508b6b3

            Reporter: Noah Yoshida
         Attachments: goroutine-dump.txt

Discussion on the YuniKorn Slack: 
[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]



Occasionally, YuniKorn will deadlock and prevent any new pods from starting. 
All pods stay in Pending. There are no error logs in the YuniKorn scheduler 
indicating any issue. 

Additionally, the pods all have the correct annotations and labels from the 
admission controller, so they are at least being admitted into Kubernetes 
correctly. 

The issue was seen intermittently on YuniKorn 1.5 in EKS, on Kubernetes 
version `v1.28.6-eks-508b6b3`. 

In our environment we run about 25-50 nodes and 200-400 pods. Pods and nodes 
are added and removed fairly frequently, since we run ML workloads. 

Attached is the goroutine dump. We were not able to get a state dump, as the 
endpoint kept timing out. 
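
For reference, this is roughly how we pulled the dumps. A minimal sketch, 
assuming the scheduler exposes Go's standard pprof handlers and the 
`/ws/v1/fullstatedump` REST endpoint on port 9080 (reachable locally via 
`kubectl port-forward`); the port and paths matched our deployment and may 
differ in yours:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// fetch GETs a debug endpoint and writes the response body to a file.
func fetch(url, out string) error {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://localhost:9080" // assumes `kubectl port-forward <scheduler-pod> 9080:9080`

	// Full goroutine stacks from Go's standard pprof handler.
	if err := fetch(base+"/debug/pprof/goroutine?debug=2", "goroutine-dump.txt"); err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump:", err)
	}

	// YuniKorn state dump endpoint; this is the one that kept timing out for us.
	if err := fetch(base+"/ws/v1/fullstatedump", "statedump.json"); err != nil {
		fmt.Fprintln(os.Stderr, "state dump:", err)
	}
}
```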

You can work around it by restarting the YuniKorn scheduler pod. Sometimes 
you also have to delete any Pending pods that got stuck while the scheduler 
was deadlocked, so that they get picked up by the new scheduler pod. 
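
If there are a lot of stuck pods, something like the following client-go 
sketch can clean them up after the scheduler restart. The `default` namespace 
is an assumption, as is the idea that every Pending pod there is 
controller-owned and safe to delete; adjust the filter for your cluster:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "default" // assumption: namespace of the stuck workloads

	// Find pods still stuck in Pending after the scheduler restart.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}

	// Delete them so the new scheduler pod schedules fresh copies
	// (their owners, e.g. Deployments or Jobs, recreate them).
	for _, p := range pods.Items {
		if err := client.CoreV1().Pods(ns).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete", p.Name, ":", err)
		} else {
			fmt.Println("deleted", p.Name)
		}
	}
}
```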


