[ https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848180#comment-17848180 ]
Peter Bacsko commented on YUNIKORN-2521:
----------------------------------------

It might even be two separate incidents: goroutine-4-3.out and goroutine-4-3-3.out don't contain {{registerNodes()}}, so at least it's partially solved. Anyway, I'm linking YUNIKORN-2629.

> Scheduler deadlock
> ------------------
>
>                 Key: YUNIKORN-2521
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
>             Project: Apache YuniKorn
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>         Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>            Reporter: Noah Yoshida
>            Assignee: Peter Bacsko
>            Priority: Critical
>             Fix For: 1.6.0, 1.5.1
>
>         Attachments: 0001-YUNIKORN-2539-core.patch, 0002-YUNIKORN-2539-k8shim.patch, 4_4_goroutine-1.txt, 4_4_goroutine-2.txt, 4_4_goroutine-3.txt, 4_4_goroutine-4.txt, 4_4_goroutine-5-state-dump.txt, 4_4_profile001.png, 4_4_profile002.png, 4_4_profile003.png, 4_4_scheduler-logs.txt, deadlock_2024-04-18.log, goroutine-4-3-1.out, goroutine-4-3-2.out, goroutine-4-3-3.out, goroutine-4-3.out, goroutine-4-5.out, goroutine-dump.txt, goroutine-while-blocking-2.out, goroutine-while-blocking.out, logs-potential-deadlock-2.txt, logs-potential-deadlock.txt, logs-splunk-ordered.txt, logs-splunk.txt, profile001-4-5.gif, profile012.gif, profile013.gif, running-logs-2.txt, running-logs.txt
>
>
> Discussion on YuniKorn Slack: [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
>
> Occasionally, YuniKorn will deadlock and prevent any new pods from starting. All pods stay in Pending. There are no error logs inside the YuniKorn scheduler indicating any issue.
>
> Additionally, the pods all have the correct annotations / labels from the admission service, so they are at least getting put into k8s correctly.
>
> The issue was seen intermittently on YuniKorn version 1.5 in EKS, using version `v1.28.6-eks-508b6b3`.
>
> In our case, we run about 25-50 nodes and 200-400 pods. Pods and nodes are added and removed fairly frequently, as we run ML workloads.
>
> Attached is the goroutine dump. We were not able to get a state dump as the endpoint kept timing out.
>
> You can fix it by restarting the YuniKorn scheduler pod. Sometimes you also have to delete any "Pending" pods that got stuck while the scheduler was deadlocked, so that they get picked up by the new scheduler pod.
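For reference, the attached goroutine-*.out files are full per-goroutine stack traces of the kind served by Go's standard net/http/pprof handler. Below is a minimal sketch of pulling such a dump over HTTP; it assumes pprof is exposed by the scheduler and reachable locally (for example via kubectl port-forward). The localhost:9080 address is a placeholder, not a confirmed YuniKorn default.

{code:go}
// Minimal sketch (not from the issue): fetch a full goroutine dump from a
// pprof endpoint. The address below is a placeholder; point it at wherever
// the scheduler's pprof handlers are reachable in your deployment.
package main

import (
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// debug=2 returns one full stack trace per goroutine, which is the
	// format of the goroutine-*.out attachments on this issue.
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get("http://localhost:9080/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine.out")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
{code}

A short client timeout also makes it obvious when the endpoint itself hangs, which matches the "state dump endpoint kept timing out" symptom described above.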
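The workaround described in the issue (restart the scheduler pod, then delete any stuck Pending pods) can also be scripted. The following is only an illustrative client-go sketch, not something attached to this issue; the schedulerName value "yunikorn", the all-namespaces listing, and the kubeconfig location are assumptions to adjust for your cluster.

{code:go}
// Illustrative sketch only (not from the issue): after restarting the
// scheduler pod, delete Pending pods that are still stuck so the new
// scheduler instance picks them up again. The schedulerName "yunikorn"
// and the all-namespaces listing are assumptions; adjust for your cluster.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// List Pending pods across all namespaces ("" = all).
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		// Only touch pods that were meant to be scheduled by YuniKorn.
		if p.Spec.SchedulerName != "yunikorn" {
			continue
		}
		fmt.Printf("deleting stuck pod %s/%s\n", p.Namespace, p.Name)
		if err := clientset.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("failed to delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
}
{code}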