[ 
https://issues.apache.org/jira/browse/YUNIKORN-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833467#comment-17833467
 ] 

Peter Bacsko commented on YUNIKORN-2521:
----------------------------------------

I looked at the goroutine dump and it looks odd to me. Maybe I'm missing 
something... For example, I don't know whether it matters, but these are the 
memory addresses passed to an internal function:
{noformat}
sync.runtime_SemacquireRWMutexR(0xffffffffffffffff?, 0x7?, 0xc00a2fa8a0?)
sync.runtime_SemacquireRWMutexR(0x20?, 0xc0?, 0x24?)
sync.runtime_SemacquireRWMutexR(0xc004146f30?, 0x48?, 0x18?)
sync.runtime_SemacquireRWMutexR(0x6aa100000?, 0x90?, 0xc00a703f40?)
{noformat}

Not sure what 0xffffffffffffffff is about or whether that's even normal (on the 
other hand, 0x20 is a very low address), but I don't want to go down that 
rabbit hole just yet.
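
For reference, sync.runtime_SemacquireRWMutexR is just the frame you get when a 
goroutine is parked waiting for an RWMutex read lock (with a recent Go 
toolchain; older versions name the frame differently). A toy example, not 
YuniKorn code, just to show where that frame comes from: a goroutine that 
already holds a read lock and takes a second one while a writer is queued 
behind it deadlocks, and its stack in the dump ends in 
sync.runtime_SemacquireRWMutexR:
{noformat}
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // first read lock, never released

	go func() {
		mu.Lock() // writer queues up behind the read lock above
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer queue up

	// The second RLock on this goroutine blocks behind the pending writer,
	// the writer blocks behind the first RLock -> deadlock. The runtime
	// aborts with "all goroutines are asleep - deadlock!" and prints a
	// dump where this goroutine is parked in sync.runtime_SemacquireRWMutexR.
	mu.RLock()
}
{noformat}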

I think we need multiple goroutine dumps (maybe 3-4) taken at regular 
intervals to see exactly what is stuck. It could be that we captured a moment 
that is perfectly normal.
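
Something along these lines could automate that (just a sketch; the host/port 
and pprof path are assumptions about the deployment, adjust to wherever the 
scheduler's pprof handler is actually exposed):
{noformat}
package main

// Sketch: capture a few goroutine dumps at fixed intervals so we can
// compare which goroutines stay parked in the same place across dumps.
// The URL below is a placeholder, not a confirmed YuniKorn default.

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://localhost:9080/debug/pprof/goroutine?debug=2" // placeholder
	client := &http.Client{Timeout: 30 * time.Second}

	for i := 1; i <= 4; i++ {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "dump", i, "failed:", err)
		} else {
			out, _ := os.Create(fmt.Sprintf("goroutine-dump-%d.txt", i))
			io.Copy(out, resp.Body)
			resp.Body.Close()
			out.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
{noformat}
Goroutines whose stacks don't move between dumps are the interesting ones.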

[~nyoshida] you mentioned that the state dump never returns. It would be very 
useful to request it again and take a goroutine dump while that request is 
blocking. That could also reveal something.
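
Something like this could capture both at once (again only a sketch; the 
/ws/v1/fullstatedump path is what I'd expect from the REST API, and the port 
is a placeholder):
{noformat}
package main

// Sketch: request the state dump and, while that request is (presumably)
// still blocked, capture a goroutine dump so we can see what the state dump
// handler is waiting on. URLs/ports are assumptions, not verified defaults.

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func fetch(url, file string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(file)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Kick off the (possibly hanging) state dump request in the background.
	stateErr := make(chan error, 1)
	go func() {
		stateErr <- fetch("http://localhost:9080/ws/v1/fullstatedump",
			"statedump.json", 5*time.Minute)
	}()

	// Give it time to block, then take a goroutine dump while it hangs.
	time.Sleep(15 * time.Second)
	if err := fetch("http://localhost:9080/debug/pprof/goroutine?debug=2",
		"goroutine-while-statedump.txt", time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
	}

	fmt.Println("state dump result:", <-stateErr)
}
{noformat}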

> Scheduler deadlock
> ------------------
>
>                 Key: YUNIKORN-2521
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2521
>             Project: Apache YuniKorn
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>         Environment: Yunikorn: 1.5
> AWS EKS: v1.28.6-eks-508b6b3
>            Reporter: Noah Yoshida
>            Priority: Major
>         Attachments: goroutine-dump.txt
>
>
> Discussion on Yunikorn slack: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179]
> Occasionally, Yunikorn will deadlock and prevent any new pods from starting. 
> All pods stay in Pending. There are no error logs inside of the Yunikorn 
> scheduler indicating any issue. 
> Additionally, the pods all have the correct annotations / labels from the 
> admission service, so they are at least getting put into k8s correctly. 
> The issue was seen intermittently on Yunikorn version 1.5 in EKS, using 
> version `v1.28.6-eks-508b6b3`. 
> At least for me, we run about 25-50 nodes and 200-400 pods. Pods and nodes 
> are added and removed pretty frequently as we do ML workloads. 
> Attached is the goroutine dump. We were not able to get a statedump as the 
> endpoint kept timing out. 
> You can fix it by restarting the Yunikorn scheduler pod. Sometimes you also 
> have to delete any "Pending" pods that got stuck while the scheduler was 
> deadlocked, so they get picked up by the new scheduler pod. 


