Chia-Ping Tsai created YUNIKORN-793:
---------------------------------------
Summary: fix deadlock caused by listing queues with scheduling pending pods
Key: YUNIKORN-793
URL: https://issues.apache.org/jira/browse/YUNIKORN-793
Project: Apache YuniKorn
Issue Type: Bug
Reporter: Chia-Ping Tsai

`GetPartitionQueues` acquires the read lock multiple times while already holding it. If another thread is waiting for the write lock, that pending writer blocks all new read-lock acquisitions. In short, the following execution order can cause a deadlock:

1. thread 0 holds the read lock
2. thread 1 waits for the write lock (blocked by thread 0)
3. thread 0 tries to acquire the read lock again (blocked by thread 1)

See the docs for more details ([https://pkg.go.dev/sync#RWMutex]).

The pprof output is shown below.

{noformat}
1 @ 0x43ada5 0x44ca85 0x44ca6e 0x46de27 0x47ce25 0x47e590 0x47e522 0x9c5753 0x9ad231 0x9fbcfa 0x9e4986 0x9e4851 0x9de655 0x9ff792 0x471c61
# 0x46de26 sync.runtime_SemacquireMutex+0x46 /Users/chia7712/Library/go/default/src/runtime/sema.go:71
# 0x47ce24 sync.(*Mutex).lockSlow+0x104 /Users/chia7712/Library/go/default/src/sync/mutex.go:138
# 0x47e58f sync.(*Mutex).Lock+0x8f /Users/chia7712/Library/go/default/src/sync/mutex.go:81
# 0x47e521 sync.(*RWMutex).Lock+0x21 /Users/chia7712/Library/go/default/src/sync/rwmutex.go:111
# 0x9c5752 github.com/apache/incubator-yunikorn-core/pkg/scheduler/objects.(*Queue).incPendingResource+0x52 /Users/chia7712/go/pkg/mod/github.com/chia7712/incubator-yunikorn-core@v0.0.0-20210811001640-eaa6afb10b62/pkg/scheduler/objects/queue.go:454

1 @ 0x43ada5 0x44ca85 0x44ca6e 0x46de27 0x9c7eae 0x9c7e34 0x9c51f8 0x9c54ab 0xa59e45 0xa59e94 0x711024 0xa5b310 0x711024 0xa011b3 0x7145e3 0x70fb0d 0x471c61
# 0x46de26 sync.runtime_SemacquireMutex+0x46 /Users/chia7712/Library/go/default/src/runtime/sema.go:71
# 0x9c7ead sync.(*RWMutex).RLock+0xad /Users/chia7712/Library/go/default/src/sync/rwmutex.go:63
# 0x9c7e33 github.com/apache/incubator-yunikorn-core/pkg/scheduler/objects.(*Queue).IsLeafQueue+0x33 /Users/chia7712/go/pkg/mod/github.com/chia7712/incubator-yunikorn-core@v0.0.0-20210811001640-eaa6afb10b62/pkg/scheduler/objects/queue.go:667
# 0x9c51f7 github.com/apache/incubator-yunikorn-core/pkg/scheduler/objects.(*Queue).GetPartitionQueues+0x1f7 /Users/chia7712/go/pkg/mod/github.com/chia7712/incubator-yunikorn-core@v0.0.0-20210811001640-eaa6afb10b62/pkg/scheduler/objects/queue.go:426
# 0x9c54aa github.com/apache/incubator-yunikorn-core/pkg/scheduler/objects.(*Queue).GetPartitionQueues+0x4aa /Users/chia7712/go/pkg/mod/github.com/chia7712/incubator-yunikorn-core@v0.0.0-20210811001640-eaa6afb10b62/pkg/scheduler/objects/queue.go:416
{noformat}
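
The three-step lock ordering described in this issue can be reproduced in isolation. The following is a minimal sketch, not code from yunikorn-core: the function name `nestedReadBlocked` is hypothetical, and it assumes Go 1.18 or later so that `TryRLock` can be used to observe the blocked state without actually hanging. Replacing the `TryRLock` probe with a plain `RLock` would deadlock in the way the issue describes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nestedReadBlocked (hypothetical name) reports whether a second read-lock
// attempt is refused while the first read lock is still held and a writer
// is waiting in between.
func nestedReadBlocked() bool {
	var mu sync.RWMutex

	// Step 1: "thread 0" holds the read lock.
	mu.RLock()

	// Step 2: "thread 1" waits for the write lock; it cannot proceed
	// until the read lock taken above is released.
	done := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(done)
	}()

	// Step 3: "thread 0" tries to take the read lock again. Poll with
	// TryRLock until the writer is pending; from then on new readers
	// are refused. A blocking RLock here would never return.
	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if mu.TryRLock() {
			// The writer has not announced itself yet; undo and retry.
			mu.RUnlock()
			time.Sleep(time.Millisecond)
			continue
		}
		blocked = true
		break
	}

	// Release the outer read lock so the writer can finish.
	mu.RUnlock()
	<-done
	return blocked
}

func main() {
	fmt.Println("nested read lock blocked by pending writer:", nestedReadBlocked())
}
```

One common way to avoid this pattern is to acquire the read lock once at the outermost call and have inner helpers assume it is already held. Whether that matches the eventual fix for this issue is not stated here.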