[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864000#comment-17864000 ]

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-------------------------------------------------

[~jshmchenxi] The latest stack trace you attached shows no deadlock, and not 
even any locking, inside the core or shim code. You are hitting a different 
issue, unrelated to deadlocks. Please open a new Jira for it.

There are 18 occurrences of calls into the semaphore (locking) code:
 * 9 from K8s shared informers waiting for object updates to come from K8s
 * 9 from K8s network data readers

Those are expected: if no data is being transmitted or processed, the K8s 
informers should just sit there and wait.

No other code is holding or waiting on any locks. Looking at the YuniKorn code 
references in the stack trace, I see an idle scheduler: nothing is being 
processed on the K8shim side, which is sleeping while waiting for changes, and 
the core side is not scheduling either, it is sleeping as well.

There is one goroutine that jumps out to me:

{code:go}
goroutine 19661710185 [IO wait]
...
created by golang.org/x/net/http2.(*ClientConn).goRun in goroutine 19661710184 
golang.org/x/net@v0.23.0/http2/transport.go:369 +0x2d
{code}
The goroutine referenced in the "created by" line does not exist in the dump. 
I am not sure whether that just means it still needs to time out or whether 
something else is going on, but it is not the deadlock tracked in this Jira.
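
For what it is worth, a quick way to confirm that is to scan the dump for "created by ... in goroutine N" lines whose creator goroutine no longer appears. Below is a hypothetical standalone helper, not part of YuniKorn; the regexes, file argument and output format are my own assumptions about the plain text goroutine dump format:

{code:go}
// Hypothetical helper, not part of YuniKorn: the regexes and output format
// are assumptions for illustration only.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: orphans <goroutine-dump.txt>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	headerRE := regexp.MustCompile(`^goroutine (\d+) \[`)
	creatorRE := regexp.MustCompile(`^created by (\S+) in goroutine (\d+)`)

	live := map[string]bool{}          // goroutine IDs that appear in the dump
	createdBy := map[string][]string{} // creator goroutine ID -> functions it started

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // individual stack lines can be long
	for scanner.Scan() {
		line := scanner.Text()
		if m := headerRE.FindStringSubmatch(line); m != nil {
			live[m[1]] = true
		} else if m := creatorRE.FindStringSubmatch(line); m != nil {
			createdBy[m[2]] = append(createdBy[m[2]], m[1])
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Report "created by" references whose creator is no longer in the dump.
	for id, fns := range createdBy {
		if !live[id] {
			fmt.Printf("creator goroutine %s is gone; it started: %v\n", id, fns)
		}
	}
}
{code}

A missing creator is not an error in itself, the parent goroutine may simply have exited already, so this only narrows down where to look.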

 

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.2
>
>         Attachments: updateNode_deadlock_trace.txt, 
> yunikorn-scheduler-20240627.log, yunikorn_stuck_stack_20240708.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>       dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>               nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>               if !ok {
>                       return
>               }
>               [...] removed for clarity
>               wg.Done()
>       })
>       defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>       if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>               Nodes: nodesToRegister,
>               RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>       }); err != nil {
>               log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>               return nil, err
>       }
>       // wait for all responses to accumulate
>       wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, the dispatcher will try to retrieve the event 
> handler, which is returned from Context:
> {noformat}
> go func() {
>       for {
>               select {
>               case event := <-getDispatcher().eventChan:
>                       switch v := event.(type) {
>                       case events.TaskEvent:
>                               getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>                       case events.ApplicationEvent:
>                               getEventHandler(EventTypeApp)(v)
>                       case events.SchedulerNodeEvent:
>                               getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.
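
To make the cycle easier to see outside the shim, here is a minimal, self-contained Go sketch of the same lock-while-waiting pattern; the context, dispatch and addNode names are illustrative stand-ins, not the real shim types. addNode() holds the write lock while wg.Wait()-ing for the event, but the event loop can only deliver that event through a handler lookup that needs the same lock:

{code:go}
// Minimal sketch only: context, dispatch and addNode are illustrative
// stand-ins, not the real shim types.
package main

import (
	"fmt"
	"sync"
	"time"
)

type context struct {
	lock    sync.RWMutex
	handler func(string)
}

// getHandler mirrors the handler lookup / Context.getTask() path: it needs the
// read lock, so it blocks for as long as addNode() holds the write lock.
func (c *context) getHandler() func(string) {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.handler
}

// dispatch is the event-processing loop: every event is delivered through
// getHandler, i.e. through the lock.
func (c *context) dispatch(events <-chan string) {
	for ev := range events {
		c.getHandler()(ev)
	}
}

// addNode holds the write lock for its whole body and then waits for the
// "NodeAccepted" event, which can only be delivered by dispatch -- via the
// lock addNode is still holding.
func (c *context) addNode(events chan<- string) {
	c.lock.Lock()
	defer c.lock.Unlock()

	var wg sync.WaitGroup
	wg.Add(1)
	c.handler = func(ev string) {
		if ev == "NodeAccepted" {
			wg.Done()
		}
	}
	events <- "NodeAccepted" // stand-in for the NodeAccepted response to UpdateNode()
	wg.Wait()                // <--- stuck here, like the shim in registerNodes()
}

func main() {
	c := &context{handler: func(string) {}}
	events := make(chan string, 1)
	go c.dispatch(events)

	done := make(chan struct{})
	go func() {
		c.addNode(events)
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("addNode finished (no deadlock)")
	case <-time.After(2 * time.Second):
		fmt.Println("addNode is deadlocked: write lock held while waiting on the event loop")
	}
}
{code}

Run as-is, the sketch hits the two-second timeout and reports the deadlock; in the sketch, calling wg.Wait() only after the write lock has been released lets addNode() finish.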


