[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2629:
--------------------------------------------
    Attachment: updateNode_deadlock_trace.txt

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>         Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>        dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>               nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>               if !ok {
>                       return
>               }
>               [...] removed for clarity
>               wg.Done()
>       })
>       defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>       if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>               Nodes: nodesToRegister,
>               RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>       }); err != nil {
>               log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>               return nil, err
>       }
>       // wait for all responses to accumulate
>       wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> evend handler, which is returned from Context:
> {noformat}
> go func() {
>               for {
>                       select {
>                       case event := <-getDispatcher().eventChan:
>                               switch v := event.(type) {
>                               case events.TaskEvent:
>                                       getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>                               case events.ApplicationEvent:
>                                       getEventHandler(EventTypeApp)(v)
>                               case events.SchedulerNodeEvent:
>                                       getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org

Reply via email to