Peter Bacsko created YUNIKORN-2629:
--------------------------------------

             Summary: Adding a node can result in a deadlock
                 Key: YUNIKORN-2629
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
                return
        }
        [...] removed for clarity
        wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
}

// wait for all responses to accumulate
wg.Wait()  // <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the 
event handler, which is returned from Context:
{noformat}
go func() {
        for {
                select {
                case event := <-getDispatcher().eventChan:
                        switch v := event.(type) {
                        case events.TaskEvent:
                                getEventHandler(EventTypeTask)(v)  // <--- eventually calls Context.getTask()
                        case events.ApplicationEvent:
                                getEventHandler(EventTypeApp)(v)
                        case events.SchedulerNodeEvent:
                                getEventHandler(EventTypeNode)(v)
{noformat}

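Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck waiting for that lock, so the {{NodeAccepted}} event is never dispatched 
and {{wg.Wait()}} never returns.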



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
