Peter Bacsko created YUNIKORN-2629:
--------------------------------------

    Summary: Adding a node can result in a deadlock
        Key: YUNIKORN-2629
        URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
    Project: Apache YuniKorn
 Issue Type: Bug
 Components: shim - kubernetes
   Reporter: Peter Bacsko
   Assignee: Peter Bacsko

Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:

{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)

api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait()    <--- shim gets stuck here
{noformat}

If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:

{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)    <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}

Since {{addNode()}} is holding a write lock, the task handler blocks when it reaches {{Context.getTask()}} and the event processing loop gets stuck. The {{NodeAccepted}} event queued behind it is never delivered, so {{wg.Wait()}} never returns.