[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849878#comment-17849878
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-------------------------------------------------

I have walked through the draft PR and looked. As a side effect it fixes 
recoding an event on a node that has been rejected by the core from being 
added. That is a good change to have.

What I do not understand yet is why we need to have a lock on the context. 
Registering a node does not make any changes in the context. The only two 
things that make changes to the context are the applications and config map 
changes. It feels like the context lock is used to synchronise changes in the 
schedulerCache: i.e. make sure that subsequent calls from the context into the 
cache see the same scheduler cache. If that really is the reason we should make 
sure that it is handled in the cache.

Example: in the context the ForgetPod method calls GetPod on the cache and then 
ForgetPod on the cache. That should be one single call to ForgetPod in the 
cache removing the lookup, and the need for the context lock.

The fact that we now unlock the context while waiting for a response to come 
back makes me wonder if we need the context lock at all during that call stack.

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>        dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>               nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>               if !ok {
>                       return
>               }
>               [...] removed for clarity
>               wg.Done()
>       })
>       defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>       if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>               Nodes: nodesToRegister,
>               RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>       }); err != nil {
>               log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>               return nil, err
>       }
>       // wait for all responses to accumulate
>       wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> evend handler, which is returned from Context:
> {noformat}
> go func() {
>               for {
>                       select {
>                       case event := <-getDispatcher().eventChan:
>                               switch v := event.(type) {
>                               case events.TaskEvent:
>                                       getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>                               case events.ApplicationEvent:
>                                       getEventHandler(EventTypeApp)(v)
>                               case events.SchedulerNodeEvent:
>                                       getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org

Reply via email to