[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Fix Version/s: 1.5.2

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>     Nodes: nodesToRegister,
>     RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
> // wait for all responses to accumulate
> wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2629:
-
Labels: pull-request-available  (was: )

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: updateNode_deadlock_trace.txt
>
>






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2629:

Attachment: updateNode_deadlock_trace.txt

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
>
>






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Target Version: 1.6.0, 1.5.2

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Description: 
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
   dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
func(event interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
if err := 
ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck, so {{registerNodes()}} will never progress.

  was:
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
   dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
func(event interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck, so {{registerNodes()}} will never progress.


> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>

[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Description: 
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
   dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
func(event interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck, so {{registerNodes()}} will never progress.

  was:
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
   dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
func(event interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}

Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck.


> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>

[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Affects Version/s: 1.5.0

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   api := ctx.apiProvider.GetAPIs().SchedulerAPI
>   if err := api.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck.






[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Description: 
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
   dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
func(event interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}

Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck.

  was:
Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event 
interface{}) {
nodeEvent, ok := event.(CachedSchedulerNodeEvent)
if !ok {
return
}
[...] removed for clarity
wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, 
dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
Nodes: nodesToRegister,
RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
log.Log(log.ShimContext).Error("Failed to register nodes", 
zap.Error(err))
return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
for {
select {
case event := <-getDispatcher().eventChan:
switch v := event.(type) {
case events.TaskEvent:
getEventHandler(EventTypeTask)(v)  <--- 
eventually calls Context.getTask()
case events.ApplicationEvent:
getEventHandler(EventTypeApp)(v)
case events.SchedulerNodeEvent:
getEventHandler(EventTypeNode)(v)  
{noformat}

Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck.


> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed