[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Fix Version/s: 1.5.2

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.5.2
>
>         Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after YuniKorn state initialization can result in a deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
> {noformat}
>     dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>         nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>         if !ok {
>             return
>         }
>         [...] removed for clarity
>         wg.Done()
>     })
>     defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>     if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>         Nodes: nodesToRegister,
>         RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>     }); err != nil {
>         log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>         return nil, err
>     }
>
>     // wait for all responses to accumulate
>     wg.Wait()  <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
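
The interaction described in the report can be reproduced in isolation. The following is only a minimal sketch, not the shim's code and not the actual fix: `shimContext`, `taskEvent`, `nodeAccepted`, `addNode(holdLock)` and `waitLimit` are hypothetical names invented for illustration. It assumes a single dispatcher goroutine that handles events strictly in order, a task handler that needs a read lock on the context, and a caller that waits on a `sync.WaitGroup` for the node response. A timeout stands in for the hang so that the program terminates.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the shim's task event and node response.
type taskEvent struct{}
type nodeAccepted struct{}

// shimContext is a toy replacement for the shim Context; only its lock matters here.
type shimContext struct {
	lock sync.RWMutex
}

// getTask mimics a task handler that needs a read lock on the context.
func (c *shimContext) getTask() {
	c.lock.RLock()
	defer c.lock.RUnlock()
}

// waitLimit bounds the wait so the sketch reports a hang instead of blocking forever.
const waitLimit = 500 * time.Millisecond

// addNode queues a task event ahead of the node response and then waits for
// the response. With holdLock set, the write lock stays held across the wait
// (the pattern from the report); otherwise it is released before waiting.
// It reports whether the response arrived within waitLimit.
func addNode(holdLock bool) bool {
	ctx := &shimContext{}
	events := make(chan interface{}, 2)
	var wg sync.WaitGroup
	wg.Add(1)

	// Single dispatcher goroutine: events are handled one at a time, in order.
	go func() {
		for ev := range events {
			switch ev.(type) {
			case taskEvent:
				ctx.getTask() // blocks for as long as the write lock is held
			case nodeAccepted:
				wg.Done()
			}
		}
	}()

	ctx.lock.Lock()
	events <- taskEvent{}    // a task event happens to be queued first
	events <- nodeAccepted{} // the response the caller is waiting for
	close(events)
	if !holdLock {
		ctx.lock.Unlock()
	}

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	accepted := false
	select {
	case <-done:
		accepted = true
	case <-time.After(waitLimit):
		// The dispatcher is stuck behind the write lock.
	}
	if holdLock {
		ctx.lock.Unlock() // let the dispatcher drain so no goroutine is leaked
	}
	return accepted
}

func main() {
	fmt.Println("lock held during wait, response seen:", addNode(true))
	fmt.Println("lock released before wait, response seen:", addNode(false))
}
```

Under these assumptions the first call should time out, because the dispatcher blocks in `getTask()` and never reaches the node response, while the second call completes. Releasing the lock before waiting is only one possible remedy; the sketch does not claim to show how the issue was resolved in the shim.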
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YUNIKORN-2629:
-------------------------------------
    Labels: pull-request-available  (was: )
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2629:
--------------------------------------------
    Attachment: updateNode_deadlock_trace.txt
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Target Version: 1.6.0, 1.5.2
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Description: 
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.

  was:
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    api := ctx.apiProvider.GetAPIs().SchedulerAPI
    if err := api.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Description: 
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    api := ctx.apiProvider.GetAPIs().SchedulerAPI
    if err := api.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.

  was:
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    api := ctx.apiProvider.GetAPIs().SchedulerAPI
    if err := api.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Affects Version/s: 1.5.0
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

     [ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
    Description: 
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    api := ctx.apiProvider.GetAPIs().SchedulerAPI
    if err := api.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.

  was:
Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    api := ctx.apiProvider.GetAPIs().SchedulerAPI
    if err := api.UpdateNode({
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }

    // wait for all responses to accumulate
    wg.Wait()  <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>
> Adding a new node after YuniKorn state initialization can result in a deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
> {noformat}
>     dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>         nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>         if !ok {
>             return
>         }
>         [...] removed