[jira] [Updated] (YUNIKORN-2630) Release context lock in shim when processing config in the core
[ https://issues.apache.org/jira/browse/YUNIKORN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YUNIKORN-2630: Target Version: 1.6.0, 1.5.2 > Release context lock in shim when processing config in the core > --- > > Key: YUNIKORN-2630 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2630 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Labels: pull-request-available > > When a change comes in for the configmaps we process the change under a > context lock as we need to merge the two configmaps. > We keep this lock even if all the work is done in the shim and processing has > been transferred to the core. This is unneeded as the core has its own > locking and serialisation of the changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2630) Release context lock in shim when processing config in the core
[ https://issues.apache.org/jira/browse/YUNIKORN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YUNIKORN-2630: - Labels: pull-request-available (was: ) > Release context lock in shim when processing config in the core > --- > > Key: YUNIKORN-2630 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2630 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Labels: pull-request-available > > When a change comes in for the configmaps we process the change under a > context lock as we need to merge the two configmaps. > We keep this lock even if all the work is done in the shim and processing has > been transferred to the core. This is unneeded as the core has its own > locking and serialisation of the changes.
[jira] [Created] (YUNIKORN-2630) Release context lock in shim when processing config in the core
Wilfred Spiegelenburg created YUNIKORN-2630: --- Summary: Release context lock in shim when processing config in the core Key: YUNIKORN-2630 URL: https://issues.apache.org/jira/browse/YUNIKORN-2630 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg When a change comes in for the configmaps we process the change under a context lock as we need to merge the two configmaps. We keep this lock even if all the work is done in the shim and processing has been transferred to the core. This is unneeded as the core has its own locking and serialisation of the changes.
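The improvement described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern, not the real shim code: all type and function names (shimConfigContext, onConfigUpdate, sendToCore) are invented for the example. The merge is done under the context lock, the lock is released, and only then is the result handed to the core, which does its own locking and serialisation.

```go
package main

import (
	"fmt"
	"sync"
)

// shimConfigContext is an illustrative stand-in for the shim context that
// guards two configmaps with a single lock.
type shimConfigContext struct {
	mu      sync.Mutex
	configA map[string]string
	configB map[string]string
}

// onConfigUpdate merges the two configmaps under the lock, then releases
// the lock BEFORE calling into the core: the core serialises config
// changes itself, so holding the shim lock across that call is unneeded.
func (s *shimConfigContext) onConfigUpdate(sendToCore func(map[string]string)) {
	s.mu.Lock()
	merged := make(map[string]string, len(s.configA)+len(s.configB))
	for k, v := range s.configA {
		merged[k] = v
	}
	for k, v := range s.configB {
		merged[k] = v // second configmap wins on duplicate keys
	}
	s.mu.Unlock() // release before the (potentially slow) core processing

	sendToCore(merged)
}

func main() {
	s := &shimConfigContext{
		configA: map[string]string{"queues": "root"},
		configB: map[string]string{"policy": "fair"},
	}
	s.onConfigUpdate(func(m map[string]string) {
		fmt.Println("core received", len(m), "keys")
	})
}
```

Because `merged` is a private copy, the core call works on a consistent snapshot even though the shim lock is no longer held.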
[jira] [Resolved] (YUNIKORN-2628) fix release announcement links
[ https://issues.apache.org/jira/browse/YUNIKORN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YUNIKORN-2628. - Fix Version/s: 1.6.0 Resolution: Fixed links are fixed after removing the {{..}} from the path > fix release announcement links > -- > > Key: YUNIKORN-2628 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2628 > Project: Apache YuniKorn > Issue Type: Task > Components: website >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.0 > > > In YUNIKORN-2595 a regression snuck in breaking the links to the release > announcements. > Need to reverse that path change for the release announcements.
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847123#comment-17847123 ] Wilfred Spiegelenburg commented on YUNIKORN-2629: - I think we need to look at the context lock in the k8shim in general. The context lock is held while we do non-context work. There is no need to hold the lock if all we do is wait for a response that may or may not trigger post-processing. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] 
removed for clarity > wg.Done() > }) > defer dispatcher.UnregisterEventHandler(handlerID, > dispatcher.EventTypeNode) > if err := > ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({ > Nodes: nodesToRegister, > RmID: schedulerconf.GetSchedulerConf().ClusterID, > }); err != nil { > log.Log(log.ShimContext).Error("Failed to register nodes", > zap.Error(err)) > return nil, err > } > // wait for all responses to accumulate > wg.Wait() <--- shim gets stuck here > {noformat} > If tasks are being processed, then the dispatcher will try to retrieve the > event handler, which is returned from Context: > {noformat} > go func() { > for { > select { > case event := <-getDispatcher().eventChan: > switch v := event.(type) { > case events.TaskEvent: > getEventHandler(EventTypeTask)(v) <--- > eventually calls Context.getTask() > case events.ApplicationEvent: > getEventHandler(EventTypeApp)(v) > case events.SchedulerNodeEvent: > getEventHandler(EventTypeNode)(v) > {noformat} > Since {{addNode()}} is holding a write lock, the event processing loop gets > stuck, so {{registerNodes()}} will never progress.
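The deadlock described in this report can be reproduced in miniature. The sketch below is hypothetical (nodeContext and registerAndWait are invented names, not the real shim types): one goroutine plays the dispatcher's event loop, which must read-lock the context before it can deliver the NodeAccepted event, while the caller holds the write lock across wg.Wait(). A timeout is used only so the demo terminates instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeContext is an illustrative stand-in for the shim Context.
type nodeContext struct {
	mu sync.RWMutex
}

// registerAndWait returns true if the simulated NodeAccepted handler ran,
// false if it deadlocked (detected with a timeout so the demo finishes).
func (c *nodeContext) registerAndWait(holdLock bool) bool {
	if holdLock {
		c.mu.Lock() // addNode() style: write lock held across the wait
		defer c.mu.Unlock()
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// The dispatcher's event loop: it needs the context lock before
		// it can call the handler that signals completion.
		c.mu.RLock()
		defer c.mu.RUnlock()
		wg.Done()
	}()

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		// wg.Done() can never run while the write lock is held:
		// the classic self-deadlock from the issue description.
		return false
	}
}

func main() {
	c := &nodeContext{}
	fmt.Println("lock held while waiting, completed:", c.registerAndWait(true))
	fmt.Println("lock released before waiting, completed:", c.registerAndWait(false))
}
```

Dropping the write lock before waiting (the second call) lets the event loop acquire its read lock and deliver the event, which is the general direction the comment above suggests.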
(yunikorn-site) branch master updated: [YUNIKORN-2628] revert relative path for release announcement (#430)
This is an automated email from the ASF dual-hosted git repository. wilfreds pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git The following commit(s) were added to refs/heads/master by this push: new a29e6105c3 [YUNIKORN-2628] revert relative path for release announcement (#430) a29e6105c3 is described below commit a29e6105c3e10e8f80d616bf2acf52dbeec81fac Author: Wilfred Spiegelenburg AuthorDate: Fri May 17 11:44:46 2024 +1000 [YUNIKORN-2628] revert relative path for release announcement (#430) Closes: #430 Signed-off-by: Wilfred Spiegelenburg --- src/pages/community/download.md | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/pages/community/download.md b/src/pages/community/download.md index e3ae0259e3..683a866263 100644 --- a/src/pages/community/download.md +++ b/src/pages/community/download.md @@ -33,11 +33,11 @@ We publish prebuilt docker images for everyone's convenience. The latest release of Apache YuniKorn is v1.5.0. -| Version | Release date | Source download | Docker images [...] -|-|--|---|-- [...] -| v1.5.0 | 2024-03-14 | [Download](https://www.apache.org/dyn/closer.lua/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz)[Checksum](https://downloads.apache.org/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz.sha512) & [Signature](https://downloads.apache.org/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz.asc) | [scheduler](https://hub.docker.com/layers/apache/yunikorn/scheduler-1.5.0/images/sha256-9cefd0df164b9c4d39f9e10b010eaf7d8f89b130de1648e94f75b9b95d300a00)[admission- [...] 
-| v1.4.0 | 2023-11-20 | [Download](https://archive.apache.org/dist/yunikorn/1.4.0/apache-yunikorn-1.4.0-src.tar.gz)[Checksum](https://archive.apache.org/dist/yunikorn/1.4.0/apache-yunikorn-1.4.0-src.tar.gz.sha512) & [Signature](https://archive.apache.org/dist/yunikorn/1.4.0/apache-yunikorn-1.4.0-src.tar.gz.asc) | [scheduler](https://hub.docker.com/layers/apache/yunikorn/scheduler-1.4.0/images/sha256-d013be8e3ad7eb8e51ce23951e6899a4b74088e52c3767f3fcc7efcdcc0904f5)[admission- [...] -| v1.3.0 | 2023-06-12 | [Download](https://archive.apache.org/dist/yunikorn/1.3.0/apache-yunikorn-1.3.0-src.tar.gz)[Checksum](https://archive.apache.org/dist/yunikorn/1.3.0/apache-yunikorn-1.3.0-src.tar.gz.sha512) & [Signature](https://archive.apache.org/dist/yunikorn/1.3.0/apache-yunikorn-1.3.0-src.tar.gz.asc) | [scheduler](https://hub.docker.com/layers/apache/yunikorn/scheduler-1.3.0/images/sha256-99a1973728c6684b1da7631dbf015daa1dbf519dbab1ffc8b23fccdfa7ffd0c5)[admission- [...] +| Version | Release date | Source download | Docker images [...] +|-|--|---|-- [...] +| v1.5.0 | 2024-03-14 | [Download](https://www.apache.org/dyn/closer.lua/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz)[Checksum](https://downloads.apache.org/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz.sha512) & [Signature](https://downloads.apache.org/yunikorn/1.5.0/apache-yunikorn-1.5.0-src.tar.gz.asc) |
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YUNIKORN-2629: --- Description: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress. was: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] 
removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) >
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YUNIKORN-2629: --- Description: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress. was: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] 
removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { >
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YUNIKORN-2629: --- Affects Version/s: 1.5.0 > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.5.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] removed for clarity > wg.Done() > }) > defer dispatcher.UnregisterEventHandler(handlerID, > dispatcher.EventTypeNode) > api := ctx.apiProvider.GetAPIs().SchedulerAPI > if err := api.UpdateNode({ > Nodes: nodesToRegister, > RmID: schedulerconf.GetSchedulerConf().ClusterID, > }); err != nil { > log.Log(log.ShimContext).Error("Failed to register nodes", > zap.Error(err)) > return nil, err > } > // wait for all responses to accumulate > wg.Wait() <--- shim gets stuck here > {noformat} > If tasks are being processed, then the dispatcher will try to retrieve the > event handler, which is returned from Context: > {noformat} > go func() { > for { > select { > case event := <-getDispatcher().eventChan: > switch v := event.(type) { > case events.TaskEvent: > getEventHandler(EventTypeTask)(v) <--- > eventually calls Context.getTask() > case events.ApplicationEvent: > getEventHandler(EventTypeApp)(v) > case events.SchedulerNodeEvent: > getEventHandler(EventTypeNode)(v) > {noformat} > Since {{addNode()}} is holding a write lock, the event processing loop gets > stuck. 
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YUNIKORN-2629: --- Description: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck. was: Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] 
removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck. > Adding a node can result in a deadlock > -- > > Key: YUNIKORN-2629 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > > Adding a new node after Yunikorn state initialization can result in a > deadlock. > The problem is that {{Context.addNode()}} holds a lock while we're waiting > for the {{NodeAccepted}} event: > {noformat} >dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, > func(event interface{}) { > nodeEvent, ok := event.(CachedSchedulerNodeEvent) > if !ok { > return > } > [...] removed
[jira] [Created] (YUNIKORN-2629) Adding a node can result in a deadlock
Peter Bacsko created YUNIKORN-2629: -- Summary: Adding a node can result in a deadlock Key: YUNIKORN-2629 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event: {noformat} dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) { nodeEvent, ok := event.(CachedSchedulerNodeEvent) if !ok { return } [...] removed for clarity wg.Done() }) defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode) api := ctx.apiProvider.GetAPIs().SchedulerAPI if err := api.UpdateNode({ Nodes: nodesToRegister, RmID: schedulerconf.GetSchedulerConf().ClusterID, }); err != nil { log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err)) return nil, err } // wait for all responses to accumulate wg.Wait() <--- shim gets stuck here {noformat} If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context: {noformat} go func() { for { select { case event := <-getDispatcher().eventChan: switch v := event.(type) { case events.TaskEvent: getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask() case events.ApplicationEvent: getEventHandler(EventTypeApp)(v) case events.SchedulerNodeEvent: getEventHandler(EventTypeNode)(v) {noformat} Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.
(yunikorn-release) annotated tag v1.5.1 updated (ab98307 -> 758038d)
This is an automated email from the ASF dual-hosted git repository. pbacsko pushed a change to annotated tag v1.5.1 in repository https://gitbox.apache.org/repos/asf/yunikorn-release.git *** WARNING: tag v1.5.1 was modified! *** from ab98307 (commit) to 758038d (tag) tagging ab9830736c1617c8a765a113ccd58605e020f8f7 (commit) replaces v1.5.0 by Peter Bacsko on Thu May 16 13:21:03 2024 +0200 - Log - Apache YuniKorn v1.5.1 --- No new revisions were added by this update. Summary of changes:
(yunikorn-release) branch branch-1.5 updated: Update CHANGELOG for 1.5.1
This is an automated email from the ASF dual-hosted git repository. pbacsko pushed a commit to branch branch-1.5 in repository https://gitbox.apache.org/repos/asf/yunikorn-release.git The following commit(s) were added to refs/heads/branch-1.5 by this push: new ab98307 Update CHANGELOG for 1.5.1 ab98307 is described below commit ab9830736c1617c8a765a113ccd58605e020f8f7 Author: Peter Bacsko AuthorDate: Wed May 8 15:13:27 2024 +0200 Update CHANGELOG for 1.5.1 --- release-top-level-artifacts/CHANGELOG | 238 +++--- 1 file changed, 19 insertions(+), 219 deletions(-) diff --git a/release-top-level-artifacts/CHANGELOG b/release-top-level-artifacts/CHANGELOG index 07b6a6e..52983c7 100644 --- a/release-top-level-artifacts/CHANGELOG +++ b/release-top-level-artifacts/CHANGELOG @@ -16,237 +16,37 @@ # -Release Notes - Apache YuniKorn - Version 1.5.0 +Release Notes - Apache YuniKorn - Version 1.5.1 ** Sub-task -* [YUNIKORN-1709] - Add event streaming logic -* [YUNIKORN-1950] - Improving test coverage for whole user/group enforcement feature - Phase 2 -* [YUNIKORN-1956] - Add wildcard user/group limit e2e tests -* [YUNIKORN-2037] - Document the performance using kwok -* [YUNIKORN-2089] - Move usedResource type and tests to their own files -* [YUNIKORN-2116] - Track user/group events -* [YUNIKORN-2118] - Add smoke test for event streaming -* [YUNIKORN-2119] - Add check for parent queue user/group limit lower than child queue -* [YUNIKORN-2132] - Show active event streaming in the state dump -* [YUNIKORN-2136] - limit max resource should be greater than zero -* [YUNIKORN-2145] - refactor: ApplicationSummary into its own file -* [YUNIKORN-2147] - Limit the number of concurrent event streams -* [YUNIKORN-2151] - Report resource used by placeholder pods in the app summary -* [YUNIKORN-2159] - Clean up AppManager implementation -* [YUNIKORN-2163] - Fix HTTP status codes in some REST handlers -* [YUNIKORN-2164] - Use ParseUint instead of ParseInt in getEvents() -* [YUNIKORN-2175] - Add 
queue headRoom for Rest API querying and improve logs -* [YUNIKORN-2176] - Add test for user & group max resource changes -* [YUNIKORN-2180] - Clean up scheduler state initialization -* [YUNIKORN-2188] - Improve state transition event to include the eventinfo -* [YUNIKORN-2201] - Evaluate the performance impact of Headroom() and CanRunApp() -* [YUNIKORN-2203] - Possible log spew in UGM code -* [YUNIKORN-2205] - remove the warning of processing nonexistent "namespace.guaranteed" -* [YUNIKORN-2209] - Remove limit checks in QueueTracker -* [YUNIKORN-2210] - Metrics: use WithLabelValues instead of With -* [YUNIKORN-2212] - Don't collect requests that hasn't been scheduled yet or already triggered scale up -* [YUNIKORN-2231] - Show node list when hovering mouse over the node utitutilization bar chart -* [YUNIKORN-2257] - Add rest API to retrieve node utilization for multiple resource types -* [YUNIKORN-2264] - Add missing Originator and PreemptionPolicy fields to SI Allocation -* [YUNIKORN-2265] - Populate Originator and PreemptionPolicy on existing allocations -* [YUNIKORN-2284] - ERROR message when stopping Service context -* [YUNIKORN-2285] - Don't re-calculate reservationKey -* [YUNIKORN-2292] - Flaky E2E Test: Orphan pods still exist after TearDownNamespace() -* [YUNIKORN-2293] - Flaky E2E Test: Failed asserts in LogTestClusterInfoWrapper() blocked the resources cleanup steps -* [YUNIKORN-2294] - Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status -* [YUNIKORN-2309] - Add pod status updater logic to the MockScheduler performance test -* [YUNIKORN-2312] - Cleanup BinPacking e2e test workload before removing namespace -* [YUNIKORN-2313] - Flaky E2E Test: "Verify_basic_preemption" experiences flakiness due to race condition -* [YUNIKORN-2316] - Update REST API docs for /ws/v1/scheduler/node-utilizations -* [YUNIKORN-2325] - Add a chart to display multi-type resource utilisation (Web) -* [YUNIKORN-2335] - Use go standard 
library min and max functions -* [YUNIKORN-2337] - Update documentation about event streaming -* [YUNIKORN-2339] - Remove Nodes Utilisation chart from Dashboard page (Web) -* [YUNIKORN-2366] - Shim: Update GetPodResources() to handle in-place pod resource updates -* [YUNIKORN-2370] - Proper event handling for failed headroom checks -* [YUNIKORN-2373] - Extend EventRecord type with user/group related data -* [YUNIKORN-2379] - Adjust layout of node utilization chart(Web) -* [YUNIKORN-2381] - Update the copyright years in NOTICE files to 2024 -* [YUNIKORN-2382] - Expose K8s supported versions on web -* [YUNIKORN-2390] - Improve mousehover result for node utilization chart(Web) -* [YUNIKORN-2395] - Remove Jaeger
[jira] [Resolved] (YUNIKORN-2612) Tagging for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2612. Fix Version/s: 1.5.1 Resolution: Fixed > Tagging for 1.5.1 > - > > Key: YUNIKORN-2612 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2612 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: release >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 1.5.1 > > > Tagging for updating dependencies (SI/core/k8shim). > No branching is needed because we'll deliver the release from branch-1.5 > directly as we did with incubator minor releases.
[jira] [Resolved] (YUNIKORN-2602) Fix spelling/grammar in configvalidator
[ https://issues.apache.org/jira/browse/YUNIKORN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chia-Ping Tsai resolved YUNIKORN-2602. -- Fix Version/s: 1.6.0 Resolution: Fixed > Fix spelling/grammar in configvalidator > --- > > Key: YUNIKORN-2602 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2602 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - common >Reporter: Peter Bacsko >Assignee: Yun Sun >Priority: Trivial > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > Let's fix some minor grammar issues in configvalidator.go. > Eg.: "existed" -> "existing", but there could be other mistakes.
(yunikorn-core) branch master updated: [YUNIKORN-2602] Fix spelling/grammar in configvalidator.go (#869)
This is an automated email from the ASF dual-hosted git repository.

chia7712 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

The following commit(s) were added to refs/heads/master by this push:
     new 9d7fddf9 [YUNIKORN-2602] Fix spelling/grammar in configvalidator.go (#869)

9d7fddf9 is described below

commit 9d7fddf9a3618ec449e7ca974461d5a3745ac49f
Author: YUN SUN
AuthorDate: Thu May 16 18:05:28 2024 +0800

    [YUNIKORN-2602] Fix spelling/grammar in configvalidator.go (#869)

    Closes: #869

    Signed-off-by: Chia-Ping Tsai
---
 pkg/common/configs/configvalidator.go | 24
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/pkg/common/configs/configvalidator.go b/pkg/common/configs/configvalidator.go
index ec8e15a3..390b39b0 100644
--- a/pkg/common/configs/configvalidator.go
+++ b/pkg/common/configs/configvalidator.go
@@ -478,7 +478,7 @@ func checkPlacementFilter(filter Filter) error {
 }

 // Check a single limit entry
-func checkLimit(limit Limit, existedUserName map[string]bool, existedGroupName map[string]bool, queue *QueueConfig) error {
+func checkLimit(limit Limit, existingUserName map[string]bool, existingGroupName map[string]bool, queue *QueueConfig) error {
 	if len(limit.Users) == 0 && len(limit.Groups) == 0 {
 		return fmt.Errorf("empty user and group lists defined in limit '%v'", limit)
 	}
@@ -488,15 +488,15 @@ func checkLimit(limit Limit, existedUserName m
 			return fmt.Errorf("invalid limit user name '%s' in limit definition", name)
 		}
-		if existedUserName[name] {
+		if existingUserName[name] {
 			return fmt.Errorf("duplicated user name '%s', already exists", name)
 		}
-		existedUserName[name] = true
+		existingUserName[name] = true
 		// The user without wildcard should not happen after the wildcard user
 		// It means the wildcard for user should be the last item for limits object list which including the username,
 		// and we should only set one wildcard user for all limits
-		if existedUserName["*"] && name != "*" {
+		if existingUserName["*"] && name != "*" {
 			return fmt.Errorf("should not set no wildcard user %s after wildcard user limit", name)
 		}
 	}
@@ -505,15 +505,15 @@ func checkLimit(limit Limit, existedUserName m
 			return fmt.Errorf("invalid limit group name '%s' in limit definition", name)
 		}
-		if existedGroupName[name] {
-			return fmt.Errorf("duplicated group name '%s' , already existed", name)
+		if existingGroupName[name] {
+			return fmt.Errorf("duplicated group name '%s'", name)
 		}
-		existedGroupName[name] = true
+		existingGroupName[name] = true
 		// The group without wildcard should not happen after the wildcard group
 		// It means the wildcard for group should be the last item for limits object list which including the group name,
 		// and we should only set one wildcard group for all limits
-		if existedGroupName["*"] && name != "*" {
+		if existingGroupName["*"] && name != "*" {
 			return fmt.Errorf("should not set no wildcard group %s after wildcard group limit", name)
 		}
 	}
@@ -522,7 +522,7 @@ func checkLimit(limit Limit, existedUserName m
 	// If there is no specific group mentioned the wildcard group limit would thus be the same as the queue limit.
 	// For that reason we do not allow specifying only one group limit that is using the wildcard.
 	// There must be at least one limit with a group name defined.
-	if existedGroupName["*"] && len(existedGroupName) == 1 {
+	if existingGroupName["*"] && len(existingGroupName) == 1 {
 		return fmt.Errorf("should not specify only one group limit that is using the wildcard. " +
 			"There must be at least one limit with a group name defined ")
 	}
@@ -578,11 +578,11 @@ func checkLimits(limits []Limit, obj string, queue *QueueConfig) error {
 		zap.String("objName", obj),
 		zap.Int("limitsLength", len(limits)))
-	existedUserName := make(map[string]bool)
-	existedGroupName := make(map[string]bool)
+	existingUserName := make(map[string]bool)
+	existingGroupName := make(map[string]bool)
 	for _, limit := range limits {
-		if err := checkLimit(limit, existedUserName, existedGroupName, queue); err != nil {
+		if err :=
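The two validation rules this diff touches can be sketched in isolation. The checkNames helper below is hypothetical, not YuniKorn's API; it only illustrates the duplicate check and the wildcard-ordering check: a name may appear at most once, and the wildcard "*" (if present) must be the last entry in the list.

```go
package main

import "fmt"

// checkNames is a hypothetical, stripped-down version of the checks in
// checkLimit: every name may appear only once, and once the wildcard "*"
// has been seen, no other name may follow it.
func checkNames(names []string) error {
	seen := make(map[string]bool)
	for _, name := range names {
		if seen[name] {
			return fmt.Errorf("duplicated name '%s', already exists", name)
		}
		seen[name] = true
		// A non-wildcard entry after the wildcard is rejected, so the
		// wildcard is forced to be the last item in the list.
		if seen["*"] && name != "*" {
			return fmt.Errorf("should not set non-wildcard name %s after wildcard limit", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkNames([]string{"alice", "bob", "*"})) // nil: wildcard is last
	fmt.Println(checkNames([]string{"alice", "*", "bob"})) // error: name after wildcard
	fmt.Println(checkNames([]string{"alice", "alice"}))    // error: duplicate name
}
```

The single `seen` map serves both checks, which is why the rename from "existed" to "existing" in the real code touched both the duplicate and the wildcard branches.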
(yunikorn-web) annotated tag v1.5.1 updated (db71be7 -> 75d2434)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a change to annotated tag v1.5.1
in repository https://gitbox.apache.org/repos/asf/yunikorn-web.git

*** WARNING: tag v1.5.1 was modified! ***

    from db71be7 (commit)
      to 75d2434 (tag)
 tagging db71be72bae18e08b4264d06d781a841503aa283 (commit)
 replaces v1.4.0-1
      by Peter Bacsko
      on Thu May 16 11:56:36 2024 +0200

- Log -
Apache YuniKorn v1.5.1
---

No new revisions were added by this update.

Summary of changes:

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
(yunikorn-scheduler-interface) annotated tag v1.5.1 updated (3ad69cd -> 1628488)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a change to annotated tag v1.5.1
in repository https://gitbox.apache.org/repos/asf/yunikorn-scheduler-interface.git

*** WARNING: tag v1.5.1 was modified! ***

    from 3ad69cd (commit)
      to 1628488 (tag)
 tagging 3ad69cdbc247cb5ce54acff16e06331ed95cba8c (commit)
 replaces v1.5.0
      by Peter Bacsko
      on Thu May 16 11:56:12 2024 +0200

- Log -
Apache YuniKorn v1.5.1
---

No new revisions were added by this update.

Summary of changes:

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
(yunikorn-k8shim) annotated tag v1.5.1 updated (207e4031 -> d23c5134)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a change to annotated tag v1.5.1
in repository https://gitbox.apache.org/repos/asf/yunikorn-k8shim.git

*** WARNING: tag v1.5.1 was modified! ***

    from 207e4031 (commit)
      to d23c5134 (tag)
 tagging 207e4031c6484c965fca4018b6b8176afc5956b4 (commit)
 replaces v1.5.0
      by Peter Bacsko
      on Thu May 16 11:55:49 2024 +0200

- Log -
Apache YuniKorn v1.5.1
---

No new revisions were added by this update.

Summary of changes:

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
(yunikorn-core) annotated tag v1.5.1 updated (4856bc8d -> f121aeaf)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a change to annotated tag v1.5.1
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

*** WARNING: tag v1.5.1 was modified! ***

    from 4856bc8d (commit)
      to f121aeaf (tag)
 tagging 4856bc8d7d7bc41f6640e306435eeb885eee8a3f (commit)
 replaces v1.5.0
      by Peter Bacsko
      on Thu May 16 11:54:57 2024 +0200

- Log -
Apache YuniKorn v1.5.1
---

No new revisions were added by this update.

Summary of changes:

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2627) Add K8s 1.30 to the e2e matrix
[ https://issues.apache.org/jira/browse/YUNIKORN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg resolved YUNIKORN-2627.
---------------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

Upgraded kind to version 0.23 and added 1.30 as a new version to test with.

> Add K8s 1.30 to the e2e matrix
> ------------------------------
>
>                 Key: YUNIKORN-2627
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2627
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Tseng Hsi-Huang
>            Priority: Major
>              Labels: newbie, pull-request-available
>             Fix For: 1.6.0
>
> k8s 1.30 support in kind is now available as part of the [0.23 release|https://github.com/kubernetes-sigs/kind/releases/tag/v0.23.0].
> We need to add 1.30 to the matrix for the next release.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
(yunikorn-k8shim) branch master updated: [YUNIKORN-2627] Add K8s 1.30 to the e2e matrix (#840)
This is an automated email from the ASF dual-hosted git repository.

wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-k8shim.git

The following commit(s) were added to refs/heads/master by this push:
     new 5f80f49b [YUNIKORN-2627] Add K8s 1.30 to the e2e matrix (#840)

5f80f49b is described below

commit 5f80f49b2ee5acb3432b2d5534dbe7f3d3bcc2fc
Author: Tseng Hsi-Huang <9501...@gmail.com>
AuthorDate: Thu May 16 17:28:41 2024 +1000

    [YUNIKORN-2627] Add K8s 1.30 to the e2e matrix (#840)

    Closes: #840

    Signed-off-by: Wilfred Spiegelenburg
---
 .github/workflows/pre-commit.yml | 2 +-
 Makefile                         | 2 +-
 scripts/run-e2e-tests.sh         | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml
index 4131fde9..afed3906 100644
--- a/.github/workflows/pre-commit.yml
+++ b/.github/workflows/pre-commit.yml
@@ -43,7 +43,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        k8s: [v1.29.2, v1.28.7, v1.27.11, v1.26.14, v1.25.16, v1.24.17]
+        k8s: [v1.30.0, v1.29.2, v1.28.7, v1.27.11, v1.26.14, v1.25.16, v1.24.17]
         plugin: ['', '--plugin']
     steps:
       - name: Checkout source code
diff --git a/Makefile b/Makefile
index 50b3a659..76dd81fc 100644
--- a/Makefile
+++ b/Makefile
@@ -155,7 +155,7 @@ KUBECTL_VERSION=v1.27.7
 KUBECTL_BIN=$(TOOLS_DIR)/kubectl

 # kind
-KIND_VERSION=v0.20.0
+KIND_VERSION=v0.23.0
 KIND_BIN=$(TOOLS_DIR)/kind

 # helm
diff --git a/scripts/run-e2e-tests.sh b/scripts/run-e2e-tests.sh
index 07073c4e..02c21ec7 100755
--- a/scripts/run-e2e-tests.sh
+++ b/scripts/run-e2e-tests.sh
@@ -164,9 +164,10 @@ Examples:
     ${NAME} -a test -n yk8s -v kindest/node:v1.27.11
     ${NAME} -a test -n yk8s -v kindest/node:v1.28.7
     ${NAME} -a test -n yk8s -v kindest/node:v1.29.2
+    ${NAME} -a test -n yk8s -v kindest/node:v1.30.0

 Use a local helm chart path:
-    ${NAME} -a test -n yk8s -v kindest/node:v1.29.2 -p ../yunikorn-release/helm-charts/yunikorn
+    ${NAME} -a test -n yk8s -v kindest/node:v1.30.0 -p ../yunikorn-release/helm-charts/yunikorn
 EOF
 }

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2616) Remove unused bool return from PreemptionPredicates()
[ https://issues.apache.org/jira/browse/YUNIKORN-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YUNIKORN-2616:
-------------------------------------
    Labels: pull-request-available  (was: )

> Remove unused bool return from PreemptionPredicates()
> -----------------------------------------------------
>
>                 Key: YUNIKORN-2616
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2616
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: shim - kubernetes
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Hsien-Cheng(Ryan) Huang
>            Priority: Trivial
>              Labels: pull-request-available
>
> The predicate manager method {{PreemptionPredicates()}} returns two values: an
> int and a boolean. The boolean is false if the integer is -1 and true for 0 or
> larger. There is no need for the boolean, as the -1 already indicates the same.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
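The redundancy the ticket describes can be illustrated with a hedged sketch; findVictim below is a made-up example, not the shim's actual PreemptionPredicates signature. It shows why a boolean return alongside a -1 sentinel carries no extra information.

```go
package main

import "fmt"

// findVictimOld mirrors the pattern the ticket removes: an index plus a
// boolean, where the boolean is true exactly when the index is >= 0.
func findVictimOld(scores []int) (int, bool) {
	for i, s := range scores {
		if s > 0 {
			return i, true
		}
	}
	return -1, false
}

// findVictim drops the boolean: the -1 sentinel alone already signals
// "no candidate found", so callers lose nothing.
func findVictim(scores []int) int {
	for i, s := range scores {
		if s > 0 {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(findVictim([]int{0, 0, 3})) // 2: index of the first positive score
	fmt.Println(findVictim([]int{0, 0}))    // -1: no candidate
}
```

Callers that previously wrote `if idx, ok := ...; ok { ... }` simply test `if idx := ...; idx >= 0 { ... }` instead.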
[jira] [Commented] (YUNIKORN-2626) Add flag to helm chart to disable web container
[ https://issues.apache.org/jira/browse/YUNIKORN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846827#comment-17846827 ]

Wilfred Spiegelenburg commented on YUNIKORN-2626:
-------------------------------------------------

I have no strong feelings either way. The default should be to have the web container on, but that is it.
Create a PR to make it possible: the charts are [here|https://github.com/wilfred-s/yunikorn-release/tree/master/helm-charts/yunikorn]

> Add flag to helm chart to disable web container
> -----------------------------------------------
>
>                 Key: YUNIKORN-2626
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2626
>             Project: Apache YuniKorn
>          Issue Type: New Feature
>          Components: deployment
>            Reporter: Michael
>            Priority: Major
>
> For our use case we only really need the admission controller and scheduler.
> The helm chart currently does not provide a way to disable deploying the web
> container, and it would be great if that were possible.
> Is there any reason not to disable the web container?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
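A flag like the one requested above is usually implemented as a values toggle that guards the container in the deployment template. The fragment below is only a hypothetical sketch: the `web.enabled` key and the value paths are assumptions, not the actual yunikorn-release chart structure.

```yaml
# values.yaml (hypothetical layout)
web:
  enabled: true   # set to false to skip deploying the web container

# deployment.yaml template fragment (hypothetical), shown as comments so the
# Go-template markers are visible:
# {{- if .Values.web.enabled }}
#       - name: yunikorn-web
#         image: "{{ .Values.web.image.repository }}:{{ .Values.web.image.tag }}"
# {{- end }}
```

With this shape, `helm install ... --set web.enabled=false` would omit the web container while leaving the scheduler and admission controller untouched.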