[jira] [Created] (YUNIKORN-2654) Remove unused code in k8shim context
Wilfred Spiegelenburg created YUNIKORN-2654:
---
Summary: Remove unused code in k8shim context
Key: YUNIKORN-2654
URL: https://issues.apache.org/jira/browse/YUNIKORN-2654
Project: Apache YuniKorn
Issue Type: Task
Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg

The NotifyApplicationComplete and NotifyApplicationFail functions are not called by anything and are unused code. The K8shim does not trigger application completion or failure; the core triggers it when the application no longer has any activity registered.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance
[ https://issues.apache.org/jira/browse/YUNIKORN-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YUNIKORN-2653:
-
Labels: pull-request-available (was: )

> Gang scheduling K8s event formatting compliance
> ---
>
> Key: YUNIKORN-2653
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Minor
> Labels: pull-request-available
>
> K8s events come with definitions and rules for the content of the fields
> within the event. Adjust the content of gang scheduling related events to
> comply with the rules.
> Focussed on the reason and action fields only.
> * 'reason' is the reason this event is generated. 'reason' should be short
> and unique; it should be in UpperCamelCase format (starting with a capital
> letter).
> * 'action' explains what happened with the regarding object, or what action
> the ReportingController took in the object's name; it should be in
> UpperCamelCase format (starting with a capital letter).
> No spaces or long text.
[jira] [Created] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance
Wilfred Spiegelenburg created YUNIKORN-2653:
---
Summary: Gang scheduling K8s event formatting compliance
Key: YUNIKORN-2653
URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
Project: Apache YuniKorn
Issue Type: Improvement
Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg

K8s events come with definitions and rules for the content of the fields within the event. Adjust the content of gang scheduling related events to comply with the rules.
Focussed on the reason and action fields only.
* 'reason' is the reason this event is generated. 'reason' should be short and unique; it should be in UpperCamelCase format (starting with a capital letter).
* 'action' explains what happened with the regarding object, or what action the ReportingController took in the object's name; it should be in UpperCamelCase format (starting with a capital letter).
No spaces or long text.
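Editorial illustration for YUNIKORN-2653: the formatting rule quoted above could be checked mechanically. The sketch below is only an approximation and is not taken from the YuniKorn code base; the helper name isUpperCamelCase and the sample strings are assumptions made for this example. It accepts strings that start with a capital letter and contain only ASCII letters and digits, which rules out spaces and long free text but cannot tell real word boundaries apart (an all-caps string would also pass).

```go
package main

import (
	"fmt"
	"regexp"
)

// upperCamelCase matches a non-empty run of ASCII letters and digits that
// starts with a capital letter: no spaces and no punctuation.
var upperCamelCase = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// isUpperCamelCase reports whether s is an acceptable event 'reason' or
// 'action' under the approximation described above (hypothetical helper).
func isUpperCamelCase(s string) bool {
	return upperCamelCase.MatchString(s)
}

func main() {
	// Sample values are made up for illustration.
	for _, s := range []string{"GangScheduling", "gangScheduling", "Gang scheduling", ""} {
		fmt.Printf("%q -> %v\n", s, isUpperCamelCase(s))
	}
}
```

A check like this would fit naturally in a unit test that walks over all event reasons and actions a component emits.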
[jira] [Commented] (YUNIKORN-182) fix lint issues
[ https://issues.apache.org/jira/browse/YUNIKORN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850896#comment-17850896 ]

Wilfred Spiegelenburg commented on YUNIKORN-182:

File a new Jira for this. It needs to be fixed in all the HTTP servers we create in our code; those are spread over multiple repositories and all need to be checked:
{code:java}
pkg/cmd/admissioncontroller/main.go:143:15: G112: Potential Slowloris Attack because ReadHeaderTimeout is not configured in the http.Server (gosec)
{code}
This one should get an ignore from the lint side; we do not need crypto-quality randomness here:
{code:java}
test/e2e/framework/helpers/common/utils.go:105:18: G404: Use of weak random number generator (math/rand instead of crypto/rand) (gosec)
    b[i] = letters[rand.Intn(len(letters))]
{code}
All the ineffective assignments and shadowing remarks can and should be fixed.
Formatting issues can and should be fixed.
The function length ones are dubious and we should probably just add the {{//nolint:funlen}} remark on them, especially since they are almost all test functions.

> fix lint issues
> ---
>
> Key: YUNIKORN-182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-182
> Project: Apache YuniKorn
> Issue Type: Task
> Components: build
> Reporter: Wilfred Spiegelenburg
> Assignee: Yun Sun
> Priority: Minor
> Labels: pull-request-available
>
> When we added the lint test most major issues were fixed. There are still a
> lot of issues, especially in tests, that need to be fixed.
> This is a container Jira to track that work on both the k8shim and the core
> repos.
> Work should be split into multiple parts (per linter?)
[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST
[ https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850889#comment-17850889 ]

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-
Code change committed, working on documentation before closing.

> Expose running placement rules in REST
> --
>
> Key: YUNIKORN-2581
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - common
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Labels: pull-request-available
>
> Since placement rules are now always used, and since the recovery rule was
> introduced, the queue config does not correctly show the running rules.
> Also, if a config update has been rejected, for any reason, the rules would
> not be correct.
> Exposing the configured rules from the placement manager works around all
> these issues.
(yunikorn-core) branch master updated: [YUNIKORN-2581] Expose running placement rules in REST (#857)
This is an automated email from the ASF dual-hosted git repository.

wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

The following commit(s) were added to refs/heads/master by this push:
     new 81d69749 [YUNIKORN-2581] Expose running placement rules in REST (#857)
81d69749 is described below

commit 81d697494d972d01c01135dd981e1ec2fc89f57d
Author: Wilfred Spiegelenburg
AuthorDate: Fri May 31 11:26:57 2024 +1000

    [YUNIKORN-2581] Expose running placement rules in REST (#857)

    Add a new REST endpoint under the partition that returns the currently
    active placement rules:
        /ws/v1/partition/:partition/placementrules
    The call retrieves the active set from the placement manager. This will
    include the recovery rule.

    The rule interface is extended with a new private method ruleDAO() that
    must be implemented by all rules and returns the RuleDAO object.

    Add the currently active set of placement rules to the state dump. This
    handles multiple partitions. It wraps the list of RuleDAO objects in a
    RuleDAOInfo per partition.

    Closes: #857

    Signed-off-by: Wilfred Spiegelenburg
---
 go.mod                                        |  1 +
 go.sum                                        |  2 +
 pkg/scheduler/partition.go                    |  9 ++-
 pkg/scheduler/placement/filter.go             | 47 -
 pkg/scheduler/placement/filter_test.go        | 65 +-
 pkg/scheduler/placement/fixed_rule.go         | 19 +
 pkg/scheduler/placement/fixed_rule_test.go    | 38 ++
 pkg/scheduler/placement/placement.go          | 23 +--
 pkg/scheduler/placement/placement_test.go     | 51 --
 pkg/scheduler/placement/provided_rule.go      | 17 +
 pkg/scheduler/placement/provided_rule_test.go | 33 +
 pkg/scheduler/placement/recovery_rule.go      | 10 +++
 pkg/scheduler/placement/recovery_rule_test.go |  9 +++
 pkg/scheduler/placement/rule.go               | 44
 pkg/scheduler/placement/rule_test.go          |  9 +++
 pkg/scheduler/placement/tag_rule.go           | 18 +
 pkg/scheduler/placement/tag_rule_test.go      | 38 ++
 pkg/scheduler/placement/testrule.go           | 17 +
 pkg/scheduler/placement/user_rule.go          | 17 +
 pkg/scheduler/placement/user_rule_test.go     | 33 +
 pkg/webservice/dao/rule_info.go               | 39 +++
 pkg/webservice/handlers.go                    | 32 +
 pkg/webservice/handlers_test.go               | 99 ---
 pkg/webservice/routes.go                      |  6 ++
 pkg/webservice/state_dump.go                  |  2 +
 25 files changed, 623 insertions(+), 55 deletions(-)

diff --git a/go.mod b/go.mod
index 65b2c5a8..5f180926 100644
--- a/go.mod
+++ b/go.mod
@@ -33,6 +33,7 @@ require (
 	github.com/prometheus/common v0.45.0
 	github.com/sasha-s/go-deadlock v0.3.1
 	go.uber.org/zap v1.26.0
+	golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8
 	golang.org/x/net v0.21.0
 	golang.org/x/time v0.5.0
 	google.golang.org/grpc v1.58.3
diff --git a/go.sum b/go.sum
index 35ee67e4..f04fef47 100644
--- a/go.sum
+++ b/go.sum
@@ -50,6 +50,8 @@ go.uber.org/multierr v1.10.0 h1:S0h4aNzvfcFsC3dRF1jLoaov7oRaKqRGC/pUEJ2yvPQ=
 go.uber.org/multierr v1.10.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y=
 go.uber.org/zap v1.26.0 h1:sI7k6L95XOKS281NhVKOFCUNIvv9e0w4BF8N3u+tCRo=
 go.uber.org/zap v1.26.0/go.mod h1:dtElttAiwGvoJ/vj4IwHBS/gXsEu/pZ50mUIRWuG0so=
+golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8 h1:ESSUROHIBHg7USnszlcdmjBEwdMj9VUvU+OPk4yl2mc=
+golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8/go.mod h1:/lliqkxwWAhPjf5oSOIJup2XcqJaw8RGS6k3TGEc7GI=
 golang.org/x/net v0.23.0 h1:7EYJ93RZ9vYSZAIb2x3lnuvqO5zneoD6IvWjuhfxjTs=
 golang.org/x/net v0.23.0/go.mod h1:JKghWKKOSdJwpW2GEx0Ja7fmaKnMsbu+MWVZTokSYmg=
 golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4=
diff --git a/pkg/scheduler/partition.go b/pkg/scheduler/partition.go
index 5662bd75..fc2c5404 100644
--- a/pkg/scheduler/partition.go
+++ b/pkg/scheduler/partition.go
@@ -481,14 +481,19 @@ func (pc *PartitionContext) getQueueInternal(name string) *objects.Queue {
 	return queue
 }

-// Get the queue info for the whole queue structure to pass to the webservice
+// GetPartitionQueues builds the queue info for the whole queue structure to pass to the webservice
 func (pc *PartitionContext) GetPartitionQueues() dao.PartitionQueueDAOInfo {
 	partitionQueueDAOInfo := pc.root.GetPartitionQueueDAOInfo(true)
 	partitionQueueDAOInfo.Partition = common.GetPartitionNameWithoutClusterID(pc.Name)
 	return partitionQueueDAOInfo
 }

-// Create the recovery queue.
+// GetPlacementRules returns the current active rule set as dao to expose to the webservice
+func (pc *PartitionContext) GetPlacementRules() []*dao.RuleDAO {
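Editorial illustration for the commit above: the message describes a private ruleDAO() method that every rule implements, so that the active rule set can be returned over REST. The archived diff is cut off before the real types appear, so the sketch below uses simplified, hypothetical stand-ins (RuleDAO with only two fields, fixedRule, recoveryRule, activeRuleDAOs) and should not be read as the actual yunikorn-core definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RuleDAO is a hypothetical, simplified stand-in for the DAO object that a
// rule exposes to the web service.
type RuleDAO struct {
	Name       string            `json:"name"`
	Parameters map[string]string `json:"parameters,omitempty"`
}

// rule mirrors the idea in the commit message: each rule must implement a
// private ruleDAO() method.
type rule interface {
	getName() string
	ruleDAO() *RuleDAO
}

// fixedRule is an illustrative rule that places everything in one queue.
type fixedRule struct{ queue string }

func (r *fixedRule) getName() string { return "fixed" }
func (r *fixedRule) ruleDAO() *RuleDAO {
	return &RuleDAO{Name: r.getName(), Parameters: map[string]string{"queue": r.queue}}
}

// recoveryRule is an illustrative rule without parameters.
type recoveryRule struct{}

func (r *recoveryRule) getName() string   { return "recovery" }
func (r *recoveryRule) ruleDAO() *RuleDAO { return &RuleDAO{Name: r.getName()} }

// activeRuleDAOs converts the active rules, in order, into their DAO form.
func activeRuleDAOs(rules []rule) []*RuleDAO {
	daos := make([]*RuleDAO, 0, len(rules))
	for _, r := range rules {
		daos = append(daos, r.ruleDAO())
	}
	return daos
}

func main() {
	rules := []rule{&fixedRule{queue: "root.default"}, &recoveryRule{}}
	out, err := json.Marshal(activeRuleDAOs(rules))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the method private means only the placement package decides how a rule is rendered, while the web service consumes the resulting DAO list.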
[jira] [Resolved] (YUNIKORN-2567) Remove Application reference from applicationEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2567.

Fix Version/s: 1.6.0
Resolution: Fixed

> Remove Application reference from applicationEvents
> ---
>
> Key: YUNIKORN-2567
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
(yunikorn-core) branch master updated: [YUNIKORN-2567] Remove Application reference from applicationEvents (#881)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

The following commit(s) were added to refs/heads/master by this push:
     new 3917da66 [YUNIKORN-2567] Remove Application reference from applicationEvents (#881)
3917da66 is described below

commit 3917da665fc023a86f0b66e234f972cc887dba3b
Author: Peter Bacsko
AuthorDate: Thu May 30 21:02:19 2024 +0200

    [YUNIKORN-2567] Remove Application reference from applicationEvents (#881)

    Closes: #881

    Signed-off-by: Peter Bacsko
---
 pkg/scheduler/objects/application.go             |  12 +-
 pkg/scheduler/objects/application_events.go      |  47
 pkg/scheduler/objects/application_events_test.go | 138 +++
 pkg/scheduler/objects/application_state.go       |   4 +-
 pkg/scheduler/objects/application_test.go        |   4 +-
 pkg/scheduler/objects/queue.go                   |   2 +-
 6 files changed, 77 insertions(+), 130 deletions(-)

diff --git a/pkg/scheduler/objects/application.go b/pkg/scheduler/objects/application.go
index b76466be..80769855 100644
--- a/pkg/scheduler/objects/application.go
+++ b/pkg/scheduler/objects/application.go
@@ -187,8 +187,8 @@ func NewApplication(siApp *si.AddApplicationRequest, ugi security.UserGroup, eve
 	app.user = ugi
 	app.rmEventHandler = eventHandler
 	app.rmID = rmID
-	app.appEvents = newApplicationEvents(app, events.GetEventSystem())
-	app.appEvents.sendNewApplicationEvent()
+	app.appEvents = newApplicationEvents(events.GetEventSystem())
+	app.appEvents.sendNewApplicationEvent(app.ApplicationID)
 	return app
 }

@@ -2106,12 +2106,12 @@ func (sa *Application) updateRunnableStatus(runnableInQueue, runnableByUserLimit
 			log.Log(log.SchedApplication).Info("Application is now runnable in queue",
 				zap.String("appID", sa.ApplicationID),
 				zap.String("queue", sa.queuePath))
-			sa.appEvents.sendAppRunnableInQueueEvent()
+			sa.appEvents.sendAppRunnableInQueueEvent(sa.ApplicationID)
 		} else {
 			log.Log(log.SchedApplication).Info("Maximum number of running applications reached the queue limit",
 				zap.String("appID", sa.ApplicationID),
 				zap.String("queue", sa.queuePath))
-			sa.appEvents.sendAppNotRunnableInQueueEvent()
+			sa.appEvents.sendAppNotRunnableInQueueEvent(sa.ApplicationID)
 		}
 	}
 	sa.runnableInQueue = runnableInQueue
@@ -2123,14 +2123,14 @@ func (sa *Application) updateRunnableStatus(runnableInQueue, runnableByUserLimit
 				zap.String("queue", sa.queuePath),
 				zap.String("user", sa.user.User),
 				zap.Strings("groups", sa.user.Groups))
-			sa.appEvents.sendAppRunnableQuotaEvent()
+			sa.appEvents.sendAppRunnableQuotaEvent(sa.ApplicationID)
 		} else {
 			log.Log(log.SchedApplication).Info("Maximum number of running applications reached the user/group limit",
 				zap.String("appID", sa.ApplicationID),
 				zap.String("queue", sa.queuePath),
 				zap.String("user", sa.user.User),
 				zap.Strings("groups", sa.user.Groups))
-			sa.appEvents.sendAppNotRunnableQuotaEvent()
+			sa.appEvents.sendAppNotRunnableQuotaEvent(sa.ApplicationID)
 		}
 	}
 	sa.runnableByUserLimit = runnableByUserLimit
diff --git a/pkg/scheduler/objects/application_events.go b/pkg/scheduler/objects/application_events.go
index bb94ffbf..04fe51a0 100644
--- a/pkg/scheduler/objects/application_events.go
+++ b/pkg/scheduler/objects/application_events.go
@@ -20,7 +20,6 @@ package objects

 import (
 	"fmt"
-
 	"github.com/apache/yunikorn-core/pkg/common"
 	"github.com/apache/yunikorn-core/pkg/events"
 	"github.com/apache/yunikorn-scheduler-interface/lib/go/si"
@@ -28,15 +27,14 @@ import (

 type applicationEvents struct {
 	eventSystem events.EventSystem
-	app         *Application
 }

 func (evt *applicationEvents) sendPlaceholderLargerEvent(ph *Allocation, request *AllocationAsk) {
 	if !evt.eventSystem.IsEventTrackingEnabled() {
 		return
 	}
-	message := fmt.Sprintf("Task group '%s' in application '%s': allocation resources '%s' are not matching placeholder '%s' allocation with ID '%s'", ph.GetTaskGroup(), evt.app.ApplicationID, request.GetAllocatedResource(), ph.GetAllocatedResource(), ph.GetAllocationKey())
-	event :=
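Editorial illustration for the commit above: the diff removes the back reference from applicationEvents to the Application and passes the application ID into each send call instead. The sketch below shows that shape in isolation. The eventSystem interface, the recordingEventSystem type, and the AddEvent method are hypothetical stand-ins invented for this example; only the idea of passing the ID per call comes from the diff.

```go
package main

import "fmt"

// eventSystem is a hypothetical, minimal stand-in for the event system.
type eventSystem interface {
	IsEventTrackingEnabled() bool
	AddEvent(objectID, message string)
}

// recordingEventSystem stores events in memory so the example can print them.
type recordingEventSystem struct {
	enabled bool
	events  []string
}

func (r *recordingEventSystem) IsEventTrackingEnabled() bool { return r.enabled }
func (r *recordingEventSystem) AddEvent(objectID, message string) {
	r.events = append(r.events, objectID+": "+message)
}

// applicationEvents holds only the event system. It keeps no pointer to the
// application, so the sender has no reason to touch application state.
type applicationEvents struct {
	eventSystem eventSystem
}

func newApplicationEvents(es eventSystem) *applicationEvents {
	return &applicationEvents{eventSystem: es}
}

// sendNewApplicationEvent receives the application ID from the caller.
func (evt *applicationEvents) sendNewApplicationEvent(appID string) {
	if !evt.eventSystem.IsEventTrackingEnabled() {
		return
	}
	evt.eventSystem.AddEvent(appID, "application added")
}

func main() {
	rec := &recordingEventSystem{enabled: true}
	evt := newApplicationEvents(rec)
	evt.sendNewApplicationEvent("app-1")
	fmt.Println(rec.events)
}
```

One plausible benefit of this shape, not stated in the commit message, is that the event helper can be tested without constructing a full application object.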
[jira] [Created] (YUNIKORN-2652) Expand getApplication() endpoint handler to optionally return resource usage
Rich Scott created YUNIKORN-2652:

Summary: Expand getApplication() endpoint handler to optionally return resource usage
Key: YUNIKORN-2652
URL: https://issues.apache.org/jira/browse/YUNIKORN-2652
Project: Apache YuniKorn
Issue Type: Improvement
Components: scheduler-interface
Reporter: Rich Scott

Some users would like to be able to see resource usage (preempted, placeholder resource, etc.) for applications that have completed. The `getApplication()` endpoint handler should be enhanced to take an optional parameter specifying that the user would like resource details included in the response. A new `ApplicationXXXDAOInfo` object that is a slight superset of `ApplicationDAOInfo` should be introduced for use in that response.
[jira] [Updated] (YUNIKORN-2567) Remove Application reference from applicationEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YUNIKORN-2567:
-
Labels: pull-request-available (was: )

> Remove Application reference from applicationEvents
> ---
>
> Key: YUNIKORN-2567
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (YUNIKORN-2643) utils.go WaitFor improvement
[ https://issues.apache.org/jira/browse/YUNIKORN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

HUAN-IU LIOU updated YUNIKORN-2643:
---
Summary: utils.go WaitFor improvement (was: utils.go WaitForCondition test coverage improvement)

> utils.go WaitFor improvement
> -
>
> Key: YUNIKORN-2643
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2643
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: HUAN-IU LIOU
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (YUNIKORN-2644) Improve FitInScore funtion's test coverage in resources.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2644:
---
Fix Version/s: 1.6.0

> Improve FitInScore funtion's test coverage in resources.go
> --
>
> Key: YUNIKORN-2644
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2644
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2629:
---
Fix Version/s: 1.5.2

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.5.2
> Attachments: updateNode_deadlock_trace.txt
>
> Adding a new node after YuniKorn state initialization can result in a
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting
> for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>     Nodes: nodesToRegister,
>     RmID: schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
> // wait for all responses to accumulate
> wg.Wait() <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets
> stuck, so {{registerNodes()}} will never progress.
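Editorial illustration for YUNIKORN-2629: the report describes a writer that keeps its lock while waiting for an acknowledgement that can only be produced by a loop which itself needs that lock. The sketch below shows one general way to avoid that shape, by releasing the lock before blocking. It is not the fix that was merged (the archive does not show it), and the registry type, its methods, and the channel of handler functions are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for a context object guarded by a lock.
type registry struct {
	lock  sync.RWMutex
	nodes map[string]bool
}

// has is the kind of call an event handler might make; it needs the read lock.
func (r *registry) has(name string) bool {
	r.lock.RLock()
	defer r.lock.RUnlock()
	return r.nodes[name]
}

// addNode records the node under the write lock, releases the lock, and only
// then waits for the acknowledgement from the event loop.
func (r *registry) addNode(name string, events chan<- func()) {
	r.lock.Lock()
	r.nodes[name] = true
	r.lock.Unlock() // release before blocking on the event loop

	var wg sync.WaitGroup
	wg.Add(1)
	events <- func() {
		// This would block forever if addNode still held the write lock.
		_ = r.has(name)
		wg.Done()
	}
	wg.Wait()
}

func main() {
	events := make(chan func())
	go func() {
		// Single event loop, comparable in spirit to a dispatcher goroutine.
		for handle := range events {
			handle()
		}
	}()

	r := &registry{nodes: map[string]bool{}}
	r.addNode("node-1", events)
	close(events)
	fmt.Println("node registered:", r.has("node-1"))
}
```

Moving the r.lock.Unlock() call to after wg.Wait() reproduces the hang described in the report, which makes the ordering the key point of the example.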
[jira] [Comment Edited] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850682#comment-17850682 ]

Peter Bacsko edited comment on YUNIKORN-2629 at 5/30/24 11:04 AM:
--
Fix has been merged to branch-1.5. For master, the change is likely going to be different.

was (Author: pbacsko):
Fix has been merged to branch-1.5. The fix for master is likely going to be different.
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850682#comment-17850682 ]

Peter Bacsko commented on YUNIKORN-2629:

Fix has been merged to branch-1.5. The fix for master is likely going to be different.
[jira] [Updated] (YUNIKORN-2651) Update the unchecked error for make lint warnings
[ https://issues.apache.org/jira/browse/YUNIKORN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YUNIKORN-2651:
-
Labels: pull-request-available (was: )

> Update the unchecked error for make lint warnings
> -
>
> Key: YUNIKORN-2651
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2651
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Reporter: Chia-Ping Tsai
> Assignee: Yun Sun
> Priority: Major
> Labels: pull-request-available
>
> Fix the lint warnings about "unhandled error".
[jira] [Resolved] (YUNIKORN-2642) Don't set resources on the recovery queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2642.

Resolution: Fixed

> Don't set resources on the recovery queue
> -
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> Resource constraints can be set on dynamic queues based on application
> tags. We should not set them on the recovery queue, because there is no
> quota on it.
[jira] [Updated] (YUNIKORN-2642) Don't set resources on the recovery queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2642:
---
Fix Version/s: 1.6.0
               1.5.2
Target Version: 1.6.0, 1.5.2 (was: 1.6.0)

Merged to master & branch-1.5.

> Don't set resources on the recovery queue
> -
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> Resource constraints can be set on dynamic queues based on application
> tags. We should not set them on the recovery queue, because there is no
> quota on it.
(yunikorn-core) branch branch-1.5 updated: [YUNIKORN-2642] Don't set resources on the recovery queue (#879)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch branch-1.5
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

The following commit(s) were added to refs/heads/branch-1.5 by this push:
     new bb07ef0d [YUNIKORN-2642] Don't set resources on the recovery queue (#879)
bb07ef0d is described below

commit bb07ef0d22b4067539ac4d2db8a15f1bd031b40c
Author: pbacsko
AuthorDate: Thu May 30 09:46:23 2024 +0200

    [YUNIKORN-2642] Don't set resources on the recovery queue (#879)
---
 pkg/scheduler/objects/queue.go  |  5 +
 pkg/scheduler/partition_test.go | 16
 2 files changed, 21 insertions(+)

diff --git a/pkg/scheduler/objects/queue.go b/pkg/scheduler/objects/queue.go
index 84d9d7b5..a5edece5 100644
--- a/pkg/scheduler/objects/queue.go
+++ b/pkg/scheduler/objects/queue.go
@@ -729,6 +729,11 @@ func (sq *Queue) AddApplication(app *Application) {
 	appID := app.ApplicationID
 	sq.applications[appID] = app
 	sq.queueEvents.sendNewApplicationEvent(sq.QueuePath, appID)
+	if common.IsRecoveryQueue(sq.QueuePath) {
+		// don't set tag-based resources on the recovery queue
+		return
+	}
+
 	// YUNIKORN-199: update the quota from the namespace
 	// get the tag with the quota
 	quota := app.GetTag(siCommon.AppTagNamespaceResourceQuota)
diff --git a/pkg/scheduler/partition_test.go b/pkg/scheduler/partition_test.go
index 5a275c0e..bd1b5900 100644
--- a/pkg/scheduler/partition_test.go
+++ b/pkg/scheduler/partition_test.go
@@ -979,6 +979,22 @@ func TestAddAppForced(t *testing.T) {
 	assert.Equal(t, common.RecoveryQueueFull, partApp3.GetQueuePath(), "wrong queue path for app3")
 	assert.Check(t, recoveryQueue == partApp3.GetQueue(), "wrong queue for app3")
 	assert.Equal(t, 3, len(recoveryQueue.GetCopyOfApps()), "wrong queue length")
+
+	// add recovered forced apps with resource tags
+	app4 := newApplicationTags("app-4", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce: "true",
+		siCommon.AppTagNamespaceResourceGuaranteed: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app4)
+	assert.NilError(t, err, "app4 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
+	app5 := newApplicationTags("app-5", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce: "true",
+		siCommon.AppTagNamespaceResourceQuota: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app5)
+	assert.NilError(t, err, "app5 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
 }

 func TestAddAppForcedWithPlacement(t *testing.T) {
(yunikorn-core) branch master updated: [YUNIKORN-2642] Don't set resources on the recovery queue (#878)
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git

The following commit(s) were added to refs/heads/master by this push:
     new 37d7e5c5 [YUNIKORN-2642] Don't set resources on the recovery queue (#878)
37d7e5c5 is described below

commit 37d7e5c5431c3aec9686b8bfb787cdccf0834549
Author: Peter Bacsko
AuthorDate: Thu May 30 09:47:13 2024 +0200

    [YUNIKORN-2642] Don't set resources on the recovery queue (#878)

    Closes: #878

    Signed-off-by: Peter Bacsko
---
 pkg/scheduler/partition.go      |  5 +++--
 pkg/scheduler/partition_test.go | 16
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/pkg/scheduler/partition.go b/pkg/scheduler/partition.go
index 11ee05ae..5662bd75 100644
--- a/pkg/scheduler/partition.go
+++ b/pkg/scheduler/partition.go
@@ -320,8 +320,9 @@ func (pc *PartitionContext) AddApplication(app *objects.Application) error {
 	queue := pc.getQueueInternal(queueName)
 
 	// create the queue if necessary
+	isRecoveryQueue := common.IsRecoveryQueue(queueName)
 	if queue == nil {
-		if common.IsRecoveryQueue(queueName) {
+		if isRecoveryQueue {
 			queue, err = pc.createRecoveryQueue()
 			if err != nil {
 				return fmt.Errorf("failed to create recovery queue %s for application %s", common.RecoveryQueueFull, appID)
@@ -341,7 +342,7 @@ func (pc *PartitionContext) AddApplication(app *objects.Application) error {
 
 	guaranteedRes := app.GetGuaranteedResource()
 	maxRes := app.GetMaxResource()
-	if guaranteedRes != nil || maxRes != nil {
+	if !isRecoveryQueue && (guaranteedRes != nil || maxRes != nil) {
 		// set resources based on tags, but only if the queue is dynamic (unmanaged)
 		if queue.IsManaged() {
 			log.Log(log.SchedQueue).Warn("Trying to set resources on a queue that is not an unmanaged leaf",
diff --git a/pkg/scheduler/partition_test.go b/pkg/scheduler/partition_test.go
index eabb1ad4..e99d5761 100644
--- a/pkg/scheduler/partition_test.go
+++ b/pkg/scheduler/partition_test.go
@@ -964,6 +964,22 @@ func TestAddAppForced(t *testing.T) {
 	assert.Equal(t, common.RecoveryQueueFull, partApp3.GetQueuePath(), "wrong queue path for app3")
 	assert.Check(t, recoveryQueue == partApp3.GetQueue(), "wrong queue for app3")
 	assert.Equal(t, 3, len(recoveryQueue.GetCopyOfApps()), "wrong queue length")
+
+	// add recovered forced apps with resource tags
+	app4 := newApplicationTags("app-4", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce: "true",
+		siCommon.AppTagNamespaceResourceGuaranteed: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app4)
+	assert.NilError(t, err, "app4 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
+	app5 := newApplicationTags("app-5", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce: "true",
+		siCommon.AppTagNamespaceResourceQuota: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app5)
+	assert.NilError(t, err, "app5 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
 }
 
 func TestAddAppForcedWithPlacement(t *testing.T) {