[jira] [Resolved] (YUNIKORN-2725) Temporarily disable failing e2e preemption tests
[ https://issues.apache.org/jira/browse/YUNIKORN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2725.
Fix Version/s: 1.6.0
Resolution: Fixed

> Temporarily disable failing e2e preemption tests
>
> Key: YUNIKORN-2725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
> Project: Apache YuniKorn
> Issue Type: Test
> Components: shim - kubernetes, test - e2e
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Disable the following tests to have green builds:
> Verify_preemption_on_priority_queue
> Verify_basic_preemption
> Verify_allow_preemption_tag

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2725) Temporarily disable failing e2e tests
Peter Bacsko created YUNIKORN-2725:

Summary: Temporarily disable failing e2e tests
Key: YUNIKORN-2725
URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
Project: Apache YuniKorn
Issue Type: Test
Components: shim - kubernetes, test - e2e
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Disable the following tests to have green builds:
Verify_preemption_on_priority_queue
Verify_basic_preemption
[jira] [Created] (YUNIKORN-2724) Improve the signature of methods notifyTaskComplete() and ensureAppAndTaskCreated()
Peter Bacsko created YUNIKORN-2724:

Summary: Improve the signature of methods notifyTaskComplete() and ensureAppAndTaskCreated()
Key: YUNIKORN-2724
URL: https://issues.apache.org/jira/browse/YUNIKORN-2724
Project: Apache YuniKorn
Issue Type: Improvement
Components: shim - kubernetes
Reporter: Peter Bacsko

From the review [https://github.com/apache/yunikorn-k8shim/pull/864]

"I also think we need to change the signature for {{notifyTaskComplete(string, string)}} to {{notifyTaskComplete(*Application, string)}}
Probably better to use a separate jira for that as it flows through into {{NotifyTaskComplete()}} and some tests. The 2 tests have the application pointer already. It removes a number of extra getApplication() calls we really do not need.
Similar for {{ensureAppAndTaskCreated()}} which is only ever called from this function. Add a parameter to it to make it: {{ensureAppAndTaskCreated(*v1.Pod, *Application)}} and only execute application creation {{if app == nil}}. This can be either in this jira or in a separate one."

That is, optimize the methods so that we avoid unnecessary {{GetApplication()}} calls.
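To make the proposed signature change concrete, here is a minimal sketch of the idea, not the k8shim code itself. The `Application` and `Context` types, the `lookups` counter, and the `notifyTaskCompleteByID` and `ensureApp` helpers are all hypothetical stand-ins invented for illustration; only the names `notifyTaskComplete` and `getApplication` come from the review comment.

```go
package main

import "fmt"

// Application and Context are simplified, hypothetical stand-ins
// for the real k8shim types.
type Application struct {
	ID        string
	completed []string
}

type Context struct {
	apps    map[string]*Application
	lookups int // counts getApplication calls, for illustration only
}

func (c *Context) getApplication(id string) *Application {
	c.lookups++
	return c.apps[id]
}

// Before: the callee receives an ID and has to look the application up again.
func (c *Context) notifyTaskCompleteByID(appID, taskID string) {
	if app := c.getApplication(appID); app != nil {
		app.completed = append(app.completed, taskID)
	}
}

// After: the caller already holds the pointer, so no lookup is needed.
func (c *Context) notifyTaskComplete(app *Application, taskID string) {
	if app == nil {
		return
	}
	app.completed = append(app.completed, taskID)
}

// ensureApp creates the application only when the caller did not pass one,
// mirroring the "only execute application creation if app == nil" suggestion.
func (c *Context) ensureApp(appID string, app *Application) *Application {
	if app == nil {
		app = &Application{ID: appID}
		c.apps[appID] = app
	}
	return app
}

func main() {
	ctx := &Context{apps: map[string]*Application{}}
	app := ctx.ensureApp("app-1", nil)
	ctx.notifyTaskComplete(app, "task-1")
	fmt.Println("completed:", len(app.completed), "lookups:", ctx.lookups)
}
```

With the pointer-based variant the lookup counter stays at zero, which is the saving the review comment is after.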
[jira] [Resolved] (YUNIKORN-2182) Set ReadHeaderTimeout in http server
[ https://issues.apache.org/jira/browse/YUNIKORN-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2182.
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> Set ReadHeaderTimeout in http server
>
> Key: YUNIKORN-2182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2182
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common, webapp
> Reporter: Wilfred Spiegelenburg
> Assignee: Chenchen Lai
> Priority: Major
> Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
> Potential Slowloris Attack because ReadHeaderTimeout is not configured in the
> http.Server (gosec)
> We do not set ReadTimeout or ReadHeaderTimeout so we do not have a timeout at
> all at the moment.
> BTW: this is not important for the webtest servers we build as they are just
> for our tests.
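For readers unfamiliar with the gosec finding, the following sketch shows how `ReadHeaderTimeout` is set on a standard library `http.Server`. The `newServer` helper and the 10 second value are assumptions made for illustration; the issue does not say which value the YuniKorn webapp uses.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer is a hypothetical helper. It returns a server that gives up on
// clients that take too long to send their request headers, which is the
// usual mitigation for Slowloris-style attacks.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second, // assumed value, not taken from the issue
	}
}

func main() {
	srv := newServer("127.0.0.1:0", http.NewServeMux())
	fmt.Println("header timeout:", srv.ReadHeaderTimeout)
}
```

Without this field (or `ReadTimeout`) the zero value means no timeout at all, which is the situation the issue describes.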
[jira] [Resolved] (YUNIKORN-2568) Move all xxxEvents types to objects/events
[ https://issues.apache.org/jira/browse/YUNIKORN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2568.
Fix Version/s: 1.6.0
Resolution: Fixed

> Move all xxxEvents types to objects/events
>
> Key: YUNIKORN-2568
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2568
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package
[ https://issues.apache.org/jira/browse/YUNIKORN-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2564.
Fix Version/s: 1.6.0
Resolution: Fixed

> [Umbrella] Move xxxEvents types to a different package
>
> Key: YUNIKORN-2564
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0
>
> There are several Events that can be moved to a different package:
> * queueEvents
> * applicationEvents
> * askEvents
> * nodeEvents
> There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity
> to clean it up a bit and move these under eg.
> {{pkg/scheduler/objects/events}}.
[jira] [Created] (YUNIKORN-2708) Release notes for 1.5.2
Peter Bacsko created YUNIKORN-2708:

Summary: Release notes for 1.5.2
Key: YUNIKORN-2708
URL: https://issues.apache.org/jira/browse/YUNIKORN-2708
Project: Apache YuniKorn
Issue Type: Sub-task
Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2709) Update website for 1.5.2
Peter Bacsko created YUNIKORN-2709:

Summary: Update website for 1.5.2
Key: YUNIKORN-2709
URL: https://issues.apache.org/jira/browse/YUNIKORN-2709
Project: Apache YuniKorn
Issue Type: Sub-task
Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2706) [UMBRELLA] YuniKorn 1.5.2 release efforts
Peter Bacsko created YUNIKORN-2706:

Summary: [UMBRELLA] YuniKorn 1.5.2 release efforts
Key: YUNIKORN-2706
URL: https://issues.apache.org/jira/browse/YUNIKORN-2706
Project: Apache YuniKorn
Issue Type: Task
Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko

This umbrella is to track the work items needed for the 1.5.2 release.
Release manager: Peter Bacsko.
This release only contains bug fixes.
[jira] [Created] (YUNIKORN-2707) Tagging for 1.5.2
Peter Bacsko created YUNIKORN-2707:

Summary: Tagging for 1.5.2
Key: YUNIKORN-2707
URL: https://issues.apache.org/jira/browse/YUNIKORN-2707
Project: Apache YuniKorn
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Resolved] (YUNIKORN-2704) Event publish errors out when predicates fail
[ https://issues.apache.org/jira/browse/YUNIKORN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2704.
Fix Version/s: 1.6.0
               1.5.2
Resolution: Fixed

Merged to master & branch-1.5

> Event publish errors out when predicates fail
>
> Key: YUNIKORN-2704
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2704
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0, 1.5.2
>
> I consistently see this error in the logs when events are published.
> I did put some debug logs and found that I only get it when the events for
> untolerated taints are published.
> E0618 17:43:17.858946 1 event_broadcaster.go:270] "Server rejected
> event (will not retry!)" err="Event \"<>.17da2a31072bb32f\" is
> invalid: [action: Required value, reason: Required value]"
> event="\{ObjectMeta:{<>.17da2a31072bb32f dpi-dev 0
> 0001-01-01 00:00:00 + UTC map[] map[] [] []
> []},EventTime:2024-06-18 17:43:17.857332069 + UTC
> m=+84279.014490005,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-59bdc88fdc-7h5bt,Action:,Reason:,Regarding:\{Pod
> <> <> 5c90315c-a07d-4801-9ecc-baf61ee45f11 v1
> 4323324038 },Related:nil,Note:Predicate failed for request
> '5c90315c-a07d-4801-9ecc-baf61ee45f11' with message: 'node(s) had untolerated
> taint \{<>: <>}',Type:Normal,DeprecatedSource:\{
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
[jira] [Resolved] (YUNIKORN-2694) Improve placement rule funtion's test coverage - 2
[ https://issues.apache.org/jira/browse/YUNIKORN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2694.
Fix Version/s: 1.6.0
Resolution: Fixed

> Improve placement rule funtion's test coverage - 2
>
> Key: YUNIKORN-2694
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2694
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased
[ https://issues.apache.org/jira/browse/YUNIKORN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2683.
Fix Version/s: 1.6.0
Resolution: Fixed

> Unnecessary error is logged when resource usage is increased
>
> Key: YUNIKORN-2683
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2683
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.6.0
>
> The refactored code in YUNIKORN-2542 contains an unnecessary warning message:
> {noformat}
>     appGroup := userTracker.getGroupForApp(applicationID)
>     log.Log(log.SchedUGM).Debug("Increasing resource usage for user",
>         zap.String("user", user.User),
>         zap.String("queue path", queuePath),
>         zap.String("application", applicationID),
>         zap.String("group", appGroup),
>         zap.Stringer("resource", usage))
>     groupTracker := m.GetGroupTracker(appGroup)
>     if groupTracker == nil {
>         log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map",
>             zap.String("application", applicationID),
>             zap.String("group", appGroup))
>         return
>     }
>     ...
> {noformat}
> We don't always have a {{groupTracker}}. The previous code simply called
> {{increaseTrackedResource()}} on an empty tracker:
> {noformat}
> func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) {
>     ut.Lock()
>     defer ut.Unlock()
>     ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage)
>     hierarchy := strings.Split(queuePath, configs.DOT)
>     ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage)
>     gt := ut.appGroupTrackers[applicationID]
>     log.Log(log.SchedUGM).Debug("Increasing resource usage for group",
>         zap.String("group", gt.getName()),
>         zap.Strings("queue path", hierarchy),
>         zap.String("application", applicationID),
>         zap.Stringer("resource", usage))
>     gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) <- can be null
> }
> {noformat}
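The behaviour asked for in this issue can be illustrated with a small sketch: a missing group tracker is treated as a normal case and skipped quietly instead of being reported as an error. The `Manager` and `GroupTracker` types and the `increaseGroupUsage` method below are hypothetical simplifications, not the UGM code, and the actual fix merged for YUNIKORN-2683 may look different.

```go
package main

import "fmt"

// GroupTracker and Manager are hypothetical, stripped-down stand-ins
// for the UGM types mentioned in the issue.
type GroupTracker struct {
	name  string
	usage int
}

type Manager struct {
	groupTrackers map[string]*GroupTracker
}

// increaseGroupUsage reports whether a tracker was updated.
// A missing tracker is a legitimate state (not every application has a
// tracked group), so it is skipped without logging an error.
func (m *Manager) increaseGroupUsage(group string, amount int) bool {
	gt := m.groupTrackers[group]
	if gt == nil {
		return false
	}
	gt.usage += amount
	return true
}

func main() {
	m := &Manager{groupTrackers: map[string]*GroupTracker{
		"dev": {name: "dev"},
	}}
	fmt.Println(m.increaseGroupUsage("dev", 5))     // tracker exists
	fmt.Println(m.increaseGroupUsage("unknown", 5)) // no tracker, no error
}
```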
[jira] [Resolved] (YUNIKORN-2661) Fix hard-coded boolean in setLimit
[ https://issues.apache.org/jira/browse/YUNIKORN-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2661.
Fix Version/s: 1.6.0
               1.5.2
Resolution: Fixed

Merged to master & branch-1.5

> Fix hard-coded boolean in setLimit
>
> Key: YUNIKORN-2661
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2661
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}},
> so this variable never reaches the leaves:
> {noformat}
> // Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
> func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
>     log.Log(log.SchedUGM).Debug("Setting limits",
>         zap.String("queue path", qt.queuePath),
>         zap.Strings("hierarchy", hierarchy),
>         zap.Uint64("max applications", maxApps),
>         zap.Stringer("max resources", maxResource),
>         zap.Bool("use wild card", useWildCard))
>     // depth first: all the way to the leaf, create if not exists
>     // more than 1 in the slice means we need to recurse down
>     if len(hierarchy) > 1 {
>         childName := hierarchy[1]
>         if qt.childQueueTrackers[childName] == nil {
>             qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
>         }
>         qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false) <-- should be "doWildCardCheck" not "false"
>     ...
> {noformat}
> Fix this and create a unit test for {{setLimit()}}.
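The bug pattern is easy to reproduce in isolation. The sketch below is a heavily reduced, hypothetical version of the recursion: the `QueueTracker` here only has a name, children and a stored flag, and storing the flag on the leaf is an assumption made so the effect can be observed. What matters is that the recursive call forwards `doWildCardCheck` instead of a literal `false`.

```go
package main

import "fmt"

// QueueTracker is a reduced, hypothetical stand-in for the UGM type.
type QueueTracker struct {
	name            string
	children        map[string]*QueueTracker
	doWildCardCheck bool // stored on the leaf so the example can observe it
}

func newQueueTracker(name string) *QueueTracker {
	return &QueueTracker{name: name, children: map[string]*QueueTracker{}}
}

// setLimit walks depth first to the leaf named by hierarchy, creating
// children as needed, and forwards the flag unchanged at every level.
func (qt *QueueTracker) setLimit(hierarchy []string, doWildCardCheck bool) {
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		if qt.children[childName] == nil {
			qt.children[childName] = newQueueTracker(childName)
		}
		// Passing a literal false here would silently drop the caller's choice.
		qt.children[childName].setLimit(hierarchy[1:], doWildCardCheck)
		return
	}
	qt.doWildCardCheck = doWildCardCheck
}

func main() {
	root := newQueueTracker("root")
	root.setLimit([]string{"root", "parent", "leaf"}, true)
	leaf := root.children["parent"].children["leaf"]
	fmt.Println("leaf saw flag:", leaf.doWildCardCheck)
}
```

A unit test along the lines requested in the issue would assert on the leaf, since the root alone cannot reveal whether the flag survived the recursion.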
[jira] [Resolved] (YUNIKORN-2516) Update documentation about event.RESTResponseSize
[ https://issues.apache.org/jira/browse/YUNIKORN-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2516.
Fix Version/s: 1.6.0
Resolution: Fixed

> Update documentation about event.RESTResponseSize
>
> Key: YUNIKORN-2516
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2516
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: documentation
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2512) Event system properties are not used
[ https://issues.apache.org/jira/browse/YUNIKORN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2512.
Fix Version/s: 1.6.0
Resolution: Fixed

> Event system properties are not used
>
> Key: YUNIKORN-2512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2512
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - common
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0
>
> There are two properties which are not used by the event system:
> # The property "event.requestCapacity" is supposed to determine the size of a
> slice which is used between the core and shim to transfer events every 2
> seconds. However, right now it's not used at all, we use the default (1000)
> every time.
> # The property "RESTResponseSize" is not even in the code at all. It
> influences the maximum number of entries returned in the batch API.
> Currently, the hard coded value is 1.
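As an illustration of what "using the property" means, the sketch below reads `event.requestCapacity` from a plain string map and falls back to the default of 1000 quoted in the issue. The map-based configuration and the `requestCapacity` helper are assumptions for the example; YuniKorn's real configuration access is not shown in this message.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultRequestCapacity is the default value quoted in the issue.
const defaultRequestCapacity = 1000

// requestCapacity is a hypothetical helper: it returns the configured
// capacity, or the default when the key is missing or not a positive integer.
func requestCapacity(cfg map[string]string) int {
	raw, ok := cfg["event.requestCapacity"]
	if !ok {
		return defaultRequestCapacity
	}
	value, err := strconv.Atoi(raw)
	if err != nil || value <= 0 {
		return defaultRequestCapacity
	}
	return value
}

func main() {
	fmt.Println(requestCapacity(map[string]string{}))
	fmt.Println(requestCapacity(map[string]string{"event.requestCapacity": "250"}))
}
```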
[jira] [Resolved] (YUNIKORN-2245) Application sorting: improve pending resource filtering
[ https://issues.apache.org/jira/browse/YUNIKORN-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2245.
Resolution: Won't Do

> Application sorting: improve pending resource filtering
>
> Key: YUNIKORN-2245
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2245
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
>
> When sorting applications, we do a filtering on pending resources:
> {noformat}
> func filterOnPendingResources(apps map[string]*Application) []*Application {
>     filteredApps := make([]*Application, 0)
>     for _, app := range apps {
>         // Only look at app when pending-res > 0
>         if resources.StrictlyGreaterThanZero(app.GetPendingResource()) {
>             filteredApps = append(filteredApps, app)
>         }
>     }
>     return filteredApps
> }
> {noformat}
> This filtering is relatively expensive, but necessary, because during the
> lifecycle of an application, {{sa.pending}} can become 0 and in this case, we
> don't want to schedule anything from the app.
> Suggested approach is to track total pendingAskRepeats inside the app. That
> way we don't need to call {{resources.StrictlyGreaterThanZero()}} and we
> perform a simple integer comparison.
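The suggested approach, which was ultimately resolved as "Won't Do", can be sketched as follows. The `Application` type with a `pendingAsks` counter and the `addAsks` and `askAllocated` methods are hypothetical; the point is only that the filter becomes a plain integer comparison.

```go
package main

import "fmt"

// Application is a hypothetical, reduced type that keeps a running count of
// outstanding ask repeats instead of comparing resource objects.
type Application struct {
	ID          string
	pendingAsks int
}

// addAsks records newly requested allocations.
func (a *Application) addAsks(repeat int) { a.pendingAsks += repeat }

// askAllocated records that one pending ask has been satisfied.
func (a *Application) askAllocated() {
	if a.pendingAsks > 0 {
		a.pendingAsks--
	}
}

// filterOnPending keeps only applications that still have something to
// schedule, using an integer check per application.
func filterOnPending(apps map[string]*Application) []*Application {
	filtered := make([]*Application, 0, len(apps))
	for _, app := range apps {
		if app.pendingAsks > 0 {
			filtered = append(filtered, app)
		}
	}
	return filtered
}

func main() {
	apps := map[string]*Application{
		"a": {ID: "a"},
		"b": {ID: "b"},
	}
	apps["a"].addAsks(2)
	apps["a"].askAllocated()
	fmt.Println("apps with pending asks:", len(filterOnPending(apps)))
}
```

The trade-off is that the counter has to be kept consistent on every ask add, allocation and removal, which may be part of why the change was not pursued.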
[jira] [Closed] (YUNIKORN-2221) Performance improvements phase II
[ https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko closed YUNIKORN-2221.

> Performance improvements phase II
>
> Key: YUNIKORN-2221
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2221
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler, shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
> Fix For: 1.5.0
>
> Umbrella JIRA for further performance improvements in Yunikorn.
> The main issues have been addressed in YUNIKORN-1715. However, it's still
> possible to reduce memory and CPU usage further by doing smaller things.
[jira] [Resolved] (YUNIKORN-2221) Performance improvements phase II
[ https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2221.
Fix Version/s: 1.5.0
Resolution: Fixed

> Performance improvements phase II
>
> Key: YUNIKORN-2221
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2221
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler, shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
> Fix For: 1.5.0
>
> Umbrella JIRA for further performance improvements in Yunikorn.
> The main issues have been addressed in YUNIKORN-1715. However, it's still
> possible to reduce memory and CPU usage further by doing smaller things.
[jira] [Resolved] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance
[ https://issues.apache.org/jira/browse/YUNIKORN-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2653.
Fix Version/s: 1.6.0
Resolution: Fixed

> Gang scheduling K8s event formatting compliance
>
> Key: YUNIKORN-2653
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
> The K8s events provide definitions and rules around the content of the fields
> within the event. Adjust the content of gang scheduling related events to
> comply with the rules.
> Focussed on the reason and action fields only.
> * 'reason' is the reason this event is generated. 'reason' should be short
> and unique; it should be in UpperCamelCase format (starting with a capital
> letter).
> * 'action' explains what happened with regarding/ what action did the
> ReportingController take in objects name; it should be in UpperCamelCase
> format (starting with a capital letter).
> No space or long text.
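The formatting rule for 'reason' and 'action' can be approximated with a small check like the one below. This is a rough heuristic written for illustration, not a validator taken from Kubernetes or YuniKorn: `isUpperCamelCase` is a hypothetical helper, and it only verifies that the value is non-empty, starts with an upper-case letter and contains nothing but letters and digits. The sample values in `main` are made up.

```go
package main

import (
	"fmt"
	"unicode"
)

// isUpperCamelCase is a hypothetical, approximate check for event 'reason'
// and 'action' values: non-empty, leading capital, letters and digits only.
// It cannot tell real word boundaries apart, so "ALLCAPS" also passes.
func isUpperCamelCase(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 {
			if !unicode.IsUpper(r) {
				return false
			}
			continue
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	// Sample values are invented for the example.
	for _, candidate := range []string{"GangScheduling", "gang scheduling", "Placeholder timed out"} {
		fmt.Printf("%q compliant: %v\n", candidate, isUpperCamelCase(candidate))
	}
}
```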
[jira] [Created] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased
Peter Bacsko created YUNIKORN-2683:

Summary: Unnecessary error is logged when resource usage is increased
Key: YUNIKORN-2683
URL: https://issues.apache.org/jira/browse/YUNIKORN-2683
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko

The refactored code in YUNIKORN-2542 contains an unnecessary warning message:
{noformat}
    appGroup := userTracker.getGroupForApp(applicationID)
    log.Log(log.SchedUGM).Debug("Increasing resource usage for user",
        zap.String("user", user.User),
        zap.String("queue path", queuePath),
        zap.String("application", applicationID),
        zap.String("group", appGroup),
        zap.Stringer("resource", usage))
    groupTracker := m.GetGroupTracker(appGroup)
    if groupTracker == nil {
        log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map",
            zap.String("application", applicationID),
            zap.String("group", appGroup))
        return
    }
    ...
{noformat}
We don't always have a {{groupTracker}}. The previous code simply called {{increaseTrackedResource()}} on an empty tracker:
{noformat}
func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) {
    ut.Lock()
    defer ut.Unlock()
    ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage)
    hierarchy := strings.Split(queuePath, configs.DOT)
    ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage)
    gt := ut.appGroupTrackers[applicationID]
    log.Log(log.SchedUGM).Debug("Increasing resource usage for group",
        zap.String("group", gt.getName()),
        zap.Strings("queue path", hierarchy),
        zap.String("application", applicationID),
        zap.Stringer("resource", usage))
    gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) <- can be null
}
{noformat}
[jira] [Resolved] (YUNIKORN-2680) Improve placement rule funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2680.
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> Improve placement rule funtion's test coverage
>
> Key: YUNIKORN-2680
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2680
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Created] (YUNIKORN-2681) Data race in TestGetStream_Limit
Peter Bacsko created YUNIKORN-2681:

Summary: Data race in TestGetStream_Limit
Key: YUNIKORN-2681
URL: https://issues.apache.org/jira/browse/YUNIKORN-2681
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler, test - unit
Reporter: Peter Bacsko
Assignee: Peter Bacsko

A data race was detected during a unit test:
{noformat}
==================
WARNING: DATA RACE
Write at 0x0170c220 by goroutine 2575:
  github.com/apache/yunikorn-core/pkg/webservice.NewWebApp()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/webservice.go:82 +0x11c
  github.com/apache/yunikorn-core/pkg/webservice.TestCheckHealthStatusNotFound()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2574 +0x2f
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44

Previous read at 0x0170c220 by goroutine 2542:
  github.com/apache/yunikorn-core/pkg/webservice.getStream()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers.go:1225 +0xbd3
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit.gowrap4()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0x4f

Goroutine 2575 (running) created at:
  testing.(*T).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2027 +0xf17
  main.main()
      _testmain.go:163 +0x2e4

Goroutine 2542 (running) created at:
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0xbb7
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44
==================
2024-06-18T13:40:54.182Z  INFO  core.events  events/event_streaming.go:164  Removing event stream consumer  {"name": "host-1", "creation time": "2024-06-18T13:40:54.181Z"}
2024-06-18T13:40:54.182Z  INFO  core.scheduler.health  webservice/handlers.go:623  Health check is not available
--- FAIL: TestCheckHealthStatusNotFound (0.00s)
    testing.go:1398: race detected during execution of test
{noformat}
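The report shows one goroutine writing a package-level variable in `NewWebApp()` while a goroutine left over from another test still reads it in `getStream()`. One common way to make such shared state race-free is to guard it with a `sync.RWMutex`, sketched below. The `schedulerContext` type and the `setContext` and `getContext` helpers are hypothetical, and this is not necessarily how YUNIKORN-2681 was fixed; waiting for the reader goroutine to finish before the next test starts would be another option.

```go
package main

import (
	"fmt"
	"sync"
)

// schedulerContext is a hypothetical placeholder for the shared state.
type schedulerContext struct {
	name string
}

var (
	ctxLock sync.RWMutex
	ctx     *schedulerContext // package-level state shared between goroutines
)

// setContext replaces the shared value under the write lock.
func setContext(c *schedulerContext) {
	ctxLock.Lock()
	defer ctxLock.Unlock()
	ctx = c
}

// getContext reads the shared value under the read lock.
func getContext() *schedulerContext {
	ctxLock.RLock()
	defer ctxLock.RUnlock()
	return ctx
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func(n int) {
			defer wg.Done()
			setContext(&schedulerContext{name: fmt.Sprintf("ctx-%d", n)})
		}(i)
		go func() {
			defer wg.Done()
			_ = getContext() // concurrent read, safe under the lock
		}()
	}
	wg.Wait()
	fmt.Println("context set:", getContext() != nil)
}
```

Running this sketch with `go run -race` should not report a race, because every access to `ctx` goes through the lock.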
[jira] [Resolved] (YUNIKORN-2673) Improve newFilter funtion's test coverage in filter.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2673.
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> Improve newFilter funtion's test coverage in filter.go
>
> Key: YUNIKORN-2673
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2673
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2515) Add property event.RESTResponseSize to the batch event handler
[ https://issues.apache.org/jira/browse/YUNIKORN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2515.
Fix Version/s: 1.6.0
Resolution: Fixed

> Add property event.RESTResponseSize to the batch event handler
>
> Key: YUNIKORN-2515
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2515
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2670) Improve util funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2670.
Fix Version/s: 1.6.0
Resolution: Fixed

> Improve util funtion's test coverage
>
> Key: YUNIKORN-2670
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2670
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Improve the test coverage of the following functions in util.go:
> * ZeroTimeInUnixNano
> * GetNewUUID
> * IsRecoveryQueue
[jira] [Resolved] (YUNIKORN-2669) nil pointer dereference error
[ https://issues.apache.org/jira/browse/YUNIKORN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2669.
Resolution: Duplicate

This looks like a dup of YUNIKORN-2562. The solution for this has been delivered in 1.5.1. It's also on master.

> nil pointer dereference error
>
> Key: YUNIKORN-2669
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2669
> Project: Apache YuniKorn
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Junyoung Park
> Assignee: Peter Bacsko
> Priority: Major
>
> Environment: AWS EKS 1.26
> yunikorn-scheduler logs
> {code:java}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x179b2f5]
> goroutine 50 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc000661000, {0xc008ad14a0, 0x24})
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/objects/application.go:1739 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xc00046a100?, 0xc01436c880)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/partition.go:1281 +0x27f
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc000502680?, {0xc02014da60, 0x1, 0xc0112f5ee8?}, {0xc0060f8980, 0xb})
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:868 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc00046a100?, 0xc0145e8eb0?)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000120990)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:111 +0x16e
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:55 +0x9c
> {code}
[jira] [Resolved] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2637. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed Merged to master & branch-1.5. > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > > The initialisation code is a two-step process for pods. The first step lists all pods and adds them to the system in registerPods(), which returns the list of pods processed. > The second step happens after event handlers are turned on and nodes have been cleaned up. During the second step, pods from the first step are checked and removed. However, pods that were already in a terminated state in step 1 get removed again. Although the removal should be idempotent, this is unneeded: when iterating over the existing pods, any pod in a terminal state should be skipped.
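The skip described in YUNIKORN-2637 can be sketched roughly as follows. This is only an illustration, not the code merged for the ticket: the `pod` and `podPhase` types are hypothetical local stand-ins for the Kubernetes pod types, and `podsToFinalize` is an invented helper name.

```go
package main

import "fmt"

// podPhase and pod are hypothetical simplifications of the Kubernetes pod
// types, kept local so the sketch is self-contained.
type podPhase string

const (
	podRunning   podPhase = "Running"
	podSucceeded podPhase = "Succeeded"
	podFailed    podPhase = "Failed"
)

type pod struct {
	name  string
	phase podPhase
}

// isTerminated reports whether the pod has reached a terminal phase.
func isTerminated(p pod) bool {
	return p.phase == podSucceeded || p.phase == podFailed
}

// podsToFinalize returns the pods from the first pass that still need to be
// checked in the second pass, skipping any that were already terminal.
func podsToFinalize(registered []pod) []pod {
	pending := make([]pod, 0, len(registered))
	for _, p := range registered {
		if isTerminated(p) {
			continue
		}
		pending = append(pending, p)
	}
	return pending
}

func main() {
	pods := []pod{{"a", podRunning}, {"b", podSucceeded}, {"c", podFailed}}
	for _, p := range podsToFinalize(pods) {
		fmt.Println(p.name) // prints only "a"
	}
}
```

Filtering before the second pass keeps the removal path from running twice for pods that registerPods() had already treated as finished.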
[jira] [Resolved] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails
[ https://issues.apache.org/jira/browse/YUNIKORN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2668. Fix Version/s: 1.6.0 Resolution: Fixed > Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails > > > Key: YUNIKORN-2668 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2668 > Project: Apache YuniKorn > Issue Type: Task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails > due to a deadlock problem described in YUNIKORN-2629. Until that ticket is > resolved, let's disable this test for the time being, so upstream tests don't > fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails
Peter Bacsko created YUNIKORN-2668: -- Summary: Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails Key: YUNIKORN-2668 URL: https://issues.apache.org/jira/browse/YUNIKORN-2668 Project: Apache YuniKorn Issue Type: Task Reporter: Peter Bacsko Assignee: Peter Bacsko The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails due to a deadlock problem described in YUNIKORN-2629. Until that ticket is resolved, let's disable this test for the time being, so upstream tests don't fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2561) Support topology spread constraints on placeholder pods
[ https://issues.apache.org/jira/browse/YUNIKORN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2561. Fix Version/s: 1.6.0 Resolution: Fixed > Support topology spread constraints on placeholder pods > --- > > Key: YUNIKORN-2561 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2561 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Jacob Salway >Assignee: Jacob Salway >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > If a pod has a topology spread constraint with a `whenUnsatisfiable: > DoNotSchedule` constraint and is used as part of a task group, it is not > possible to pass the constraint to the placeholder pods created by Yunikorn. > This can result in placeholder pods being placed on a node that would violate > the original pod's topology spread constraint. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2643) utils.go WaitForCondition improvement
[ https://issues.apache.org/jira/browse/YUNIKORN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2643. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. Thanks [~mean-world] for the contribution. > utils.go WaitForCondition improvement > -- > > Key: YUNIKORN-2643 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2643 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: HUAN-IU LIOU >Assignee: HUAN-IU LIOU >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2663) Improve ACL struct function's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2663. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Improve ACL struct function's test coverage > -- > > Key: YUNIKORN-2663 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2663 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Remove unreachable code in the NewACL func. > Improve the following functions' test coverage in acl.go: > * TestSetUsers > * TestSetGroups
[jira] [Resolved] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO
[ https://issues.apache.org/jira/browse/YUNIKORN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2666. Fix Version/s: 1.6.0 Resolution: Fixed > Fix DeepEqual comparison in Test_fixedRule_ruleDAO > --- > > Key: YUNIKORN-2666 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2666 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler, test - unit >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the > non-deterministic nature of map key iteration: > {noformat} > fixed_rule_test.go:285: assertion failed: > --- tt.want > +++ ruleDAO > { > Name: "fixed", > Parameters: {"create": "true", "qualified": "false", > "queue": "default"}, > Filter: { > Type: "allow", > UserList: nil, > GroupList: []string{ > - "group1", > + "group2", > - "group2", > + "group1", > }, > UserExp: "", > GroupExp: "", > }, > ParentRule: nil, > } > {noformat} > We use {{maps.Keys()}} when we create the user list and group list in > {{FilterDAO}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO
Peter Bacsko created YUNIKORN-2666: -- Summary: Fix DeepEqual comparison in Test_fixedRule_ruleDAO Key: YUNIKORN-2666 URL: https://issues.apache.org/jira/browse/YUNIKORN-2666 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler, test - unit Reporter: Peter Bacsko The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the non-deterministic nature of map key iteration: {noformat} fixed_rule_test.go:285: assertion failed: --- tt.want +++ ruleDAO { Name: "fixed", Parameters: {"create": "true", "qualified": "false", "queue": "default"}, Filter: { Type: "allow", UserList: nil, GroupList: []string{ - "group1", + "group2", - "group2", + "group1", }, UserExp: "", GroupExp: "", }, ParentRule: nil, } {noformat} We use {{maps.Keys()}} when we create the user list and group list in {{FilterDAO}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
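To illustrate the failure mode in YUNIKORN-2666: Go does not guarantee any order when iterating over a map, so a slice built from map keys can come out in a different order on each run and break a DeepEqual comparison. A minimal sketch of one common remedy, sorting the keys before they are compared, is below. The helper name `sortedKeys` is hypothetical and this is not necessarily how the ticket was fixed.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys collects the keys of m and returns them in ascending order.
// Without the sort, the order of the returned slice would depend on Go's
// unspecified map iteration order.
func sortedKeys(m map[string]bool) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	groups := map[string]bool{"group2": true, "group1": true}
	fmt.Println(sortedKeys(groups)) // prints [group1 group2]
}
```

An alternative with the same effect is to leave the production code alone and sort both the expected and actual slices inside the test before comparing them.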
[jira] [Resolved] (YUNIKORN-2650) Complete or remove web_server_test#TestProxy
[ https://issues.apache.org/jira/browse/YUNIKORN-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2650. Fix Version/s: 1.6.0 Resolution: Fixed > Complete or remove web_server_test#TestProxy > > > Key: YUNIKORN-2650 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2650 > Project: Apache YuniKorn > Issue Type: Test >Reporter: Chia-Ping Tsai >Assignee: Chenchen Lai >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > web_server_test has an empty test case: TestProxy [0]. It seems to me there is already a proxy-related test [1]. > [0] https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L82 > [1] https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L73
[jira] [Resolved] (YUNIKORN-2514) Update documentation about event.requestCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2514. Fix Version/s: 1.6.0 Resolution: Fixed > Update documentation about event.requestCapacity > > > Key: YUNIKORN-2514 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2514 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: documentation >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2654) Remove unused code in k8shim context
[ https://issues.apache.org/jira/browse/YUNIKORN-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2654. Fix Version/s: 1.6.0 Resolution: Fixed > Remove unused code in k8shim context > > > Key: YUNIKORN-2654 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2654 > Project: Apache YuniKorn > Issue Type: Task > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Chenchen Lai >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > The NotifyApplicationComplete and NotifyApplicationFail function are not > called by anything and are unused code. > The K8shim does not trigger the application completion or failure. This is > triggered by the core when the application no longer has any activity > registered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2647) Flaky test TestUpdateNodeCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2647. Fix Version/s: 1.6.0 Resolution: Fixed > Flaky test TestUpdateNodeCapacity > - > > Key: YUNIKORN-2647 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2647 > Project: Apache YuniKorn > Issue Type: Bug > Components: test - unit >Reporter: Wilfred Spiegelenburg >Assignee: Tseng Hsi-Huang >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > As we saw in YUNIKORN-2573, the single node update test might fail: > {code:java} > --- FAIL: TestUpdateNodeCapacity (0.03s) > operation_test.go:446: Expected partition resource map[memory:1 vcore:2], doesn't match with actual partition resource map[memory:1 vcore:2]{code} > When updating node capacity we calculate the delta resources, and with that delta we update the resources in the partition. > The test fails when the calls happen in the following order, the same as for multiple nodes: > node.SetCapacity() -> waitForAvailableNodeResource() -> partitionInfo.GetTotalPartitionResource() -> partition.updatePartitionResource()
[jira] [Resolved] (YUNIKORN-2659) Improve config validator function's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2659. Fix Version/s: 1.6.0 Resolution: Fixed > Improve config validator function's test coverage > > > Key: YUNIKORN-2659 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2659 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Improve the test coverage of the following functions in configvalidator.go: > * checkPlacementRule > * checkLimitResource > * checkLimit
[jira] [Created] (YUNIKORN-2661) Fix hard-coded boolean in setLimit
Peter Bacsko created YUNIKORN-2661: -- Summary: Fix hard-coded boolean in setLimit Key: YUNIKORN-2661 URL: https://issues.apache.org/jira/browse/YUNIKORN-2661 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}}, so this variable never reaches the leaves:
{noformat}
// Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
	log.Log(log.SchedUGM).Debug("Setting limits",
		zap.String("queue path", qt.queuePath),
		zap.Strings("hierarchy", hierarchy),
		zap.Uint64("max applications", maxApps),
		zap.Stringer("max resources", maxResource),
		zap.Bool("use wild card", useWildCard))
	// depth first: all the way to the leaf, create if not exists
	// more than 1 in the slice means we need to recurse down
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		if qt.childQueueTrackers[childName] == nil {
			qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
		}
		qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false)
		...
{noformat}
Fix this and create a unit test for {{setLimit()}}.
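The problem reported in YUNIKORN-2661 is that the recursive call passes the literal false instead of the caller's flag. A stripped-down sketch of forwarding the flag is shown below. The `tracker` type is a hypothetical stand-in for QueueTracker, and storing the flag on the leaf is an assumption made only so the effect can be observed; what the real function does with the flag at the leaf is not shown in the excerpt.

```go
package main

import "fmt"

// tracker is a hypothetical, stripped-down stand-in for QueueTracker.
type tracker struct {
	name            string
	doWildCardCheck bool
	children        map[string]*tracker
}

func newTracker(name string) *tracker {
	return &tracker{name: name, children: make(map[string]*tracker)}
}

// setLimit walks the hierarchy depth first, creating children as needed, and
// records the flag on the leaf. The caller's doWildCardCheck is forwarded
// unchanged on every recursive call instead of being replaced by a literal.
func (t *tracker) setLimit(hierarchy []string, doWildCardCheck bool) {
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		if t.children[childName] == nil {
			t.children[childName] = newTracker(childName)
		}
		t.children[childName].setLimit(hierarchy[1:], doWildCardCheck)
		return
	}
	t.doWildCardCheck = doWildCardCheck
}

func main() {
	root := newTracker("root")
	root.setLimit([]string{"root", "parent", "leaf"}, true)
	leaf := root.children["parent"].children["leaf"]
	fmt.Println(leaf.name, leaf.doWildCardCheck) // prints "leaf true"
}
```

A unit test along these lines, setting a limit several levels deep and asserting on the leaf, would catch a hard-coded value in the recursive call.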
[jira] [Resolved] (YUNIKORN-2649) Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in resources.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2649. Fix Version/s: 1.6.0 Resolution: Fixed > Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in resources.go > - > > Key: YUNIKORN-2649 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2649 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > >
[jira] [Resolved] (YUNIKORN-2581) Expose running placement rules in REST
[ https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2581. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Expose running placement rules in REST > -- > > Key: YUNIKORN-2581 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2581 > Project: Apache YuniKorn > Issue Type: New Feature > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Since placement rules are now always used and the recovery rule was introduced, the queue config does not correctly show the running rules. > Also, if a config update has been rejected for any reason, the rules shown would not be correct. > Exposing the configured rules from the placement manager works around all these issues.
[jira] [Resolved] (YUNIKORN-2646) Deadlock detected during preemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2646. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed > Deadlock detected during preemption > --- > > Key: YUNIKORN-2646 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2646 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Dmitry >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > Attachments: yunikorn-logs-lock.txt.gz > > > Hitting deadlocks in 1.5.1 > The log is attached -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2542) Consistent logging and tracker handling for increment/decrement
[ https://issues.apache.org/jira/browse/YUNIKORN-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2542. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. Thanks [~Tseng Hsi-Huang] for the contribution. > Consistent logging and tracker handling for increment/decrement > --- > > Key: YUNIKORN-2542 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2542 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Tseng Hsi-Huang >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > We log DEBUG output and use {{GroupTracker}} inconsistently in {{Manager}} and in {{UserTracker}}. > E.g.: > {{Manager.IncreaseTrackedResource()}}: only a single log output at DEBUG level > {{Manager.DecreaseTrackedResource()}}: multiple log statements; it also handles the group tracker, which is not the case with increments > This also affects {{UserTracker}}: log handling is different in {{increaseTrackedResource()}}/{{decreaseTrackedResource()}}
[jira] [Resolved] (YUNIKORN-2567) Remove Application reference from applicationEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2567. Fix Version/s: 1.6.0 Resolution: Fixed > Remove Application reference from applicationEvents > --- > > Key: YUNIKORN-2567 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2567 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2642) Don't set resources on the recovery queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2642. Resolution: Fixed > Don't set resources on the recovery queue > - > > Key: YUNIKORN-2642 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2642 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > > The resource constraints can be set on dynamic queues based on application tags. We should not set them on the recovery queue, because there's no quota on it.
[jira] [Resolved] (YUNIKORN-2635) test coverage improvement: same priority case in sorter
[ https://issues.apache.org/jira/browse/YUNIKORN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2635. Fix Version/s: 1.6.0 Resolution: Fixed > test coverage improvement: same priority case in sorter > > > Key: YUNIKORN-2635 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2635 > Project: Apache YuniKorn > Issue Type: Test > Components: core - scheduler >Reporter: Chen Yu Teng >Assignee: Chen Yu Teng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application
[ https://issues.apache.org/jira/browse/YUNIKORN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2633. Fix Version/s: 1.6.0 Resolution: Fixed > Unnecessary warning from Partition when adding an application > - > > Key: YUNIKORN-2633 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2633 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > The following is printed when adding an application: > {noformat} > 2024-05-17T21:53:04.716+0200 WARN core.scheduler.queue scheduler/partition.go:344 Trying to set resources on a queue that is not an unmanaged leaf {"queueName": "root.default"} > {noformat} > This message is supposed to be printed when the application defines a guaranteed or max resource. After YUNIKORN-2547 it's always printed if the queue is managed.
[jira] [Created] (YUNIKORN-2642) Don't set resources on the recovery queue
Peter Bacsko created YUNIKORN-2642: -- Summary: Don't set resources on the recovery queue Key: YUNIKORN-2642 URL: https://issues.apache.org/jira/browse/YUNIKORN-2642 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko The resource constraints can be set on dynamic queues based on application tags. We should not set them on the recovery queue, because there's no quota on it.
[jira] [Resolved] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2566. Fix Version/s: 1.6.0 Resolution: Fixed > Remove AllocationAsk reference from askEvents > - > > Key: YUNIKORN-2566 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2566 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2565) Remove Node reference from nodeEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2565. Fix Version/s: 1.6.0 Resolution: Fixed > Remove Node reference from nodeEvents > - > > Key: YUNIKORN-2565 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2565 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2618) Streamline AsyncRMCallback UpdateAllocation
[ https://issues.apache.org/jira/browse/YUNIKORN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2618. Fix Version/s: 1.6.0 Resolution: Fixed > Streamline AsyncRMCallback UpdateAllocation > --- > > Key: YUNIKORN-2618 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2618 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Yun Sun >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > If the task is not found, nil is returned from {{context.getTask}}. In the {{response.New}} processing we should just log that fact and proceed to the next alloc. This simplifies the flow, as we never need to check for a nil task afterwards. We should never have a pod in the cache that does not exist as a task on an application. > We retrieve the application using the application ID from the response but never use the object. We only use the application ID to pass into an event. The context event handler then does the exact same lookup again to process the event on the app. > We need to become much smarter in this area: we do double or triple lookups, and generate async events that just change the state of the app or task or kick off another event.
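The "log and proceed to the next alloc" flow described in YUNIKORN-2618 could look roughly like the sketch below. All types and names here (`allocation`, `task`, `taskStore`, `processNew`) are hypothetical simplifications for illustration and are not the shim's actual API.

```go
package main

import (
	"fmt"
	"log"
)

// allocation and task are hypothetical stand-ins for the scheduler response
// entries and the shim's task objects.
type allocation struct {
	appID  string
	taskID string
}

type task struct {
	id string
}

// taskStore is a hypothetical lookup table keyed by application ID, then
// task ID.
type taskStore map[string]map[string]*task

// getTask returns nil when either the application or the task is unknown.
func (s taskStore) getTask(appID, taskID string) *task {
	return s[appID][taskID]
}

// processNew handles each new allocation and returns the IDs of the tasks it
// found. A missing task is logged and skipped, so the code after the lookup
// never needs a nil check.
func processNew(store taskStore, allocs []allocation) []string {
	handled := make([]string, 0, len(allocs))
	for _, alloc := range allocs {
		t := store.getTask(alloc.appID, alloc.taskID)
		if t == nil {
			log.Printf("task not found, skipping allocation: app=%s task=%s", alloc.appID, alloc.taskID)
			continue
		}
		handled = append(handled, t.id)
	}
	return handled
}

func main() {
	store := taskStore{"app-1": {"task-1": {id: "task-1"}}}
	allocs := []allocation{{"app-1", "task-1"}, {"app-1", "missing"}}
	fmt.Println(processNew(store, allocs)) // prints [task-1]
}
```

Resolving the task once and handing the resolved object onward, instead of passing IDs that get looked up again later, is the kind of change that removes the repeated lookups the ticket mentions.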
[jira] [Resolved] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts
[ https://issues.apache.org/jira/browse/YUNIKORN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2611. Fix Version/s: 1.5.1 Resolution: Fixed > [UMBRELLA] YuniKorn 1.5.1 release efforts > - > > Key: YUNIKORN-2611 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2611 > Project: Apache YuniKorn > Issue Type: Task > Components: release >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Blocker > Fix For: 1.5.1 > > > This umbrella is to track the work items needed for the 1.5.1 release. > Release manager: Peter Bacsko. > This release only consists of bug fixes. Use the filter [https://issues.apache.org/jira/issues/?filter=12353383] to see the list of deliverables.
[jira] [Resolved] (YUNIKORN-2614) Update website for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2614. Fix Version/s: 1.5.1 Target Version: 1.5.1 Resolution: Fixed > Update website for 1.5.1 > > > Key: YUNIKORN-2614 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2614 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: release >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2639) Clarify release procedure for minor releases
Peter Bacsko created YUNIKORN-2639: -- Summary: Clarify release procedure for minor releases Key: YUNIKORN-2639 URL: https://issues.apache.org/jira/browse/YUNIKORN-2639 Project: Apache YuniKorn Issue Type: Task Components: release Reporter: Peter Bacsko After the release of 1.5.1, we realized that we need to properly define and document the release process for a minor release. The clarification should cover things like: # What it can and can't include (no features, bug fixes only) # How to publish docs: shall we keep the current "a.b.c" version on the website, or remove it and publish "a.b.c+1"? # Communication: possible differences in release notes, announcement, etc.
[jira] [Created] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application
Peter Bacsko created YUNIKORN-2633: -- Summary: Unnecessary warning from Partition when adding an application Key: YUNIKORN-2633 URL: https://issues.apache.org/jira/browse/YUNIKORN-2633 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko The following is printed when adding an application: {noformat} 2024-05-17T21:53:04.716+0200 WARN core.scheduler.queue scheduler/partition.go:344 Trying to set resources on a queue that is not an unmanaged leaf {"queueName": "root.default"} {noformat} This message is supposed to be printed when the application defines a guaranteed or max resource.
[jira] [Resolved] (YUNIKORN-2613) Release notes for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2613. Fix Version/s: 1.5.1 Resolution: Fixed > Release notes for 1.5.1 > --- > > Key: YUNIKORN-2613 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2613 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2632) Data race in IncAllocatedResource
[ https://issues.apache.org/jira/browse/YUNIKORN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2632. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed > Data race in IncAllocatedResource > - > > Key: YUNIKORN-2632 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2632 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > > After YUNIKORN-2548, we accidentally make an unlocked access to > \{{Queue.allocatedResource}}. > {noformat} > WARNING: DATA RACE > Read at 0x00c000578a00 by goroutine 52: > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032 > +0x6b > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495 > +0x184 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402 > +0x144 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42 > +0x95 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 > +0x6f1 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*node[go.shape.interface { > 
Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*BTreeG[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).Ascend() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 > +0x108 > github.com/google/btree.(*BTree).Ascend() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 > +0x108 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode() > ... > Previous write at 0x00c000578a00 by goroutine 49: > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101 > +0x212 > > github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357 > +0x17b4 > > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870 > +0xba > > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750 > +0x1e4 > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133 > +0x28d > > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1() > > 
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60 > +0x33 > {noformat}
[jira] [Created] (YUNIKORN-2632) Data race in IncAllocatedResource
Peter Bacsko created YUNIKORN-2632: -- Summary: Data race in IncAllocatedResource Key: YUNIKORN-2632 URL: https://issues.apache.org/jira/browse/YUNIKORN-2632 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko After YUNIKORN-2548, we accidentally make an unlocked access to \{{Queue.allocatedResource}}. {noformat} WARNING: DATA RACE Read at 0x00c000578a00 by goroutine 52: github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032 +0x6b github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495 +0x184 github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402 +0x144 github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42 +0x95 github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 +0x6f1 github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448 github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448 github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448 
github.com/google/btree.(*BTreeG[go.shape.interface { Less(github.com/google/btree.Item) bool }]).Ascend() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 +0x108 github.com/google/btree.(*BTree).Ascend() /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 +0x108 github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode() ... Previous write at 0x00c000578a00 by goroutine 49: github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101 +0x212 github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357 +0x17b4 github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870 +0xba github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750 +0x1e4 github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133 +0x28d github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1() /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60 +0x33 {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2629) Adding a node can result in a deadlock
Peter Bacsko created YUNIKORN-2629: -- Summary: Adding a node can result in a deadlock Key: YUNIKORN-2629 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko Adding a new node after Yunikorn state initialization can result in a deadlock. The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
    Nodes: nodesToRegister,
    RmID: schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}
// wait for all responses to accumulate
wg.Wait() <--- shim gets stuck here
{noformat}
If tasks are being processed, the dispatcher tries to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.
[jira] [Resolved] (YUNIKORN-2612) Tagging for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2612. Fix Version/s: 1.5.1 Resolution: Fixed > Tagging for 1.5.1 > - > > Key: YUNIKORN-2612 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2612 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: release >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 1.5.1 > > > Tagging for updating dependencies (SI/core/k8shim). > No branching is needed because we'll deliver the release from branch-1.5 > directly as we did with incubator minor releases.
[jira] [Resolved] (YUNIKORN-2623) Create unit tests for Clients
[ https://issues.apache.org/jira/browse/YUNIKORN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2623. Fix Version/s: 1.6.0 Resolution: Fixed > Create unit tests for Clients > - > > Key: YUNIKORN-2623 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2623 > Project: Apache YuniKorn > Issue Type: Test > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Follow-up on YUNIKORN-2621. > Create proper coverage for {{{}clients.Clients{}}}. See PR comment > https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.
[jira] [Created] (YUNIKORN-2623) Create unit test coverage for Clients
Peter Bacsko created YUNIKORN-2623: -- Summary: Create unit test coverage for Clients Key: YUNIKORN-2623 URL: https://issues.apache.org/jira/browse/YUNIKORN-2623 Project: Apache YuniKorn Issue Type: Test Components: shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko Follow-up on YUNIKORN-2621. Create proper coverage for {{{}clients.Clients{}}}. See PR comment https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.
[jira] [Resolved] (YUNIKORN-2620) Remove redundant variable `errorExpected` from configvalidator_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2620. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Remove redundant variable `errorExpected` from configvalidator_test.go > -- > > Key: YUNIKORN-2620 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2620 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Chia-Ping Tsai >Assignee: Yun Sun >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > This is similar to YUNIKORN-2598. We can check the existence of `validateFunc` > instead of having an extra boolean flag.
[jira] [Created] (YUNIKORN-2614) Update website for 1.5.1
Peter Bacsko created YUNIKORN-2614: -- Summary: Update website for 1.5.1 Key: YUNIKORN-2614 URL: https://issues.apache.org/jira/browse/YUNIKORN-2614 Project: Apache YuniKorn Issue Type: Sub-task Components: release Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2613) Release notes for 1.5.1
Peter Bacsko created YUNIKORN-2613: -- Summary: Release notes for 1.5.1 Key: YUNIKORN-2613 URL: https://issues.apache.org/jira/browse/YUNIKORN-2613 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2612) Tagging for 1.5.1
Peter Bacsko created YUNIKORN-2612: -- Summary: Tagging for 1.5.1 Key: YUNIKORN-2612 URL: https://issues.apache.org/jira/browse/YUNIKORN-2612 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts
Peter Bacsko created YUNIKORN-2611: -- Summary: [UMBRELLA] YuniKorn 1.5.1 release efforts Key: YUNIKORN-2611 URL: https://issues.apache.org/jira/browse/YUNIKORN-2611 Project: Apache YuniKorn Issue Type: Task Components: release Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Resolved] (YUNIKORN-2600) Update K8s dependency to 1.29.4
[ https://issues.apache.org/jira/browse/YUNIKORN-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2600. Fix Version/s: 1.6.0 1.5.1 Resolution: Fixed Merged to master and branch-1.5. > Update K8s dependency to 1.29.4 > --- > > Key: YUNIKORN-2600 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2600 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > A security vulnerability was fixed in 1.29.4. Update K8s dependency to this > version.
[jira] [Created] (YUNIKORN-2602) Fix spelling/grammar in configvalidator
Peter Bacsko created YUNIKORN-2602: -- Summary: Fix spelling/grammar in configvalidator Key: YUNIKORN-2602 URL: https://issues.apache.org/jira/browse/YUNIKORN-2602 Project: Apache YuniKorn Issue Type: Improvement Components: core - common Reporter: Peter Bacsko Let's fix some minor grammar issues in configvalidator, e.g. "existed" -> "existing"; there could be other mistakes as well.
[jira] [Created] (YUNIKORN-2600) Update K8s dependency to 1.29.4
Peter Bacsko created YUNIKORN-2600: -- Summary: Update K8s dependency to 1.29.4 Key: YUNIKORN-2600 URL: https://issues.apache.org/jira/browse/YUNIKORN-2600 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko A security vulnerability was fixed in 1.29.4. Update K8s dependency to this version.
[jira] [Resolved] (YUNIKORN-2472) REST API returns subtree by default
[ https://issues.apache.org/jira/browse/YUNIKORN-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2472. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > REST API returns subtree by default > --- > > Key: YUNIKORN-2472 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2472 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common, documentation >Affects Versions: 1.5.0 >Reporter: Wilfred Spiegelenburg >Assignee: Ted Lin >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > The subtree query parameter is interpreted as the opposite of what would be > expected. > If you call {{/ws/v1/partition/default/queue/root?subtree}} then you do not > get the subtree. If you call {{/ws/v1/partition/default/queue/root}} you get > the whole tree rooted at root. > We have not documented the new API yet, so before we add it to the docs we > should fix the behaviour: > * subtree given: return the whole tree > * subtree missing: return one level > The code fix is as simple as a ! in a single call and inverting the test > cases to pass or not pass {{?subtree}}
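The intended behaviour (parameter present: whole tree, parameter absent: one level) depends only on whether {{subtree}} appears in the query string, not on its value. A minimal sketch using the standard {{net/url}} package follows; the helper name {{wantsSubtree}} is hypothetical and this is not the yunikorn-core handler code.

```go
package main

import (
	"fmt"
	"net/url"
)

// wantsSubtree reports whether the "subtree" query parameter is present.
// A bare "?subtree" (no value) counts as present.
func wantsSubtree(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return u.Query().Has("subtree"), nil
}

func main() {
	with, _ := wantsSubtree("/ws/v1/partition/default/queue/root?subtree")
	without, _ := wantsSubtree("/ws/v1/partition/default/queue/root")
	fmt.Println(with, without) // prints: true false
}
```

{{url.Values.Has}} is available from Go 1.17; on older toolchains the same check can be written as a map lookup on the parsed values.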
[jira] [Resolved] (YUNIKORN-2573) Flaky test TestUpdateNodeCapacityWithMultipleNodes
[ https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2573. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Flaky test TestUpdateNodeCapacityWithMultipleNodes > -- > > Key: YUNIKORN-2573 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2573 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Arthur Wang >Assignee: Arthur Wang >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > [github > pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801] > GitHub CI occasionally fails. > > Root cause: > [https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/context.go#L665] > > {code:java} > partition.updatePartitionResource(node.SetCapacity(resources.NewResourceFromProto(sr))) > {code} > > We calculate the delta resources by updating the node capacity. > Then we update the resources map in the partition. > The test fails when the calls happen in the following order: > node.SetCapacity() -> > [waitForAvailableNodeResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L520] > -> > [partitionInfo.GetTotalPartitionResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L525] > -> partition.updatePartitionResource()
[jira] [Created] (YUNIKORN-2599) AppStateChange/AppTaskCompleted event cannot be handled in many states
Peter Bacsko created YUNIKORN-2599: -- Summary: AppStateChange/AppTaskCompleted event cannot be handled in many states Key: YUNIKORN-2599 URL: https://issues.apache.org/jira/browse/YUNIKORN-2599 Project: Apache YuniKorn Issue Type: Bug Components: shim - yarn Reporter: Peter Bacsko After YUNIKORN-2597 got merged, it became clear that we keep sending an {{AppStateChange}} event which cannot be handled by the state machine. There isn't any state in the FSM object which would actually be able to process this event. {{AppTaskCompleted}} is very similar, it is only processed in {{Resuming}} state. If someone runs the test case TestApplicationScheduling, the following errors are displayed: {noformat} [...] 2024-05-02T18:08:14.856+0200ERROR shim.contextcache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"} github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316 github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123 github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225 2024-05-02T18:08:14.856+0200INFOcore.scheduler.application [...] 
2024-05-02T18:08:14.857+0200INFOcore.scheduler.partition scheduler/partition.go:928 scheduler allocation processed {"appID": "app0001", "allocationKey": "task0002", "allocatedResource": "map[memory:1000 pods:1 vcore:1]", "placeholder": false, "targetNode": "test.host.02"} 2024-05-02T18:08:14.857+0200ERROR shim.contextcache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"} github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316 github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123 github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225 [...] 2024-05-02T18:08:15.856+0200INFOshim.fsmcache/task_state.go:380 Task state transition {"app": "app0001", "task": "task0001", "taskAlias": "default/task0001", "source": "Bound", "destination": "Completed", "event": "CompleteTask"} 2024-05-02T18:08:15.856+0200ERROR shim.contextcache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"} github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316 github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123 github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225 [...] 
2024-05-02T18:08:16.858+0200INFOshim.fsmcache/task_state.go:380 Task state transition {"app": "app0001", "task": "task0002", "taskAlias": "default/task0002", "source": "Bound", "destination": "Completed", "event": "CompleteTask"} 2024-05-02T18:08:16.858+0200ERROR shim.contextcache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"} github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316 github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123 github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225 [...] 2024-05-02T18:08:16.859+0200ERROR shim.contextcache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"} github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1 /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316 github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
[jira] [Resolved] (YUNIKORN-2597) Improve error messages in Context
[ https://issues.apache.org/jira/browse/YUNIKORN-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2597. Fix Version/s: 1.6.0 1.5.1 Resolution: Fixed Merged to master & branch-1.5. > Improve error messages in Context > - > > Key: YUNIKORN-2597 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2597 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > The logging in {{cache.Context}} related to event handling needs some > improvement: > 1) When an error occurs while the task event handler is being retrieved, it > logs "failed to handle application event" and the task ID is omitted, which > makes debugging harder. > 2) If {{canHandle()}} returns false, we don't do anything, just return. > Again, this makes debugging much harder.
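Point 2 above can be sketched as follows: instead of returning silently when an event cannot be handled, the handler reports the application ID, the event and the current state. This is only an illustration. The {{handledIn}} table, the string-typed states and events, and the function names are assumptions; the only pairing taken from the tickets is that {{AppTaskCompleted}} is processed in the {{Resuming}} state.

```go
package main

import "fmt"

// handledIn lists, per event, the states in which that event can be processed.
// Hypothetical table for illustration; "AppStateChange" deliberately has no
// entry, so no state can process it.
var handledIn = map[string][]string{
	"AppTaskCompleted": {"Resuming"},
}

// canHandle reports whether the event is valid in the given state.
func canHandle(event, state string) bool {
	for _, s := range handledIn[event] {
		if s == state {
			return true
		}
	}
	return false
}

// dispatch returns a descriptive error when the event cannot be handled,
// rather than returning silently.
func dispatch(appID, event, state string) error {
	if !canHandle(event, state) {
		return fmt.Errorf("application event cannot be handled in the current state: applicationID=%s event=%s state=%s",
			appID, event, state)
	}
	// the real state transition would happen here
	return nil
}

func main() {
	fmt.Println(dispatch("app0001", "AppStateChange", "Running"))
	fmt.Println(dispatch("app0001", "AppTaskCompleted", "Resuming"))
}
```

Returning an error (or logging at this point) keeps the identifiers next to the failure, which is what makes the rejected events visible in the logs.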
[jira] [Resolved] (YUNIKORN-2518) Allow recovery queue in REST requests
[ https://issues.apache.org/jira/browse/YUNIKORN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2518. Fix Version/s: 1.6.0 Resolution: Fixed > Allow recovery queue in REST requests > - > > Key: YUNIKORN-2518 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2518 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Ted Lin >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > The current checks for the REST requests that require a queue path to be > provided prevent looking at the {{root.@recover@}} queue. > The validator filters the queue names, which makes it impossible to check if > the queue has any running applications or pods after initialisation using the > REST requests.
[jira] [Created] (YUNIKORN-2597) Fix error messages in Context
Peter Bacsko created YUNIKORN-2597: -- Summary: Fix error messages in Context Key: YUNIKORN-2597 URL: https://issues.apache.org/jira/browse/YUNIKORN-2597 Project: Apache YuniKorn Issue Type: Improvement Components: shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko The logging in Context related to event handling needs some improvement: 1) When an error occurs while the task event handler is being retrieved, it logs "failed to handle application event" and the task ID is omitted, which makes debugging harder. 2) If {{canHandle()}} returns false, we don't do anything, just return. Again, this makes debugging much harder.
[jira] [Resolved] (YUNIKORN-2297) Update the unit test for CheckQueuesStructure
[ https://issues.apache.org/jira/browse/YUNIKORN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2297. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Update the unit test for CheckQueuesStructure > - > > Key: YUNIKORN-2297 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2297 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - common, test - unit >Reporter: Kuan Po Tseng >Assignee: Yun Sun >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > As discussed in > [https://github.com/apache/yunikorn-core/pull/763#discussion_r1439053430] > there are many methods not covered by configvalidator_test.go, e.g. > checkQueuesStructure, checkQueues, etc.
[jira] [Resolved] (YUNIKORN-2583) Possible log spew on DEBUG level from objects.Node
[ https://issues.apache.org/jira/browse/YUNIKORN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2583. Resolution: Won't Do > Possible log spew on DEBUG level from objects.Node > -- > > Key: YUNIKORN-2583 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2583 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - common >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > > When a predicate fails for a given node, we print a message on DEBUG level. > The problem is, we keep doing this in every scheduling cycle, flooding the > logs: > {noformat} > 2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 > running predicates failed {"allocationKey": > "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", > "allocateFlag": true, "error": "predicates were not running because pod or > node was not found in cache"} > 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 > running predicates failed {"allocationKey": > "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", > "allocateFlag": true, "error": "predicates were not running because pod or > node was not found in cache"} > 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 > running predicates failed {"allocationKey": > "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", > "allocateFlag": true, "error": "predicates were not running because pod or > node was not found in cache"} > {noformat} > Another problematic part is {{preAllocateCheck()}} when we have an allocation ask > with a zero resource: > {noformat} > func (sn *Node) preAllocateCheck(res *resources.Resource, resKey string) bool { > // cannot allocate zero or negative resource > if !resources.StrictlyGreaterThanZero(res) { > log.Log(log.SchedNode).Debug("pre alloc check: requested resource is zero", > zap.String("nodeID", sn.NodeID)) <-- will be printed from every node > return false > } > ... > {noformat} > We need to reduce the amount of output with RateLimitedLogger.
[jira] [Created] (YUNIKORN-2583) Possible log spew on DEBUG level when predicates are evaluated
Peter Bacsko created YUNIKORN-2583: -- Summary: Possible log spew on DEBUG level when predicates are evaluated Key: YUNIKORN-2583 URL: https://issues.apache.org/jira/browse/YUNIKORN-2583 Project: Apache YuniKorn Issue Type: Bug Components: core - common Reporter: Peter Bacsko Assignee: Peter Bacsko When a predicate fails for a given node, we print a message on DEBUG level. The problem is, we keep doing this in every scheduling cycle, flooding the logs: {noformat} 2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"} 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"} 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"} {noformat} We need to reduce the amount of output with the RateLimitedLogger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
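One way to cut repeated DEBUG output like the lines above is to let a message through at most once per interval. The following is only a minimal sketch of that idea: `rateLimitedPrinter`, `manualClock` and `spewInterval` are hypothetical names made up for illustration and are not the API of YuniKorn's RateLimitedLogger.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spewInterval is an assumed value; the ticket does not name one.
const spewInterval = time.Minute

// rateLimitedPrinter is a hypothetical helper, not YuniKorn's RateLimitedLogger.
// It lets a message through at most once per interval.
type rateLimitedPrinter struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
	hasLast  bool
	now      func() time.Time // injectable clock so the behaviour can be checked
}

func newRateLimitedPrinter(interval time.Duration, now func() time.Time) *rateLimitedPrinter {
	return &rateLimitedPrinter{interval: interval, now: now}
}

// allow reports whether a message may be emitted now and, if so, records the time.
func (p *rateLimitedPrinter) allow() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	t := p.now()
	if p.hasLast && t.Sub(p.last) < p.interval {
		return false
	}
	p.last = t
	p.hasLast = true
	return true
}

// debugf prints only when the limiter allows it and reports whether it printed.
func (p *rateLimitedPrinter) debugf(format string, args ...any) bool {
	if !p.allow() {
		return false
	}
	fmt.Printf("DEBUG "+format+"\n", args...)
	return true
}

// manualClock is a test clock that only moves when advance is called.
type manualClock struct{ t time.Time }

func (c *manualClock) now() time.Time          { return c.t }
func (c *manualClock) advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	p := newRateLimitedPrinter(spewInterval, time.Now)
	// Simulates the same predicate failure being reported in three scheduling cycles.
	for i := 0; i < 3; i++ {
		p.debugf("running predicates failed nodeID=%s", "kwok-node-zzn7w")
	}
}
```

With a single limiter per call site, the per-node message from `preAllocateCheck()` would also collapse to one line per interval; whether that granularity is acceptable is a design choice the ticket leaves open.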
[jira] [Resolved] (YUNIKORN-2544) [UMBRELLA] Fix Yunikorn potential locking issues
[ https://issues.apache.org/jira/browse/YUNIKORN-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2544. Resolution: Fixed All subtasks have been resolved, closing ticket. > [UMBRELLA] Fix Yunikorn potential locking issues > > > Key: YUNIKORN-2544 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2544 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > > The Go tool [go-deadlock|https://github.com/sasha-s/go-deadlock/] identified several potential deadlocks in YuniKorn. > Some of these do not cause problems right now, but a lock-related change in the future can trigger a deadlock.
[jira] [Created] (YUNIKORN-2574) totalPartitionResource should not be mutated with AddTo/SubFrom
Peter Bacsko created YUNIKORN-2574: -- Summary: totalPartitionResource should not be mutated with AddTo/SubFrom Key: YUNIKORN-2574 URL: https://issues.apache.org/jira/browse/YUNIKORN-2574 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Affects Versions: 1.5.0, 1.4.0 Reporter: Peter Bacsko Assignee: Peter Bacsko There is a potential data race in PartitionContext: the field "totalPartitionResource" is mutated in place. The problem is that the method {{GetTotalPartitionResource()}} does not clone it. {noformat} func (pc *PartitionContext) GetTotalPartitionResource() *resources.Resource { pc.RLock() defer pc.RUnlock() return pc.totalPartitionResource } {noformat} In general, we should prefer the immutable approach for variables like this, just like in {{objects.Queue}}: {noformat} func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported bool) error { // check this queue: failure stops checks if the allocation is not part of a node addition newAllocated := resources.Add(sq.allocatedResource, alloc) [ ... removed ... ] sq.Lock() defer sq.Unlock() // all OK update this queue sq.allocatedResource = newAllocated sq.updateAllocatedResourceMetrics() return nil } // incPendingResource increments pending resource of this queue and its parents. func (sq *Queue) incPendingResource(delta *resources.Resource) { // update the parent if sq.parent != nil { sq.parent.incPendingResource(delta) } // update this queue sq.Lock() defer sq.Unlock() sq.pending = resources.Add(sq.pending, delta) sq.updatePendingResourceMetrics() } {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
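The difference between handing out the internal pointer and the immutable approach can be sketched in a few lines. This is an illustration under assumptions, not the actual fix: `resource`, `partition`, `getTotal` and `addNodeCapacity` are hypothetical stand-ins for `resources.Resource` and `PartitionContext`.

```go
package main

import (
	"fmt"
	"sync"
)

// resource is a hypothetical stand-in for resources.Resource.
type resource map[string]int64

// clone returns an independent copy of the resource.
func (r resource) clone() resource {
	c := make(resource, len(r))
	for k, v := range r {
		c[k] = v
	}
	return c
}

// add returns a new resource and leaves both inputs untouched,
// in the spirit of resources.Add as opposed to AddTo.
func add(a, b resource) resource {
	out := a.clone()
	for k, v := range b {
		out[k] += v
	}
	return out
}

// partition is a hypothetical stand-in for PartitionContext.
type partition struct {
	sync.RWMutex
	total resource
}

// getTotal returns a copy, so callers never share the map owned by the partition.
func (p *partition) getTotal() resource {
	p.RLock()
	defer p.RUnlock()
	return p.total.clone()
}

// addNodeCapacity replaces the field with a freshly computed value
// instead of mutating the existing one in place.
func (p *partition) addNodeCapacity(delta resource) {
	p.Lock()
	defer p.Unlock()
	p.total = add(p.total, delta)
}

func main() {
	p := &partition{total: resource{"vcore": 4}}
	snapshot := p.getTotal()
	p.addNodeCapacity(resource{"vcore": 2})
	fmt.Println(snapshot["vcore"], p.getTotal()["vcore"]) // prints "4 6"
}
```

Once every update replaces the value rather than mutating it, a previously returned value can no longer change underneath a reader; the clone in the getter additionally protects the partition from callers that modify what they receive.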
[jira] [Resolved] (YUNIKORN-2562) Nil pointer panic in Application.ReplaceAllocation()
[ https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2562. Fix Version/s: 1.6.0 1.5.1 Resolution: Fixed > Nil pointer panic in Application.ReplaceAllocation() > > > Key: YUNIKORN-2562 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2562 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > The following panic was generated during placeholder replacement: > {noformat} > 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1} > 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"} > 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"} > 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"} > panic: runtime error: invalid memory address or nil pointer dereference > [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255] > goroutine 117 [running]: > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24}) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615 > github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9}) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5 > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540) > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5 > created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1 > github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c > {noformat}
[jira] [Created] (YUNIKORN-2568) Move all xxxEvents types to objects/events
Peter Bacsko created YUNIKORN-2568: -- Summary: Move all xxxEvents types to objects/events Key: YUNIKORN-2568 URL: https://issues.apache.org/jira/browse/YUNIKORN-2568 Project: Apache YuniKorn Issue Type: Sub-task Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents
Peter Bacsko created YUNIKORN-2566: -- Summary: Remove AllocationAsk reference from askEvents Key: YUNIKORN-2566 URL: https://issues.apache.org/jira/browse/YUNIKORN-2566 Project: Apache YuniKorn Issue Type: Sub-task Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2567) Remove Application reference from applicationEvents
Peter Bacsko created YUNIKORN-2567: -- Summary: Remove Application reference from applicationEvents Key: YUNIKORN-2567 URL: https://issues.apache.org/jira/browse/YUNIKORN-2567 Project: Apache YuniKorn Issue Type: Sub-task Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package
Peter Bacsko created YUNIKORN-2564: -- Summary: [Umbrella] Move xxxEvents types to a different package Key: YUNIKORN-2564 URL: https://issues.apache.org/jira/browse/YUNIKORN-2564 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko There are several event types that can be moved to a different package: * queueEvents * applicationEvents * askEvents * nodeEvents There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity to clean it up a bit and move these types under e.g. {{pkg/scheduler/objects/events}}.
[jira] [Created] (YUNIKORN-2565) Remove Node reference from nodeEvents
Peter Bacsko created YUNIKORN-2565: -- Summary: Remove Node reference from nodeEvents Key: YUNIKORN-2565 URL: https://issues.apache.org/jira/browse/YUNIKORN-2565 Project: Apache YuniKorn Issue Type: Sub-task Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2563) [shim] Enable deadlock detection during unit tests
Peter Bacsko created YUNIKORN-2563: -- Summary: [shim] Enable deadlock detection during unit tests Key: YUNIKORN-2563 URL: https://issues.apache.org/jira/browse/YUNIKORN-2563 Project: Apache YuniKorn Issue Type: Sub-task Components: shim - kubernetes, test - unit Reporter: Peter Bacsko
[jira] [Resolved] (YUNIKORN-2553) Enable deadlock detection during unit tests
[ https://issues.apache.org/jira/browse/YUNIKORN-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2553. Fix Version/s: 1.6.0 1.5.1 Target Version: 1.6.0, 1.5.1 Resolution: Fixed > Enable deadlock detection during unit tests > --- > > Key: YUNIKORN-2553 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2553 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler, shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > To enhance testing, we can enable deadlock detection when the unit tests are > executed with "make test". > This gives us instant feedback about locking issues.
[jira] [Created] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()
Peter Bacsko created YUNIKORN-2562: -- Summary: Nil pointer in Application.ReplaceAllocation() Key: YUNIKORN-2562 URL: https://issues.apache.org/jira/browse/YUNIKORN-2562 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko The following panic was generated during placeholder release: {noformat} 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 1} 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "application-spark-abrdrsmo8no2", "task": "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": "Bound", "destination": "Completed", "event": "CompleteTask"} 2024-04-16T13:46:58.584Z INFO core.scheduler.application objects/application.go:616 ask removed successfully from application {"appID": "application-spark-abrdrsmo8no2", "ask": "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"} 2024-04-16T13:46:58.584Z INFO core.scheduler.partition scheduler/partition.go:1281 replacing placeholder allocation {"appID": "application-spark-abrdrsmo8no2", "allocationID": "cd73be15-af61-4248-89e1-d3296e72214e"} panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255] goroutine 117 [running]: github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600, {0xc007710cf0, 0x24}) github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745 +0x615 github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?, 0xc009786700) github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?, {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9}) github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 +0x9e github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?, 0xc0071a3f10?) github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 +0xa5 github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540) github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 +0x1c5 created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1 github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 +0x9c {noformat}
[jira] [Resolved] (YUNIKORN-2552) Recursive locking when sending remove queue event
[ https://issues.apache.org/jira/browse/YUNIKORN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2552. Fix Version/s: 1.6.0 1.5.1 Resolution: Fixed > Recursive locking when sending remove queue event > - > > Key: YUNIKORN-2552 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2552 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > When sending a queue event from {{queueEvents}}, we acquire the read lock > again. > {noformat} > objects.(*Queue).IsManaged { sq.RLock() } < > objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { } > objects.(*queueEvents).sendRemoveQueueEvent { } } > objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() } > scheduler.(*partitionManager).cleanQueues { // all OK update the queue > hierarchy and partition } > scheduler.(*partitionManager).cleanQueues { if children := > queue.GetCopyOfChildren(); len(children) != 0 { } > scheduler.(*partitionManager).cleanRoot { > manager.cleanQueues(manager.pc.root) } > {noformat} > {{RemoveQueue()}} already has the read lock.
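Go's `sync.RWMutex` documentation prohibits recursive read locking: if another goroutine calls `Lock()` between the two `RLock()` calls, the second `RLock()` blocks behind the waiting writer while the first read lock is still held. A common way to avoid re-acquiring the lock is to split a method into a locking wrapper and a helper that assumes the lock is already held. The sketch below shows that pattern only; the `queue` type and the helper names are hypothetical, and the ticket does not say the actual fix took this form.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical, stripped-down stand-in for objects.Queue.
type queue struct {
	sync.RWMutex
	name    string
	managed bool
}

// IsManaged takes the read lock; it is meant for callers that hold no lock on the queue.
func (q *queue) IsManaged() bool {
	q.RLock()
	defer q.RUnlock()
	return q.isManagedInternal()
}

// isManagedInternal assumes the caller already holds the queue lock.
func (q *queue) isManagedInternal() bool {
	return q.managed
}

// removeEventLocked builds the event text. It runs with the lock held,
// so it must only use helpers that do not lock again.
func (q *queue) removeEventLocked() string {
	return fmt.Sprintf("remove queue %s (managed=%t)", q.name, q.isManagedInternal())
}

// RemoveQueue acquires the read lock exactly once for the whole operation.
func (q *queue) RemoveQueue() string {
	q.RLock()
	defer q.RUnlock()
	return q.removeEventLocked()
}

func main() {
	q := &queue{name: "root.a", managed: true}
	fmt.Println(q.RemoveQueue()) // prints "remove queue root.a (managed=true)"
}
```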
[jira] [Resolved] (YUNIKORN-2550) Fix locking in PartitionContext
[ https://issues.apache.org/jira/browse/YUNIKORN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2550. Fix Version/s: 1.6.0 1.5.1 Target Version: 1.6.0, 1.5.1 Resolution: Fixed > Fix locking in PartitionContext > --- > > Key: YUNIKORN-2550 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2550 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - common >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Possible deadlock was detected: > {noformat} > placement.(*AppPlacementManager).initialise { m.Lock() } < > placement.(*AppPlacementManager).initialise { } } > placement.(*AppPlacementManager).UpdateRules { > log.Log(log.Config).Info("Building new rule list for placement manager") } > scheduler.(*PartitionContext).updatePartitionDetails { err := > pc.placementManager.UpdateRules(conf.PlacementRules) } > scheduler.(*ClusterContext).updateSchedulerConfig { err = > part.updatePartitionDetails(p) } > scheduler.(*ClusterContext).processRMConfigUpdateEvent { err = > cc.updateSchedulerConfig(conf, rmID) } > scheduler.(*Scheduler).handleRMEvent { case *rmevent.RMConfigUpdateEvent: } > scheduler.(*PartitionContext).GetQueue { pc.RLock() } < > scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) > GetQueue(name string) *objects.Queue { } > placement.(*providedRule).placeApplication { // if we cannot create the queue > must exist } > placement.(*AppPlacementManager).PlaceApplication { queueName, err = > checkRule.placeApplication(app, m.queueFn) } > scheduler.(*PartitionContext).AddApplication { err := > pc.getPlacementManager().PlaceApplication(app) } > scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := > objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) } > scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: } > {noformat} > Lock order is different 
between {{PartitionContext}} and > {{AppPlacementManager}}. > There's also an interference between {{PartitionContext}} and an > {{Application}} object: > {noformat} > objects.(*Application).SetTerminatedCallback { sa.Lock() } < > objects.(*Application).SetTerminatedCallback { func (sa *Application) > SetTerminatedCallback(callback func(appID string)) { } > scheduler.(*PartitionContext).AddApplication { > app.SetTerminatedCallback(pc.moveTerminatedApp) } > scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := > objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) } > scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: } > scheduler.(*PartitionContext).GetNode { pc.RLock() } < > scheduler.(*PartitionContext).GetNode { func (pc *PartitionContext) > GetNode(nodeID string) *objects.Node { } > objects.(*Application).tryPlaceholderAllocate { // resource usage should not > change anyway between placeholder and real one at this point } > objects.(*Queue).TryPlaceholderAllocate { for _, app := range > sq.sortApplications(true) { } > objects.(*Queue).TryPlaceholderAllocate { for _, child := range > sq.sortQueues() { } > scheduler.(*PartitionContext).tryPlaceholderAllocate { alloc := > pc.root.TryPlaceholderAllocate(pc.GetNodeIterator, pc.GetNode) } > scheduler.(*ClusterContext).schedule { // nothing reserved that can be > allocated try normal allocate } > scheduler.(*Scheduler).MultiStepSchedule { // Note, this sleep only works in > tests. } > tests.TestDupReleasesInGangScheduling { // and it waits for the shim's > confirmation } > {noformat} > There's no need to have a locked access for {{PartitionContext.nodes}}. The > base implementation of {{NodeCollection}} ({{baseNodeCollection}}) is already > internally synchronized. The "nodes" field is set once. Therefore, no locking > is necessary when accessing it. 
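The last point, that a field which is set once and points to an internally synchronized collection needs no outer lock, can be illustrated as follows. The types are hypothetical stand-ins (`nodeCollection` for `baseNodeCollection`, `partition` for `PartitionContext`), and the sketch assumes the `nodes` pointer is never reassigned after construction.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCollection is a hypothetical stand-in for baseNodeCollection:
// it guards its own map, so callers need no extra lock.
type nodeCollection struct {
	mu    sync.RWMutex
	nodes map[string]string // node ID -> host name
}

func newNodeCollection() *nodeCollection {
	return &nodeCollection{nodes: make(map[string]string)}
}

func (c *nodeCollection) add(id, host string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[id] = host
}

func (c *nodeCollection) get(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	host, ok := c.nodes[id]
	return host, ok
}

// partition is a hypothetical stand-in for PartitionContext. The embedded
// lock still guards other mutable fields; nodes is assigned once in
// newPartition and never reassigned.
type partition struct {
	sync.RWMutex
	nodes *nodeCollection
}

func newPartition() *partition {
	return &partition{nodes: newNodeCollection()}
}

// getNode does not take the partition lock: the pointer never changes
// and the collection synchronizes itself.
func (p *partition) getNode(id string) (string, bool) {
	return p.nodes.get(id)
}

func main() {
	p := newPartition()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		id := fmt.Sprintf("node-%d", i)
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.nodes.add(id, "host-"+id)
		}()
	}
	wg.Wait()
	host, ok := p.getNode("node-3")
	fmt.Println(host, ok) // prints "host-node-3 true"
}
```

Because `getNode` no longer touches the partition lock, it can be called from code that already holds an `Application` or `Queue` lock without adding a new edge to the lock-order graph.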
[jira] [Created] (YUNIKORN-2554) Remove "rules" field from PartitionContext
Peter Bacsko created YUNIKORN-2554: -- Summary: Remove "rules" field from PartitionContext Key: YUNIKORN-2554 URL: https://issues.apache.org/jira/browse/YUNIKORN-2554 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko The "rules" field inside the PartitionContext is obsolete. It is set but nothing reads it. It can also become out of sync with the contents of the placement manager.
[jira] [Resolved] (YUNIKORN-2549) Fixing lint issues
[ https://issues.apache.org/jira/browse/YUNIKORN-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2549. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Fixing lint issues > -- > > Key: YUNIKORN-2549 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2549 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Michael Akinyemi >Assignee: Michael Akinyemi >Priority: Trivial > Labels: pull-request-available > Fix For: 1.6.0 > > > Fixing lint issues in test/e2e/framework/helpers/common/utils.go: > test/e2e/framework/helpers/common/utils.go:60:9: wrapperFunc: use > strings.ReplaceAll method in `strings.Replace(name, "/", "-", -1)` (gocritic) > return strings.Replace(name, "/", "-", -1) > ^
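For reference, the change gocritic asks for is behaviour-preserving: `strings.ReplaceAll(s, old, new)` is equivalent to `strings.Replace(s, old, new, -1)`. A small sketch, with hypothetical function names wrapped around the flagged expression:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeOld mirrors the expression flagged by the linter.
func sanitizeOld(name string) string {
	return strings.Replace(name, "/", "-", -1)
}

// sanitizeNew is the form the linter suggests.
func sanitizeNew(name string) string {
	return strings.ReplaceAll(name, "/", "-")
}

func main() {
	fmt.Println(sanitizeOld("a/b/c"), sanitizeNew("a/b/c")) // prints "a-b-c a-b-c"
}
```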
[jira] [Created] (YUNIKORN-2553) Integrate deadlock detection with unit test
Peter Bacsko created YUNIKORN-2553: -- Summary: Integrate deadlock detection with unit test Key: YUNIKORN-2553 URL: https://issues.apache.org/jira/browse/YUNIKORN-2553 Project: Apache YuniKorn Issue Type: Sub-task Components: core - scheduler, shim - kubernetes Reporter: Peter Bacsko Assignee: Peter Bacsko To enhance testing, we can enable deadlock detection when the unit tests are executed with "make test". This gives us instant feedback about locking issues.
[jira] [Created] (YUNIKORN-2552) Recursive locking when sending Queue events
Peter Bacsko created YUNIKORN-2552: -- Summary: Recursive locking when sending Queue events Key: YUNIKORN-2552 URL: https://issues.apache.org/jira/browse/YUNIKORN-2552 Project: Apache YuniKorn Issue Type: Sub-task Components: core - scheduler Reporter: Peter Bacsko Assignee: Peter Bacsko When sending a queue event from {{queueEvents}}, we acquire the read lock again. {noformat} objects.(*Queue).IsManaged { sq.RLock() } < objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { } objects.(*queueEvents).sendRemoveQueueEvent { } } objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() } scheduler.(*partitionManager).cleanQueues { // all OK update the queue hierarchy and partition } scheduler.(*partitionManager).cleanQueues { if children := queue.GetCopyOfChildren(); len(children) != 0 { } scheduler.(*partitionManager).cleanRoot { manager.cleanQueues(manager.pc.root) } {noformat}
[jira] [Created] (YUNIKORN-2550) Fix locking in PartitionContext
Peter Bacsko created YUNIKORN-2550: -- Summary: Fix locking in PartitionContext Key: YUNIKORN-2550 URL: https://issues.apache.org/jira/browse/YUNIKORN-2550 Project: Apache YuniKorn Issue Type: Sub-task Components: core - common Reporter: Peter Bacsko Assignee: Peter Bacsko Possible deadlock was detected: {noformat} ~/repos/yunikorn-core/pkg/scheduler/partition.go:448 scheduler.(*PartitionContext).GetQueue { pc.RLock() } < ~/repos/yunikorn-core/pkg/scheduler/partition.go:447 scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) GetQueue(name string) *objects.Queue { } ~/repos/yunikorn-core/pkg/scheduler/placement/provided_rule.go:107 placement.(*providedRule).placeApplication { // if we cannot create the queue must exist } ~/repos/yunikorn-core/pkg/scheduler/placement/placement.go:125 placement.(*AppPlacementManager).PlaceApplication { queueName, err = checkRule.placeApplication(app, m.queueFn) } ~/repos/yunikorn-core/pkg/scheduler/partition.go:309 scheduler.(*PartitionContext).AddApplication { err := pc.getPlacementManager().PlaceApplication(app) } ~/repos/yunikorn-core/pkg/scheduler/context.go:523 scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) } ~/repos/yunikorn-core/pkg/scheduler/scheduler.go:130 scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: } {noformat} Lock order is different between {{PartitionContext}} and {{AppPlacementManager}}.
[jira] [Resolved] (YUNIKORN-2543) Fix locking in RMProxy
[ https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2543. Fix Version/s: 1.6.0 Resolution: Fixed > Fix locking in RMProxy > -- > > Key: YUNIKORN-2543 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2543 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler > Reporter: Peter Bacsko > Assignee: Peter Bacsko > Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > After merging YUNIKORN-2539, we already saw a potential issue with {{rmproxy.RMProxy}} and {{cache.Context}}: > Goroutine 1: > {noformat} > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:307 > rmproxy.(*RMProxy).GetResourceManagerCallback ??? < > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:306 > rmproxy.(*RMProxy).GetResourceManagerCallback ??? > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:359 > rmproxy.(*RMProxy).UpdateNode ??? > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 > cache.(*Context).updateNodeResources ??? > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 > cache.(*Context).updateNodeOccupiedResources ??? > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 > cache.(*Context).updateForeignPod ??? > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 > cache.(*Context).UpdatePod ??? > {noformat} > Goroutine 2: > {noformat} > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 > cache.(*Context).ForgetPod ??? < > github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 > cache.(*Context).ForgetPod ??? > github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 > cache.(*AsyncRMCallback).UpdateAllocation ??? > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:162 > rmproxy.(*RMProxy).triggerUpdateAllocation ??? > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:150 > rmproxy.(*RMProxy).processRMReleaseAllocationEvent ??? > github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:234 > rmproxy.(*RMProxy).handleRMEvents ??? > {noformat} > Right now this seems to be safe because we only call {{RLock()}} in the {{RMProxy}} methods. However, should any of this change, we're in trouble due to lock ordering (Cache->RMProxy and RMProxy->Cache). > We need to investigate why we use only {{RLock()}} and whether it's needed at all. If nothing is modified, then we can drop the mutex completely.
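The Cache->RMProxy and RMProxy->Cache ordering problem only arises when one component still holds its own lock while calling into the other. One generic way to break the cycle, sketched below, is to copy what is needed under the lock, release it, and only then make the cross-component call. The `proxy` and `cache` types are hypothetical stand-ins for `RMProxy` and `cache.Context`; this illustrates the technique and is not a description of the change that was merged.

```go
package main

import (
	"fmt"
	"sync"
)

// proxy and cache are hypothetical stand-ins for RMProxy and the shim's
// cache.Context; only the locking shape matters here.
type proxy struct {
	mu       sync.RWMutex
	callback func(podID string)
}

type cache struct {
	mu    sync.RWMutex
	pods  map[string]string // pod ID -> node ID
	proxy *proxy
}

// setCallback registers the function the proxy calls on release.
func (p *proxy) setCallback(cb func(podID string)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.callback = cb
}

// releaseAllocation is the proxy -> cache direction. The callback is read
// under the proxy lock but invoked only after that lock is released.
func (p *proxy) releaseAllocation(podID string) {
	p.mu.RLock()
	cb := p.callback
	p.mu.RUnlock()
	if cb != nil {
		cb(podID)
	}
}

// updateNode is the cache -> proxy direction; it only needs the proxy lock.
func (p *proxy) updateNode(nodeID string) bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.callback != nil && nodeID != ""
}

// updatePod changes cache state under the cache lock, releases it, and only
// then calls into the proxy, so the two locks are never held together.
func (c *cache) updatePod(podID, nodeID string) bool {
	c.mu.Lock()
	c.pods[podID] = nodeID
	c.mu.Unlock()
	return c.proxy.updateNode(nodeID)
}

func (c *cache) forgetPod(podID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pods, podID)
}

func (c *cache) podCount() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.pods)
}

func newPair() (*proxy, *cache) {
	p := &proxy{}
	c := &cache{pods: make(map[string]string), proxy: p}
	p.setCallback(c.forgetPod)
	return p, c
}

func main() {
	p, c := newPair()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		id := fmt.Sprintf("pod-%d", i)
		wg.Add(2)
		go func() { defer wg.Done(); c.updatePod(id, "node-1") }()
		go func() { defer wg.Done(); p.releaseAllocation(id) }()
	}
	wg.Wait()
	fmt.Println("finished without deadlock, pods left:", c.podCount())
}
```

If the proxy's guarded state is in fact never modified after start-up, as the ticket suspects, the simpler option is the one it proposes: drop the mutex entirely.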