[jira] [Resolved] (YUNIKORN-2766) Only generate event if all predicates failed
[ https://issues.apache.org/jira/browse/YUNIKORN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2766.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Only generate event if all predicates failed
>
> Key: YUNIKORN-2766
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2766
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Right now, we send an event to the pod if a predicate failed:
> {noformat}
> if err := plugin.Predicates({
>     AllocationKey: allocationKey,
>     NodeID:        sn.NodeID,
>     Allocate:      allocate,
> }); err != nil {
>     log.Log(log.SchedNode).Debug("running predicates failed",
>         zap.String("allocationKey", allocationKey),
>         zap.String("nodeID", sn.NodeID),
>         zap.Bool("allocateFlag", allocate),
>         zap.Error(err))
>     // running predicates failed
>     msg := err.Error()
>     ask.LogAllocationFailure(msg, allocate)
>     ask.SendPredicateFailedEvent(msg)
>     return false
> }
> {noformat}
> This is, however, not correct. We should only generate an event if *all* predicates have failed, which means that the pod cannot be scheduled. A failing predicate for a given node can be perfectly normal in many cases.
> Instead, we should aggregate the failed predicates and send an event like:
> {noformat}
> All predicates failed for request '345d70d7-243a-4077-a9f8-0bb76c3532d7': node(s) didn't match Pod's node affinity/selector (20x); node(s) had taints that the pod didn't tolerate (5x)
> {noformat}
> where 20x and 5x tell how many times a certain predicate failed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2652) Expand getApplication() endpoint handler to return resource usage
[ https://issues.apache.org/jira/browse/YUNIKORN-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2652.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Expand getApplication() endpoint handler to return resource usage
>
> Key: YUNIKORN-2652
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2652
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Rich Scott
> Assignee: Rich Scott
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Some users would like to be able to see resource usage (preempted, placeholder resource, etc) for applications that have been completed. The `getApplication()` endpoint handler should be enhanced to take an optional parameter specifying that the user would like details about resources included in the response, and a new `ApplicationXXXDAOInfo` object that is a slight superset of `ApplicationDAOInfo` should be introduced, and can be used in the response.
[jira] [Created] (YUNIKORN-2777) Improve TrackedResource type
Peter Bacsko created YUNIKORN-2777:

    Summary: Improve TrackedResource type
    Key: YUNIKORN-2777
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2777
    Project: Apache YuniKorn
    Issue Type: Improvement
    Components: core - common
    Reporter: Peter Bacsko

Currently, TrackedResource is defined as:
{noformat}
type TrackedResource struct {
    TrackedResourceMap map[string]map[string]int64
    locking.RWMutex
}
{noformat}
As it turned out during the review of [YUNIKORN-2652|https://github.com/apache/yunikorn-core/pull/897], {{TrackedResourceMap}} is actually {{map[string]*Resource}}. If we change the definition, we'll be able to use the functions that already exist for {{Resource}}.
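To illustrate why the type change helps, here is a minimal sketch. Everything in it is a stand-in: Resource is a simplified placeholder for YuniKorn's own type, sync.RWMutex replaces locking.RWMutex, and the helpers addTo, newTrackedResource and aggregate are hypothetical names chosen for this example only.

```go
package main

import (
	"fmt"
	"sync"
)

// Resource is a simplified stand-in for YuniKorn's Resource type (assumption).
type Resource struct {
	Resources map[string]int64
}

// addTo adds every quantity in other to r. Hypothetical helper that plays the
// role of the "functions that already exist for Resource".
func (r *Resource) addTo(other *Resource) {
	if other == nil {
		return
	}
	for name, qty := range other.Resources {
		r.Resources[name] += qty
	}
}

// TrackedResource with the proposed map[string]*Resource definition.
type TrackedResource struct {
	TrackedResourceMap map[string]*Resource
	sync.RWMutex // stand-in for locking.RWMutex
}

func newTrackedResource() *TrackedResource {
	return &TrackedResource{TrackedResourceMap: map[string]*Resource{}}
}

// aggregate adds usage under key, delegating the arithmetic to Resource
// instead of looping over a nested map[string]int64 by hand.
func (tr *TrackedResource) aggregate(key string, usage *Resource) {
	tr.Lock()
	defer tr.Unlock()
	res, ok := tr.TrackedResourceMap[key]
	if !ok {
		res = &Resource{Resources: map[string]int64{}}
		tr.TrackedResourceMap[key] = res
	}
	res.addTo(usage)
}

func main() {
	tr := newTrackedResource()
	tr.aggregate("instanceA", &Resource{Resources: map[string]int64{"vcore": 2}})
	tr.aggregate("instanceA", &Resource{Resources: map[string]int64{"vcore": 3, "memory": 10}})
	fmt.Println(tr.TrackedResourceMap["instanceA"].Resources)
}
```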
[jira] [Resolved] (YUNIKORN-2707) Tagging for 1.5.2
[ https://issues.apache.org/jira/browse/YUNIKORN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2707.
    Fix Version/s: 1.5.2
    Resolution: Fixed

> Tagging for 1.5.2
>
> Key: YUNIKORN-2707
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2707
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: release
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.5.2
Re: [VOTE] Release Apache YuniKorn 1.5.2 RC1
+1 binding

- Built images from source (amd64) on Ubuntu 22.04 w/ go 1.22.5
- Checked signature & checksum
- Ran make test && make image
- Installed on KIND and ran some sample jobs
- Ran performance benchmark unit test with race & deadlock detection

Thank you all for voting on RC1 for 1.5.2.
Voting for the release has passed with:
4 binding +1
3 non-binding +1
no 0 or -1 votes.

As the next step, I'll publish the release and the images, and update the website. After that is done I will send an announcement email.

Thank you,
Peter

On Fri, Jul 26, 2024 at 10:41 AM Lan Tzu-Hua wrote:

> +1 (non-binding)
>
> - Verified signature and checksums
> - Built on Windows 11 (amd64) with WSL Ubuntu 22.04.4 using go 1.22.4
> - Installed locally on Kind cluster 1.28.9
> - make test passed
> - E2E test passed
> - Verified Web UI
>
> Thanks,
> Tzu-Hua Lan
>
> On Fri, Jul 26, 2024 at 6:14 AM Craig Condit wrote:
>
> > +1 (binding).
> >
> > - Verified signature and checksums
> > - Verified LICENSE and NOTICE files
> > - Verified release tarball structure
> > - Built both arm64 and amd64 releases on M3 Mac using go 1.22.5.
> > - Installed arm64 locally on Kind cluster (1.30.0)
> > - Installed amd64 on real cluster (1.29.2+rke2r1)
> > - Ran simple jobs
> >
> > > On Jul 23, 2024, at 7:17 AM, Peter Bacsko wrote:
> > >
> > > Quick correction: the proper URL is
> > > https://dist.apache.org/repos/dist/dev/yunikorn/1.5.2-RC1/
> > >
> > > On Tue, Jul 23, 2024 at 2:15 PM Peter Bacsko wrote:
> > >
> > >> Hello everyone,
> > >>
> > >> I would like to call a vote for releasing Apache YuniKorn 1.5.2 RC1.
> > >> This is a minor release which contains only bugfixes.
> > >>
> > >> The release artefacts have been uploaded here:
> > >> https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC2/
> > >>
> > >> My public key is located in the KEYS file:
> > >> https://downloads.apache.org//yunikorn/KEYS
> > >>
> > >> JIRA issues that have been resolved in this release:
> > >> https://issues.apache.org/jira/issues/?filter=12353487
> > >>
> > >> The release (similarly to 1.5.1) solves a deadlock issue. If possible,
> > >> test Yunikorn with workloads that put Yunikorn under stress (i.e.
> > >> thousands/tens of thousands of pods).
> > >>
> > >> Git tags for each component are as follows:
> > >> yunikorn-scheduler-interface: v1.5.2-1
> > >> yunikorn-core: v1.5.2-1
> > >> yunikorn-k8shim: v1.5.2-1
> > >> yunikorn-web: v1.5.2-1
> > >> yunikorn-release: v1.5.2-1
> > >>
> > >> Once the release is voted on and approved, all repos will be tagged
> > >> 1.5.2 for consistency.
> > >>
> > >> Please review and vote. The vote will be open for at least 72 hours
> > >> and closes on Friday 26 Jul 2024 16:00:00 CEST.
> > >>
> > >> [ ] +1 Approve
> > >> [ ] +0 No opinion
> > >> [ ] -1 Disapprove (and the reason why)
> > >>
> > >> Thank you,
> > >> Peter
[jira] [Resolved] (YUNIKORN-2759) Replace %w by Errors.join
[ https://issues.apache.org/jira/browse/YUNIKORN-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2759.
    Fix Version/s: 1.6.0
    Resolution: Fixed

Merged to master.

> Replace %w by Errors.join
>
> Key: YUNIKORN-2759
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2759
> Project: Apache YuniKorn
> Issue Type: Improvement
> Reporter: Chia-Ping Tsai
> Assignee: Hsien-Cheng(Ryan) Huang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> original discussion: https://issues.apache.org/jira/browse/YUNIKORN-2262
> errors.Join can make the code more performant and readable
[jira] [Resolved] (YUNIKORN-2770) Simplify Application.GetTask()
[ https://issues.apache.org/jira/browse/YUNIKORN-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2770.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Simplify Application.GetTask()
>
> Key: YUNIKORN-2770
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2770
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
> {{Application.GetTask()}} returns a {{*Task}} and an {{error}}, but the {{error}} is completely unnecessary. We either have the task for the given taskID or we don't.
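One possible shape of the simplified accessor is sketched below. The struct fields are hypothetical, locking is omitted, and returning nil for a missing task is an assumption; the notification does not state which signature was chosen in the end.

```go
package main

import "fmt"

// Task and Application are reduced to the minimum needed for the example;
// the field names are hypothetical.
type Task struct {
	taskID string
}

type Application struct {
	taskMap map[string]*Task
}

// GetTask returns the task for taskID, or nil when the application has no
// such task. No error value is needed: absence is the only failure mode.
func (app *Application) GetTask(taskID string) *Task {
	return app.taskMap[taskID]
}

func main() {
	app := &Application{taskMap: map[string]*Task{"task-1": {taskID: "task-1"}}}

	if task := app.GetTask("task-1"); task != nil {
		fmt.Println("found", task.taskID)
	}
	if app.GetTask("missing") == nil {
		fmt.Println("not found")
	}
}
```

Callers then replace an err != nil check with a nil check on the returned pointer.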
[jira] [Resolved] (YUNIKORN-2765) Improve si_helper & resource funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2765.
    Fix Version/s: 1.6.0
    Resolution: Fixed

Merged to master.

> Improve si_helper & resource funtion's test coverage
>
> Key: YUNIKORN-2765
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2765
> Project: Apache YuniKorn
> Issue Type: Test
> Components: shim - kubernetes
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Improve the test coverage of the following functions:
> * GetTerminationTypeFromString (unknown termination type)
> * getMaxResource (requested resource types are fewer than allocated types)
> * GetResource
> * GetTGResource
[jira] [Created] (YUNIKORN-2770) Simplify Application.GetTask()
Peter Bacsko created YUNIKORN-2770:

    Summary: Simplify Application.GetTask()
    Key: YUNIKORN-2770
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2770
    Project: Apache YuniKorn
    Issue Type: Improvement
    Components: shim - kubernetes
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko

{{Application.GetTask()}} returns a {{*Task}} and an {{error}}, but the {{error}} is completely unnecessary. We either have the task for the given taskID or we don't.
[jira] [Reopened] (YUNIKORN-2707) Tagging for 1.5.2
[ https://issues.apache.org/jira/browse/YUNIKORN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko reopened YUNIKORN-2707:

> Tagging for 1.5.2
>
> Key: YUNIKORN-2707
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2707
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: release
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.5.2
Re: [VOTE] Release Apache YuniKorn 1.5.2 RC1
Quick correction: the proper URL is
https://dist.apache.org/repos/dist/dev/yunikorn/1.5.2-RC1/

On Tue, Jul 23, 2024 at 2:15 PM Peter Bacsko wrote:

> Hello everyone,
>
> I would like to call a vote for releasing Apache YuniKorn 1.5.2 RC1.
> This is a minor release which contains only bugfixes.
>
> The release artefacts have been uploaded here:
> https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC2/
>
> My public key is located in the KEYS file:
> https://downloads.apache.org//yunikorn/KEYS
>
> JIRA issues that have been resolved in this release:
> https://issues.apache.org/jira/issues/?filter=12353487
>
> The release (similarly to 1.5.1) solves a deadlock issue. If possible,
> test Yunikorn with workloads that put Yunikorn under stress (i.e.
> thousands/tens of thousands of pods).
>
> Git tags for each component are as follows:
> yunikorn-scheduler-interface: v1.5.2-1
> yunikorn-core: v1.5.2-1
> yunikorn-k8shim: v1.5.2-1
> yunikorn-web: v1.5.2-1
> yunikorn-release: v1.5.2-1
>
> Once the release is voted on and approved, all repos will be tagged
> 1.5.2 for consistency.
>
> Please review and vote. The vote will be open for at least 72 hours
> and closes on Friday 26 Jul 2024 16:00:00 CEST.
>
> [ ] +1 Approve
> [ ] +0 No opinion
> [ ] -1 Disapprove (and the reason why)
>
> Thank you,
> Peter
[VOTE] Release Apache YuniKorn 1.5.2 RC1
Hello everyone,

I would like to call a vote for releasing Apache YuniKorn 1.5.2 RC1.
This is a minor release which contains only bugfixes.

The release artefacts have been uploaded here:
https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC2/

My public key is located in the KEYS file:
https://downloads.apache.org//yunikorn/KEYS

JIRA issues that have been resolved in this release:
https://issues.apache.org/jira/issues/?filter=12353487

The release (similarly to 1.5.1) solves a deadlock issue. If possible,
test Yunikorn with workloads that put Yunikorn under stress (i.e.
thousands/tens of thousands of pods).

Git tags for each component are as follows:
yunikorn-scheduler-interface: v1.5.2-1
yunikorn-core: v1.5.2-1
yunikorn-k8shim: v1.5.2-1
yunikorn-web: v1.5.2-1
yunikorn-release: v1.5.2-1

Once the release is voted on and approved, all repos will be tagged
1.5.2 for consistency.

Please review and vote. The vote will be open for at least 72 hours
and closes on Friday 26 Jul 2024 16:00:00 CEST.

[ ] +1 Approve
[ ] +0 No opinion
[ ] -1 Disapprove (and the reason why)

Thank you,
Peter
Re: [DISCUSSION] Yunikorn release 1.5.2
Hi all,

it seems to me that there are no more pending items that are required for 1.5.2. I'll start the branching & tagging.

Cheers,
Peter

On Mon, Jul 1, 2024 at 3:06 PM Peter Bacsko wrote:

> Hi team,
>
> it's about time to start the release of Yunikorn 1.5.2.
>
> The following items are planned for delivery:
> https://issues.apache.org/jira/issues/?filter=12353487
>
> No features in this release, only bugfixes.
>
> YUNIKORN-2703 [1] is a potential candidate, but whether it should be
> included or not is currently being discussed.
>
> If anyone knows about an issue which should be part of 1.5.2, please let
> me know.
>
> I'll be on vacation from 8th Jul for 2 weeks, so I don't have an exact
> timeline yet.
>
> Thanks,
> Peter
>
> [1] https://issues.apache.org/jira/browse/YUNIKORN-2703
[jira] [Created] (YUNIKORN-2766) Only generate event if all predicates failed
Peter Bacsko created YUNIKORN-2766:

    Summary: Only generate event if all predicates failed
    Key: YUNIKORN-2766
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2766
    Project: Apache YuniKorn
    Issue Type: Improvement
    Components: core - scheduler
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko

Right now, we send an event to the pod if a predicate failed:
{noformat}
if err := plugin.Predicates({
    AllocationKey: allocationKey,
    NodeID:        sn.NodeID,
    Allocate:      allocate,
}); err != nil {
    log.Log(log.SchedNode).Debug("running predicates failed",
        zap.String("allocationKey", allocationKey),
        zap.String("nodeID", sn.NodeID),
        zap.Bool("allocateFlag", allocate),
        zap.Error(err))
    // running predicates failed
    msg := err.Error()
    ask.LogAllocationFailure(msg, allocate)
    ask.SendPredicateFailedEvent(msg)
    return false
}
{noformat}
This is, however, not correct. We should only generate an event if *all* predicates have failed, which means that the pod cannot be scheduled. A failing predicate for a given node can be perfectly normal in many cases.
Instead, we should aggregate the failed predicates and send an event like:
{noformat}
All predicates failed for request '345d70d7-243a-4077-a9f8-0bb76c3532d7': node(s) didn't match Pod's node affinity/selector (20x), node(s) had taints that the pod didn't tolerate (5x)
{noformat}
where 20x and 5x tell how many times a certain predicate failed.
[jira] [Resolved] (YUNIKORN-2725) Temporarily disable failing e2e preemption tests
[ https://issues.apache.org/jira/browse/YUNIKORN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2725.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Temporarily disable failing e2e preemption tests
>
> Key: YUNIKORN-2725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
> Project: Apache YuniKorn
> Issue Type: Test
> Components: shim - kubernetes, test - e2e
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Disable the following tests to have green builds:
> Verify_preemption_on_priority_queue
> Verify_basic_preemption
> Verify_allow_preemption_tag
[jira] [Created] (YUNIKORN-2725) Temporarily disable failing e2e tests
Peter Bacsko created YUNIKORN-2725:

    Summary: Temporarily disable failing e2e tests
    Key: YUNIKORN-2725
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
    Project: Apache YuniKorn
    Issue Type: Test
    Components: shim - kubernetes, test - e2e
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko

Disable the following tests to have green builds:
Verify_preemption_on_priority_queue
Verify_basic_preemption
[jira] [Created] (YUNIKORN-2724) Improve the signature of methods notifyTaskComplete() and ensureAppAndTaskCreated()
Peter Bacsko created YUNIKORN-2724:

    Summary: Improve the signature of methods notifyTaskComplete() and ensureAppAndTaskCreated()
    Key: YUNIKORN-2724
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2724
    Project: Apache YuniKorn
    Issue Type: Improvement
    Components: shim - kubernetes
    Reporter: Peter Bacsko

From the review [https://github.com/apache/yunikorn-k8shim/pull/864]

"I also think we need to change the signature for {{notifyTaskComplete(string, string)}} to {{notifyTaskComplete(*Application, string)}}

Probably better to use a separate jira for that as it flows through into {{NotifyTaskComplete()}} and some tests. The 2 tests have the application pointer already. It removes a number of extra getApplication() calls we really do not need.

Similar for {{ensureAppAndTaskCreated()}} which is only ever called from this function. Add a parameter to it to make it: {{ensureAppAndTaskCreated(*v1.Pod, *Application)}} and only execute application creation {{{}if app == nil{}}}. This can be either in this jira or in a separate one."

That is, optimize the methods so that we avoid unnecessary {{GetApplication()}} calls.
[jira] [Resolved] (YUNIKORN-2182) Set ReadHeaderTimeout in http server
[ https://issues.apache.org/jira/browse/YUNIKORN-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2182.
    Fix Version/s: 1.6.0
    Resolution: Fixed

Merged to master.

> Set ReadHeaderTimeout in http server
>
> Key: YUNIKORN-2182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2182
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common, webapp
> Reporter: Wilfred Spiegelenburg
> Assignee: Chenchen Lai
> Priority: Major
> Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
> Potential Slowloris Attack because ReadHeaderTimeout is not configured in the http.Server (gosec)
> We do not set ReadTimeout or ReadHeaderTimeout so we do not have a timeout at all at the moment.
> BTW: this is not important for the webtest servers we build as they are just for our tests.
[jira] [Resolved] (YUNIKORN-2568) Move all xxxEvents types to objects/events
[ https://issues.apache.org/jira/browse/YUNIKORN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2568.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Move all xxxEvents types to objects/events
>
> Key: YUNIKORN-2568
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2568
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package
[ https://issues.apache.org/jira/browse/YUNIKORN-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2564.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> [Umbrella] Move xxxEvents types to a different package
>
> Key: YUNIKORN-2564
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0
>
> There are several Events that can be moved to a different package:
> * queueEvents
> * applicationEvents
> * askEvents
> * nodeEvents
> There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity to clean it up a bit and move these under e.g. {{pkg/scheduler/objects/events}}.
[DISCUSSION] Yunikorn release 1.5.2
Hi team,

it's about time to start the release of Yunikorn 1.5.2.

The following items are planned for delivery:
https://issues.apache.org/jira/issues/?filter=12353487

No features in this release, only bugfixes.

YUNIKORN-2703 [1] is a potential candidate, but whether it should be included or not is currently being discussed.

If anyone knows about an issue which should be part of 1.5.2, please let me know.

I'll be on vacation from 8th Jul for 2 weeks, so I don't have an exact timeline yet.

Thanks,
Peter

[1] https://issues.apache.org/jira/browse/YUNIKORN-2703
[jira] [Created] (YUNIKORN-2708) Release notes for 1.5.2
Peter Bacsko created YUNIKORN-2708:

    Summary: Release notes for 1.5.2
    Key: YUNIKORN-2708
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2708
    Project: Apache YuniKorn
    Issue Type: Sub-task
    Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2709) Update website for 1.5.2
Peter Bacsko created YUNIKORN-2709:

    Summary: Update website for 1.5.2
    Key: YUNIKORN-2709
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2709
    Project: Apache YuniKorn
    Issue Type: Sub-task
    Components: release
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2706) [UMBRELLA] YuniKorn 1.5.2 release efforts
Peter Bacsko created YUNIKORN-2706:

    Summary: [UMBRELLA] YuniKorn 1.5.2 release efforts
    Key: YUNIKORN-2706
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2706
    Project: Apache YuniKorn
    Issue Type: Task
    Components: release
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko

This umbrella is to track the work items needed for the 1.5.2 release.
Release manager: Peter Bacsko.
This release only contains bug fixes.
[jira] [Created] (YUNIKORN-2707) Tagging for 1.5.2
Peter Bacsko created YUNIKORN-2707:

    Summary: Tagging for 1.5.2
    Key: YUNIKORN-2707
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2707
    Project: Apache YuniKorn
    Issue Type: Sub-task
    Reporter: Peter Bacsko
    Assignee: Peter Bacsko
[jira] [Resolved] (YUNIKORN-2704) Event publish errors out when predicates fail
[ https://issues.apache.org/jira/browse/YUNIKORN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2704.
    Fix Version/s: 1.6.0
                   1.5.2
    Resolution: Fixed

Merged to master & branch-1.5

> Event publish errors out when predicates fail
>
> Key: YUNIKORN-2704
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2704
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Mit Desai
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0, 1.5.2
>
> I consistently see this error in the logs when events are published.
> I did put some debug logs and found that I only get it when the events for untolerated taints are published.
>
> E0618 17:43:17.858946 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"<>.17da2a31072bb32f\" is invalid: [action: Required value, reason: Required value]" event="\{ObjectMeta:{<>.17da2a31072bb32f dpi-dev 0 0001-01-01 00:00:00 + UTC map[] map[] [] [] []},EventTime:2024-06-18 17:43:17.857332069 + UTC m=+84279.014490005,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-59bdc88fdc-7h5bt,Action:,Reason:,Regarding:\{Pod <> <> 5c90315c-a07d-4801-9ecc-baf61ee45f11 v1 4323324038 },Related:nil,Note:Predicate failed for request '5c90315c-a07d-4801-9ecc-baf61ee45f11' with message: 'node(s) had untolerated taint \{<>: <>}',Type:Normal,DeprecatedSource:\{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"
[jira] [Resolved] (YUNIKORN-2694) Improve placement rule funtion's test coverage - 2
[ https://issues.apache.org/jira/browse/YUNIKORN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2694.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Improve placement rule funtion's test coverage - 2
>
> Key: YUNIKORN-2694
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2694
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased
[ https://issues.apache.org/jira/browse/YUNIKORN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2683.
    Fix Version/s: 1.6.0
    Resolution: Fixed

> Unnecessary error is logged when resource usage is increased
>
> Key: YUNIKORN-2683
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2683
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.6.0
>
> The refactored code in YUNIKORN-2542 contains an unnecessary warning message:
> {noformat}
> appGroup := userTracker.getGroupForApp(applicationID)
> log.Log(log.SchedUGM).Debug("Increasing resource usage for user",
>     zap.String("user", user.User),
>     zap.String("queue path", queuePath),
>     zap.String("application", applicationID),
>     zap.String("group", appGroup),
>     zap.Stringer("resource", usage))
> groupTracker := m.GetGroupTracker(appGroup)
> if groupTracker == nil {
>     log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map",
>         zap.String("application", applicationID),
>         zap.String("group", appGroup))
>     return
> }
> ...
> {noformat}
> We don't always have a {{groupTracker}}. The previous code simply called {{increaseTrackedResource()}} on an empty tracker:
> {noformat}
> func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) {
>     ut.Lock()
>     defer ut.Unlock()
>     ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage)
>     hierarchy := strings.Split(queuePath, configs.DOT)
>     ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage)
>     gt := ut.appGroupTrackers[applicationID]
>     log.Log(log.SchedUGM).Debug("Increasing resource usage for group",
>         zap.String("group", gt.getName()),
>         zap.Strings("queue path", hierarchy),
>         zap.String("application", applicationID),
>         zap.Stringer("resource", usage))
>     gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) <- can be null
> }
> {noformat}
[jira] [Resolved] (YUNIKORN-2661) Fix hard-coded boolean in setLimit
[ https://issues.apache.org/jira/browse/YUNIKORN-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2661.
    Fix Version/s: 1.6.0
                   1.5.2
    Resolution: Fixed

Merged to master & branch-1.5

> Fix hard-coded boolean in setLimit
>
> Key: YUNIKORN-2661
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2661
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}}, so this variable never reaches the leaves:
> {noformat}
> // Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
> func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
>     log.Log(log.SchedUGM).Debug("Setting limits",
>         zap.String("queue path", qt.queuePath),
>         zap.Strings("hierarchy", hierarchy),
>         zap.Uint64("max applications", maxApps),
>         zap.Stringer("max resources", maxResource),
>         zap.Bool("use wild card", useWildCard))
>     // depth first: all the way to the leaf, create if not exists
>     // more than 1 in the slice means we need to recurse down
>     if len(hierarchy) > 1 {
>         childName := hierarchy[1]
>         if qt.childQueueTrackers[childName] == nil {
>             qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
>         }
>         qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false) <-- should be "doWildCardCheck" not "false"
>     ...
> {noformat}
> Fix this and create a unit test for {{setLimit()}}.
[jira] [Resolved] (YUNIKORN-2516) Update documentation about event.RESTResponseSize
[ https://issues.apache.org/jira/browse/YUNIKORN-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2516. Fix Version/s: 1.6.0 Resolution: Fixed > Update documentation about event.RESTResponseSize > - > > Key: YUNIKORN-2516 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2516 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: documentation > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2512) Event system properties are not used
[ https://issues.apache.org/jira/browse/YUNIKORN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2512.
Fix Version/s: 1.6.0
Resolution: Fixed

> Event system properties are not used
> ------------------------------------
>
> Key: YUNIKORN-2512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2512
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - common
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.6.0
>
> There are two properties which are not used by the event system:
> # The property "event.requestCapacity" is supposed to determine the size of the slice that is used to transfer events between the core and the shim every 2 seconds. However, right now it's not used at all; we use the default (1000) every time.
> # The property "RESTResponseSize" is not in the code at all. It influences the maximum number of entries returned by the batch API. Currently, the hard-coded value is 1.
[jira] [Resolved] (YUNIKORN-2245) Application sorting: improve pending resource filtering
[ https://issues.apache.org/jira/browse/YUNIKORN-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2245.
Resolution: Won't Do

> Application sorting: improve pending resource filtering
> --------------------------------------------------------
>
> Key: YUNIKORN-2245
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2245
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Minor
>
> When sorting applications, we filter on pending resources:
> {noformat}
> func filterOnPendingResources(apps map[string]*Application) []*Application {
>     filteredApps := make([]*Application, 0)
>     for _, app := range apps {
>         // Only look at app when pending-res > 0
>         if resources.StrictlyGreaterThanZero(app.GetPendingResource()) {
>             filteredApps = append(filteredApps, app)
>         }
>     }
>     return filteredApps
> }
> {noformat}
> This filtering is relatively expensive, but necessary, because during the lifecycle of an application, {{sa.pending}} can become 0, and in this case we don't want to schedule anything from the app.
> The suggested approach is to track the total pendingAskRepeats inside the app. That way we don't need to call {{resources.StrictlyGreaterThanZero()}} and can perform a simple integer comparison instead.
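The suggested counter-based filter could look roughly like the sketch below. This is illustrative only, since the ticket was closed as Won't Do: the `app` type, the `pendingAsks` field and the `addAsk`/`removeAsk` helpers are all assumptions made for the example and are not taken from the YuniKorn code base.

```go
package main

import "fmt"

// app is a hypothetical, simplified application. pendingAsks is an
// assumed counter of outstanding ask repeats.
type app struct {
	id          string
	pendingAsks int
}

// addAsk and removeAsk keep the counter in sync with outstanding asks.
func (a *app) addAsk(repeats int) { a.pendingAsks += repeats }

func (a *app) removeAsk(repeats int) {
	a.pendingAsks -= repeats
	if a.pendingAsks < 0 {
		a.pendingAsks = 0 // never let the counter go negative
	}
}

// filterOnPending keeps only the apps with outstanding asks, using a
// plain integer comparison instead of a per-resource comparison.
func filterOnPending(apps map[string]*app) []*app {
	filtered := make([]*app, 0, len(apps))
	for _, a := range apps {
		if a.pendingAsks > 0 {
			filtered = append(filtered, a)
		}
	}
	return filtered
}

func main() {
	busy := &app{id: "app-1"}
	busy.addAsk(2)
	idle := &app{id: "app-2"}
	idle.addAsk(1)
	idle.removeAsk(1)

	apps := map[string]*app{busy.id: busy, idle.id: idle}
	for _, a := range filterOnPending(apps) {
		fmt.Println(a.id) // prints "app-1"
	}
}
```

The trade-off is that every code path that adds or removes an ask has to update the counter, which moves the cost from the sorting step to the bookkeeping.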
[jira] [Closed] (YUNIKORN-2221) Performance improvements phase II
[ https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko closed YUNIKORN-2221. -- > Performance improvements phase II > - > > Key: YUNIKORN-2221 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2221 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler, shim - kubernetes > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Minor > Fix For: 1.5.0 > > > Umbrella JIRA for further performance improvements in Yunikorn. > The main issues have been addressed in YUNIKORN-1715. However, it's still > possible to reduce memory and CPU usage further by doing smaller things. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2221) Performance improvements phase II
[ https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2221. Fix Version/s: 1.5.0 Resolution: Fixed > Performance improvements phase II > - > > Key: YUNIKORN-2221 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2221 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler, shim - kubernetes > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Minor > Fix For: 1.5.0 > > > Umbrella JIRA for further performance improvements in Yunikorn. > The main issues have been addressed in YUNIKORN-1715. However, it's still > possible to reduce memory and CPU usage further by doing smaller things. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance
[ https://issues.apache.org/jira/browse/YUNIKORN-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2653. Fix Version/s: 1.6.0 Resolution: Fixed > Gang scheduling K8s event formatting compliance > --- > > Key: YUNIKORN-2653 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2653 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > The K8s events provide definitions and rules around the content of the fields > within the event. Adjust the content of gang scheduling related events to > comply with the rules. > Focussed on the reason and action fields only. > * 'reason' is the reason this event is generated. 'reason' should be short > and unique; it should be in UpperCamelCase format (starting with a capital > letter). > * 'action' explains what happened with regarding/ what action did the > ReportingController take in objects name; it should be in UpperCamelCase > format (starting with a capital letter). > No space or long text. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased
Peter Bacsko created YUNIKORN-2683: -- Summary: Unnecessary error is logged when resource usage is increased Key: YUNIKORN-2683 URL: https://issues.apache.org/jira/browse/YUNIKORN-2683 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Peter Bacsko The refactored code in YUNIKORN-2542 contains an unnecessary warning message: {noformat} appGroup := userTracker.getGroupForApp(applicationID) log.Log(log.SchedUGM).Debug("Increasing resource usage for user", zap.String("user", user.User), zap.String("queue path", queuePath), zap.String("application", applicationID), zap.String("group", appGroup), zap.Stringer("resource", usage)) groupTracker := m.GetGroupTracker(appGroup) if groupTracker == nil { log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map", zap.String("application", applicationID), zap.String("group", appGroup)) return } ... {noformat} We don't always have a {{groupTracker}}. The previous code simply called {{increaseTrackedResource()}} on an empty tracker: {noformat} func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) { ut.Lock() defer ut.Unlock() ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage) hierarchy := strings.Split(queuePath, configs.DOT) ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage) gt := ut.appGroupTrackers[applicationID] log.Log(log.SchedUGM).Debug("Increasing resource usage for group", zap.String("group", gt.getName()), zap.Strings("queue path", hierarchy), zap.String("application", applicationID), zap.Stringer("resource", usage)) gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) <- can be null } {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2680) Improve placement rule funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2680. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Improve placement rule funtion's test coverage > -- > > Key: YUNIKORN-2680 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2680 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2681) Data race in TestGetStream_Limit
Peter Bacsko created YUNIKORN-2681:
--------------------------------------

Summary: Data race in TestGetStream_Limit
Key: YUNIKORN-2681
URL: https://issues.apache.org/jira/browse/YUNIKORN-2681
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler, test - unit
Reporter: Peter Bacsko
Assignee: Peter Bacsko

A data race was detected during a unit test:
{noformat}
==================
WARNING: DATA RACE
Write at 0x0170c220 by goroutine 2575:
  github.com/apache/yunikorn-core/pkg/webservice.NewWebApp()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/webservice.go:82 +0x11c
  github.com/apache/yunikorn-core/pkg/webservice.TestCheckHealthStatusNotFound()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2574 +0x2f
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44

Previous read at 0x0170c220 by goroutine 2542:
  github.com/apache/yunikorn-core/pkg/webservice.getStream()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers.go:1225 +0xbd3
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit.gowrap4()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0x4f

Goroutine 2575 (running) created at:
  testing.(*T).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2027 +0xf17
  main.main()
      _testmain.go:163 +0x2e4

Goroutine 2542 (running) created at:
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0xbb7
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44
==================
2024-06-18T13:40:54.182Z  INFO  core.events  events/event_streaming.go:164  Removing event stream consumer  {"name": "host-1", "creation time": "2024-06-18T13:40:54.181Z"}
2024-06-18T13:40:54.182Z  INFO  core.scheduler.health  webservice/handlers.go:623  Health check is not available
--- FAIL: TestCheckHealthStatusNotFound (0.00s)
    testing.go:1398: race detected during execution of test
{noformat}
[jira] [Resolved] (YUNIKORN-2673) Improve newFilter funtion's test coverage in filter.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2673. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Improve newFilter funtion's test coverage in filter.go > -- > > Key: YUNIKORN-2673 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2673 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2515) Add property event.RESTResponseSize to the batch event handler
[ https://issues.apache.org/jira/browse/YUNIKORN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2515. Fix Version/s: 1.6.0 Resolution: Fixed > Add property event.RESTResponseSize to the batch event handler > -- > > Key: YUNIKORN-2515 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2515 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2670) Improve util funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2670. Fix Version/s: 1.6.0 Resolution: Fixed > Improve util funtion's test coverage > > > Key: YUNIKORN-2670 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2670 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Improve the following funtion's test coverage in util.go > * ZeroTimeInUnixNano > * GetNewUUID > * IsRecoveryQueue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2669) nil pointer dereference error
[ https://issues.apache.org/jira/browse/YUNIKORN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2669. Resolution: Duplicate This looks like a dup of YUNIKORN-2562. The solution for this has been delivered in 1.5.1. It's also on master. > nil pointer dereference error > - > > Key: YUNIKORN-2669 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2669 > Project: Apache YuniKorn > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Junyoung Park >Assignee: Peter Bacsko >Priority: Major > > Environment: AWS EKS 1.26 > yunikorn-scheduler logs > {code:java} > panic: runtime error: invalid memory address or nil pointer > dereference[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 > pc=0x179b2f5] > goroutine 50 > [running]:github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc000661000, > {0xc008ad14a0, 0x24}) > github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/objects/application.go:1739 > > +0x615github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xc00046a100?, > 0xc01436c880) > github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/partition.go:1281 > +0x27fgithub.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc000502680?, > {0xc02014da60, 0x1, 0xc0112f5ee8?}, {0xc0060f8980, 0xb}) > github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:868 > +0x9egithub.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc00046a100?, > 0xc0145e8eb0?) 
> github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:750 > +0xa5github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000120990) > > github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:111 > +0x16ecreated by > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in > goroutine 1 > github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:55 +0x9c > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2637. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed Merged to master & branch-1.5. > finalizePods should ignore pods like registerPods does > -- > > Key: YUNIKORN-2637 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2637 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > > The initialisation code is a two step process for pods: first list all pods > and add them to the system in registerPods(). This returns a list of pods > processed. > The second step happens after event handlers are turned on and nodes have > been cleaned up etc. During the second step pods from the first step are > checked and removed. However pods that were already in a terminated state in > step 1 get removed again. Although the step should be idempotent this is > unneeded. When iterating over the existing pods any pod in a terminal state > should be skipped. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
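The skip described above can be sketched as follows. This is a simplified illustration, not the shim's code: the `pod` type and the `podsToFinalize` helper are hypothetical, and the only Kubernetes fact relied on is that `Succeeded` and `Failed` are the terminal pod phases.

```go
package main

import "fmt"

// pod is a hypothetical, minimal pod representation.
type pod struct {
	name  string
	phase string // "Pending", "Running", "Succeeded" or "Failed"
}

// isTerminated reports whether the pod is in a terminal phase.
func isTerminated(p pod) bool {
	return p.phase == "Succeeded" || p.phase == "Failed"
}

// podsToFinalize returns the pods from the first step that still need
// handling in the second step. Pods that were already terminated are
// skipped, because they were dealt with in the first step.
func podsToFinalize(existing []pod) []pod {
	result := make([]pod, 0, len(existing))
	for _, p := range existing {
		if isTerminated(p) {
			continue
		}
		result = append(result, p)
	}
	return result
}

func main() {
	existing := []pod{
		{name: "pod-a", phase: "Running"},
		{name: "pod-b", phase: "Succeeded"},
		{name: "pod-c", phase: "Failed"},
	}
	for _, p := range podsToFinalize(existing) {
		fmt.Println(p.name) // prints "pod-a"
	}
}
```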
[jira] [Resolved] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails
[ https://issues.apache.org/jira/browse/YUNIKORN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2668. Fix Version/s: 1.6.0 Resolution: Fixed > Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails > > > Key: YUNIKORN-2668 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2668 > Project: Apache YuniKorn > Issue Type: Task > Reporter: Peter Bacsko > Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails > due to a deadlock problem described in YUNIKORN-2629. Until that ticket is > resolved, let's disable this test for the time being, so upstream tests don't > fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails
Peter Bacsko created YUNIKORN-2668: -- Summary: Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails Key: YUNIKORN-2668 URL: https://issues.apache.org/jira/browse/YUNIKORN-2668 Project: Apache YuniKorn Issue Type: Task Reporter: Peter Bacsko Assignee: Peter Bacsko The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails due to a deadlock problem described in YUNIKORN-2629. Until that ticket is resolved, let's disable this test for the time being, so upstream tests don't fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2561) Support topology spread constraints on placeholder pods
[ https://issues.apache.org/jira/browse/YUNIKORN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2561. Fix Version/s: 1.6.0 Resolution: Fixed > Support topology spread constraints on placeholder pods > --- > > Key: YUNIKORN-2561 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2561 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Jacob Salway >Assignee: Jacob Salway >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > If a pod has a topology spread constraint with a `whenUnsatisfiable: > DoNotSchedule` constraint and is used as part of a task group, it is not > possible to pass the constraint to the placeholder pods created by Yunikorn. > This can result in placeholder pods being placed on a node that would violate > the original pod's topology spread constraint. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2643) utils.go WaitForCondition improvement
[ https://issues.apache.org/jira/browse/YUNIKORN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2643. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. Thanks [~mean-world] for the contribution. > utils.go WaitForCondition improvement > -- > > Key: YUNIKORN-2643 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2643 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: HUAN-IU LIOU >Assignee: HUAN-IU LIOU >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2663) Improve ACL struct funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2663. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Improve ACL struct funtion's test coverage > -- > > Key: YUNIKORN-2663 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2663 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Remove unreachable code in NewACL func > Improve the following funtion's test coverage in acl.go > * TestSetUsers > * TestSetGroups -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO
[ https://issues.apache.org/jira/browse/YUNIKORN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2666.
Fix Version/s: 1.6.0
Resolution: Fixed

> Fix DeepEqual comparison in Test_fixedRule_ruleDAO
> --------------------------------------------------
>
> Key: YUNIKORN-2666
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2666
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler, test - unit
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the non-deterministic nature of map key iteration:
> {noformat}
> fixed_rule_test.go:285: assertion failed:
> --- tt.want
> +++ ruleDAO
>   {
>     Name: "fixed",
>     Parameters: {"create": "true", "qualified": "false", "queue": "default"},
>     Filter: {
>       Type: "allow",
>       UserList: nil,
>       GroupList: []string{
> -       "group1",
> +       "group2",
> -       "group2",
> +       "group1",
>       },
>       UserExp: "",
>       GroupExp: "",
>     },
>     ParentRule: nil,
>   }
> {noformat}
> We use {{maps.Keys()}} when we create the user list and group list in {{FilterDAO}}.
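Go does not guarantee any order when iterating over a map, so a slice built from map keys has to be sorted before it is compared with `reflect.DeepEqual`. The ticket does not describe the actual fix; the sketch below shows one common approach using only the standard library, with a hypothetical `sortedKeys` helper.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// sortedKeys collects the keys of the map and sorts them, so the
// result is the same on every run regardless of iteration order.
func sortedKeys(m map[string]bool) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	groups := map[string]bool{"group2": true, "group1": true}
	got := sortedKeys(groups)
	want := []string{"group1", "group2"}
	fmt.Println(reflect.DeepEqual(got, want)) // prints "true"
}
```

Sorting can be done either where the list is produced or in the test just before the comparison; the first option also gives stable output to any other consumer of the list.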
[jira] [Created] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO
Peter Bacsko created YUNIKORN-2666: -- Summary: Fix DeepEqual comparison in Test_fixedRule_ruleDAO Key: YUNIKORN-2666 URL: https://issues.apache.org/jira/browse/YUNIKORN-2666 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler, test - unit Reporter: Peter Bacsko The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the non-deterministic nature of map key iteration: {noformat} fixed_rule_test.go:285: assertion failed: --- tt.want +++ ruleDAO { Name: "fixed", Parameters: {"create": "true", "qualified": "false", "queue": "default"}, Filter: { Type: "allow", UserList: nil, GroupList: []string{ - "group1", + "group2", - "group2", + "group1", }, UserExp: "", GroupExp: "", }, ParentRule: nil, } {noformat} We use {{maps.Keys()}} when we create the user list and group list in {{FilterDAO}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2650) Complete or remove web_server_test#TestProxy
[ https://issues.apache.org/jira/browse/YUNIKORN-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2650.
Fix Version/s: 1.6.0
Resolution: Fixed

> Complete or remove web_server_test#TestProxy
> --------------------------------------------
>
> Key: YUNIKORN-2650
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2650
> Project: Apache YuniKorn
> Issue Type: Test
> Reporter: Chia-Ping Tsai
> Assignee: Chenchen Lai
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> web_server_test has an empty test case: TestProxy [0]. It seems to me there is already a proxy-related test [1].
> [0] https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L82
> [1] https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L73
[jira] [Resolved] (YUNIKORN-2514) Update documentation about event.requestCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2514. Fix Version/s: 1.6.0 Resolution: Fixed > Update documentation about event.requestCapacity > > > Key: YUNIKORN-2514 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2514 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: documentation > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2654) Remove unused code in k8shim context
[ https://issues.apache.org/jira/browse/YUNIKORN-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2654. Fix Version/s: 1.6.0 Resolution: Fixed > Remove unused code in k8shim context > > > Key: YUNIKORN-2654 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2654 > Project: Apache YuniKorn > Issue Type: Task > Components: shim - kubernetes >Reporter: Wilfred Spiegelenburg >Assignee: Chenchen Lai >Priority: Minor > Labels: newbie, pull-request-available > Fix For: 1.6.0 > > > The NotifyApplicationComplete and NotifyApplicationFail function are not > called by anything and are unused code. > The K8shim does not trigger the application completion or failure. This is > triggered by the core when the application no longer has any activity > registered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2647) Flaky test TestUpdateNodeCapacity
[ https://issues.apache.org/jira/browse/YUNIKORN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2647.
Fix Version/s: 1.6.0
Resolution: Fixed

> Flaky test TestUpdateNodeCapacity
> ---------------------------------
>
> Key: YUNIKORN-2647
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2647
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: test - unit
> Reporter: Wilfred Spiegelenburg
> Assignee: Tseng Hsi-Huang
> Priority: Minor
> Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
> As we saw in YUNIKORN-2573, the single node update test might fail:
> {code:java}
> --- FAIL: TestUpdateNodeCapacity (0.03s)
>     operation_test.go:446: Expected partition resource map[memory:1 vcore:2], doesn't match with actual partition resource map[memory:1 vcore:2]{code}
> When updating node capacity, we calculate the resource delta and use that delta to update the resources in the partition.
> The test fails when the calls happen in the following order, the same as for multiple nodes:
> node.SetCapacity() -> waitForAvailableNodeResource() -> partitionInfo.GetTotalPartitionResource() -> partition.updatePartitionResource()
[jira] [Resolved] (YUNIKORN-2659) Improve config validator funtion's test coverage
[ https://issues.apache.org/jira/browse/YUNIKORN-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2659. Fix Version/s: 1.6.0 Resolution: Fixed > Improve config validator funtion's test coverage > > > Key: YUNIKORN-2659 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2659 > Project: Apache YuniKorn > Issue Type: Test > Components: core - common >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > Improve the following funtion's test coverage in configvalidator.go > * checkPlacementRule > * checkLimitResource > * checkLimit -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2661) Fix hard-coded boolean in setLimit
Peter Bacsko created YUNIKORN-2661:
--------------------------------------

Summary: Fix hard-coded boolean in setLimit
Key: YUNIKORN-2661
URL: https://issues.apache.org/jira/browse/YUNIKORN-2661
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}}, so this variable never reaches the leaves:
{noformat}
// Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
    log.Log(log.SchedUGM).Debug("Setting limits",
        zap.String("queue path", qt.queuePath),
        zap.Strings("hierarchy", hierarchy),
        zap.Uint64("max applications", maxApps),
        zap.Stringer("max resources", maxResource),
        zap.Bool("use wild card", useWildCard))
    // depth first: all the way to the leaf, create if not exists
    // more than 1 in the slice means we need to recurse down
    if len(hierarchy) > 1 {
        childName := hierarchy[1]
        if qt.childQueueTrackers[childName] == nil {
            qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
        }
        qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false)
    ...
{noformat}
Fix this and create a unit test for {{setLimit()}}.
[jira] [Resolved] (YUNIKORN-2649) Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in resources.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2649.

Fix Version/s: 1.6.0
Resolution: Fixed

> Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in resources.go
>
> Key: YUNIKORN-2649
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2649
> Project: Apache YuniKorn
> Issue Type: Test
> Components: core - common
> Reporter: JunHong Peng
> Assignee: JunHong Peng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2581) Expose running placement rules in REST
[ https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2581. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. > Expose running placement rules in REST > -- > > Key: YUNIKORN-2581 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2581 > Project: Apache YuniKorn > Issue Type: New Feature > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > > Since introducing the use of placement rules always and the recovery rule the > queue config does not correctly show the running rules. > Also if a config update has been rejected, for any reason, the rules would > not be correct > Exposing the configured rules from the placement manager works around all > these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2646) Deadlock detected during preemption
[ https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2646. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed > Deadlock detected during preemption > --- > > Key: YUNIKORN-2646 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2646 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Dmitry >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > Attachments: yunikorn-logs-lock.txt.gz > > > Hitting deadlocks in 1.5.1 > The log is attached -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2542) Consistent logging and tracker handling for increment/decrement
[ https://issues.apache.org/jira/browse/YUNIKORN-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2542. Fix Version/s: 1.6.0 Resolution: Fixed Merged to master. Thanks [~Tseng Hsi-Huang] for the contribution. > Consistent logging and tracker handling for increment/decrement > --- > > Key: YUNIKORN-2542 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2542 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler > Reporter: Peter Bacsko >Assignee: Tseng Hsi-Huang >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > > We log DEBUG output and use {{GroupTracker}} inconsistently in {{Manager}} > and in {{UserTracker}}. > Eg. > {{Manager.IncreaseTrackedResource()}}: only a single log output with DEBUG > level > {{Manager.DecreaseTrackedResource()}}: multiple log statements, also handles > the group tracker which is not the case with increments > This also affects {{UserTracker}} - logs handling are different > in {{increaseTrackedResource()}}/{{decreaseTrackedResource()}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2567) Remove Application reference from applicationEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2567. Fix Version/s: 1.6.0 Resolution: Fixed > Remove Application reference from applicationEvents > --- > > Key: YUNIKORN-2567 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2567 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2642) Don't set resources on the recovery queue
[ https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2642.

Resolution: Fixed

> Don't set resources on the recovery queue
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> The resource constraints can be set on dynamic queues based on application tags. We should not set them on the recovery queue, because there is no quota on it.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2635) test coverage improvement: same priority case in sorter
[ https://issues.apache.org/jira/browse/YUNIKORN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2635. Fix Version/s: 1.6.0 Resolution: Fixed > test coverage improvement: same priority case in sorter > > > Key: YUNIKORN-2635 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2635 > Project: Apache YuniKorn > Issue Type: Test > Components: core - scheduler >Reporter: Chen Yu Teng >Assignee: Chen Yu Teng >Priority: Minor > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application
[ https://issues.apache.org/jira/browse/YUNIKORN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2633.

Fix Version/s: 1.6.0
Resolution: Fixed

> Unnecessary warning from Partition when adding an application
>
> Key: YUNIKORN-2633
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> The following is printed when adding an application:
> {noformat}
> 2024-05-17T21:53:04.716+0200 WARN core.scheduler.queue scheduler/partition.go:344 Trying to set resources on a queue that is not an unmanaged leaf {"queueName": "root.default"}
> {noformat}
> This message is supposed to be printed when the application defines a guaranteed or max resource. After YUNIKORN-2547 it's always printed if the queue is managed.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2642) Don't set resources on the recovery queue
Peter Bacsko created YUNIKORN-2642:

Summary: Don't set resources on the recovery queue
Key: YUNIKORN-2642
URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The resource constraints can be set on dynamic queues based on application tags. We should not set them on the recovery queue, because there is no quota on it.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2566. Fix Version/s: 1.6.0 Resolution: Fixed > Remove AllocationAsk reference from askEvents > - > > Key: YUNIKORN-2566 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2566 > Project: Apache YuniKorn > Issue Type: Sub-task > Reporter: Peter Bacsko > Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2565) Remove Node reference from nodeEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2565. Fix Version/s: 1.6.0 Resolution: Fixed > Remove Node reference from nodeEvents > - > > Key: YUNIKORN-2565 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2565 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.6.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2618) Streamline AsyncRMCallback UpdateAllocation
[ https://issues.apache.org/jira/browse/YUNIKORN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2618.

Fix Version/s: 1.6.0
Resolution: Fixed

> Streamline AsyncRMCallback UpdateAllocation
>
> Key: YUNIKORN-2618
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2618
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Yun Sun
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> If a task is not found, {{context.getTask}} returns nil. In the {{response.New}} processing we should just log that fact and proceed to the next alloc. This simplifies the flow, as we never need to check for a nil task afterwards. We should never have a pod in the cache that does not exist as a task on an application.
> We retrieve the application using the application ID from the response but never use the object. We only use the application ID to pass into an event. The context event handler then does the exact same lookup again to process the event on the app.
> We need to become much smarter in this area: we do double or triple lookups and generate async events that just change the state of the app or task or kick off another event.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
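
The "log and proceed to the next alloc" flow described in YUNIKORN-2618 could look roughly like the sketch below. Everything in it is hypothetical: `task`, `taskCache`, `getTask` and `processNew` are illustrative names, not the shim's actual types or functions. The only behaviour taken from the issue is that a missing task yields nil and that the loop logs this and continues.

```go
package main

import (
	"fmt"
	"log"
)

// task is a hypothetical, minimal stand-in for a shim task.
type task struct{ id string }

// taskCache is a hypothetical lookup table keyed by allocation key.
type taskCache map[string]*task

// getTask returns nil when the task is unknown; a missing map entry
// yields the zero value of *task, which is nil.
func (c taskCache) getTask(key string) *task { return c[key] }

// processNew handles each new allocation in turn. An unknown task is
// logged and skipped, so the rest of the loop body never sees a nil task.
func processNew(c taskCache, allocationKeys []string) []string {
	handled := make([]string, 0, len(allocationKeys))
	for _, key := range allocationKeys {
		t := c.getTask(key)
		if t == nil {
			log.Printf("task not found, skipping allocation %q", key)
			continue
		}
		handled = append(handled, t.id)
	}
	return handled
}

func main() {
	cache := taskCache{"a": {id: "a"}, "c": {id: "c"}}
	fmt.Println(processNew(cache, []string{"a", "b", "c"}))
}
```

Handling the nil case once, at the top of the loop, is what removes the need for later nil checks.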
[jira] [Resolved] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts
[ https://issues.apache.org/jira/browse/YUNIKORN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2611.

Fix Version/s: 1.5.1
Resolution: Fixed

> [UMBRELLA] YuniKorn 1.5.1 release efforts
>
> Key: YUNIKORN-2611
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2611
> Project: Apache YuniKorn
> Issue Type: Task
> Components: release
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Fix For: 1.5.1
>
> This umbrella is to track the work items needed for the 1.5.1 release.
> Release manager: Peter Bacsko.
> This release only consists of bug fixes. Use the filter [https://issues.apache.org/jira/issues/?filter=12353383] to see the list of deliverables.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2614) Update website for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2614. Fix Version/s: 1.5.1 Target Version: 1.5.1 Resolution: Fixed > Update website for 1.5.1 > > > Key: YUNIKORN-2614 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2614 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: release > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2639) Clarify release procedure for minor releases
Peter Bacsko created YUNIKORN-2639:

Summary: Clarify release procedure for minor releases
Key: YUNIKORN-2639
URL: https://issues.apache.org/jira/browse/YUNIKORN-2639
Project: Apache YuniKorn
Issue Type: Task
Components: release
Reporter: Peter Bacsko

After the release of 1.5.1, we realized that we need to properly define the release process for a minor release. This needs to be properly documented.

The clarification should cover things like:
# What it can and can't include (no features/bugfixes only)
# How to publish docs? Shall we keep the current "a.b.c" version on the website or remove it and publish "a.b.c+1"?
# Communication: possible difference in release notes, announcement, etc.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[ANNOUNCE] YuniKorn 1.5.1 released
Hi all,

It gives me great pleasure to announce that the Apache YuniKorn community has voted to release Apache YuniKorn v1.5.1. This is a minor release which contains 18 fixes.

The release details are on the v1.5.1 announcement page [1]. You can also download the release from the Downloads page [2].

Many thanks to everyone who contributed to the release.

Peter

[1] https://yunikorn.apache.org/release-announce/1.5.1
[2] https://yunikorn.apache.org/community/download/
[jira] [Created] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application
Peter Bacsko created YUNIKORN-2633:

Summary: Unnecessary warning from Partition when adding an application
Key: YUNIKORN-2633
URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

The following is printed when adding an application:
{noformat}
2024-05-17T21:53:04.716+0200 WARN core.scheduler.queue scheduler/partition.go:344 Trying to set resources on a queue that is not an unmanaged leaf {"queueName": "root.default"}
{noformat}
This message is supposed to be printed when the application defines a guaranteed or max resource.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2613) Release notes for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2613. Fix Version/s: 1.5.1 Resolution: Fixed > Release notes for 1.5.1 > --- > > Key: YUNIKORN-2613 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2613 > Project: Apache YuniKorn > Issue Type: Sub-task > Reporter: Peter Bacsko > Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2632) Data race in IncAllocatedResource
[ https://issues.apache.org/jira/browse/YUNIKORN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko resolved YUNIKORN-2632. Fix Version/s: 1.6.0 1.5.2 Resolution: Fixed > Data race in IncAllocatedResource > - > > Key: YUNIKORN-2632 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2632 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler > Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Critical > Labels: pull-request-available > Fix For: 1.6.0, 1.5.2 > > > After YUNIKORN-2548, we accidentally make an unlocked access to > \{{Queue.allocatedResource}}. > {noformat} > WARNING: DATA RACE > Read at 0x00c000578a00 by goroutine 52: > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032 > +0x6b > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495 > +0x184 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402 > +0x144 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42 > +0x95 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 > +0x6f1 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*node[go.shape.interface { > 
Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*node[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).iterate() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 > +0x448 > github.com/google/btree.(*BTreeG[go.shape.interface { > Less(github.com/google/btree.Item) bool }]).Ascend() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 > +0x108 > github.com/google/btree.(*BTree).Ascend() > > /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 > +0x108 > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode() > ... > Previous write at 0x00c000578a00 by goroutine 49: > > github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101 > +0x212 > > github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357 > +0x17b4 > > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870 > +0xba > > github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750 > +0x1e4 > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent() > > /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133 > +0x28d > > github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1() > > 
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60 > +0x33 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2632) Data race in IncAllocatedResource
Peter Bacsko created YUNIKORN-2632:

Summary: Data race in IncAllocatedResource
Key: YUNIKORN-2632
URL: https://issues.apache.org/jira/browse/YUNIKORN-2632
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko

After YUNIKORN-2548, we accidentally make an unlocked access to {{Queue.allocatedResource}}.
{noformat}
WARNING: DATA RACE
Read at 0x00c000578a00 by goroutine 52:
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032 +0x6b
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495 +0x184
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402 +0x144
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42 +0x95
  github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 +0x6f1
  github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448
  github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448
  github.com/google/btree.(*node[go.shape.interface { Less(github.com/google/btree.Item) bool }]).iterate()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 +0x448
  github.com/google/btree.(*BTreeG[go.shape.interface { Less(github.com/google/btree.Item) bool }]).Ascend()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 +0x108
  github.com/google/btree.(*BTree).Ascend()
      /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 +0x108
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode()
...
Previous write at 0x00c000578a00 by goroutine 49:
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101 +0x212
  github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357 +0x17b4
  github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870 +0xba
  github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750 +0x1e4
  github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133 +0x28d
  github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1()
      /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60 +0x33
{noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
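
For context on the class of bug reported in YUNIKORN-2632: a minimal sketch, assuming a simplified queue whose allocated amount is a single integer. The `queue` type and its `inc`, `dec` and `get` methods are hypothetical and are not the yunikorn-core implementation; the actual fix may look different. The sketch shows the general remedy for an unlocked access: every read and write of the shared field goes through the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical stand-in for a scheduler queue. The embedded
// RWMutex guards allocated; no method touches the field without it.
type queue struct {
	sync.RWMutex
	allocated int64
}

// inc adds delta under the write lock.
func (q *queue) inc(delta int64) {
	q.Lock()
	defer q.Unlock()
	q.allocated += delta
}

// dec subtracts delta under the write lock.
func (q *queue) dec(delta int64) {
	q.Lock()
	defer q.Unlock()
	q.allocated -= delta
}

// get reads the value under the read lock.
func (q *queue) get() int64 {
	q.RLock()
	defer q.RUnlock()
	return q.allocated
}

// run starts the given number of incrementing and decrementing goroutines
// and returns the final value once all of them have finished.
func run(workers int) int64 {
	q := &queue{}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); q.inc(2) }()
		go func() { defer wg.Done(); q.dec(1) }()
	}
	wg.Wait()
	return q.get()
}

func main() {
	fmt.Println(run(100))
}
```

Running such code with `go run -race` is the usual way to confirm that the detector no longer reports the access.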
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Yes, that is correct.

On Thu, May 16, 2024 at 8:54 PM Desai, Mit wrote:

> This issue could also be faced by non-autoscaled clusters who still gets a node added at some point. Right?
>
> -Mit
>
> From: Peter Bacsko
> Date: Thursday, May 16, 2024 at 11:23 AM
> To: dev@yunikorn.apache.org
> Subject: Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
> I'm fine with either approach. If it's too late, then let's go ahead with 1.5.1.
>
> Maybe it's better this way because we can do a more thorough verification.
>
> Peter
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
I'm fine with either approach. If it's too late, then let's go ahead with 1.5.1. Maybe it's better this way because we can do a more thorough verification. Peter On Thu, May 16, 2024 at 8:15 PM Craig Condit wrote: > IMO, it’s too late to update 1.5.1. We’ve already cut the tags, and those > must remain immutable. Our best bet would probably be to continue with > 1.5.1 as-is; the new issue is unlikely to affect non-autoscaled clusters > and it’s better than 1.5.0. We should, I think, get this latest issue fixed > and go for 1.5.2. > > Do we have a fix yet? If so, we could probably push for 1.5.2 alone. But > either way, 1.5.1 is already baked. > > > Craig > > > > On May 16, 2024, at 1:06 PM, Peter Bacsko wrote: > > > > Dear community, > > > > I've been working together with Jacob Salway on an issue and we found out > > that there's one more deadlock in the shim which can be triggered when a > > new node is added. This means that an autoscaler setup is prone to a > > deadlock. > > > > I filed a JIRA which explains the problem: > > https://issues.apache.org/jira/browse/YUNIKORN-2629 > > > > I already published the release artifacts to the release repo, GitHub and > > dockerhub, however no announcement has been made. I think we need an RC2 > > and re-run the voting and delete the artifacts. > > > > Thoughts, opinions? > > > > Peter > > > > On Thu, May 16, 2024 at 11:34 AM Peter Bacsko wrote: > > > >> +1 binding > >> > >> - Built images from source (amd64) on Ubuntu 22.04 > >> - Run make test && make image > >> - Run it on a local cluster > >> - Checked some REST API endpoints > >> - Ran sample jobs > >> > >> Thank you all for the voting on the RC1 for 1.5.1. > >> > >> Voting for the release has passed with: > >> 5 binding +1 > >> 3 non binding +1 > >> > >> no 0 or -1 votes. > >> > >> As the next step, I'll publish the release, images and update the > website. > >> After that is done I will send an announcement email. 
> >>
> >> Thank you,
> >> Peter
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Dear community, I've been working together with Jacob Salway on an issue and we found out that there's one more deadlock in the shim which can be triggered when a new node is added. This means that an autoscaler setup is prone to a deadlock. I filed a JIRA which explains the problem: https://issues.apache.org/jira/browse/YUNIKORN-2629 I already published the release artifacts to the release repo, GitHub and dockerhub, however no announcement has been made. I think we need an RC2 and re-run the voting and delete the artifacts. Thoughts, opinions? Peter On Thu, May 16, 2024 at 11:34 AM Peter Bacsko wrote: > +1 binding > > - Built images from source (amd64) on Ubuntu 22.04 > - Run make test && make image > - Run it on a local cluster > - Checked some REST API endpoints > - Ran sample jobs > > Thank you all for the voting on the RC1 for 1.5.1. > > Voting for the release has passed with: > 5 binding +1 > 3 non binding +1 > > no 0 or -1 votes. > > As the next step, I'll publish the release, images and update the website. > After that is done I will send an announcement email. > > Thank you, > Peter > > > On Wed, May 15, 2024 at 4:45 PM Manikandan R wrote: > >> +1 (Binding) >> >> - Built images from source on Mac M1 MacOS Monterey (arm64) with go 1.21.8 >> - Verified the signatures >> - Verified the licences and checksums >> - Run the scheduler with a local kind cluster (version 1.29.0) >> - Ran simple sleep jobs >> - Verified REST APIs outputs, Web UI >> >> Thanks, >> Mani >> >> On Tue, May 14, 2024 at 9:41 PM Desai, Mit >> wrote: >> >> > +1 (non-binding) >> > >> > >> > * Built release on MacOS Sonoma (arm64) >> > * Installed locally on Kind Cluster (1.28) >> > * Successfully ran make test >> > * Ran sample sleep jobs >> > >> > Thank you, Peter, for your efforts in driving the release. 
>> >
>> > - Mit Desai
>> >
>> > From: Peter Bacsko
>> > Date: Friday, May 10, 2024 at 1:41 AM
>> > To: dev@yunikorn.apache.org
>> > Subject: [VOTE] Release Apache YuniKorn 1.5.1 RC1
>> > Hello everyone,
>> >
>> > I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1.
>> > This is a minor release which contains only bugfixes.
>> >
>> > The release artefacts have been uploaded here:
>> > https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/
>> >
>> > My public key is located in the KEYS file:
>> > https://downloads.apache.org//yunikorn/KEYS
>> >
>> > JIRA issues that have been resolved in this release:
>> > https://issues.apache.org/jira/issues/?filter=12353383
>> >
>> > The release solves a deadlock issue.
If possible, test Yunikorn with >> > workloads that put Yunikorn under stress (ie. thousands/tens of >> thousands >> > of pods). >> > >> > Git tags for each component are as follows: >> > yunikorn-scheduler-interface: v1.5.1-1 >> > yunikorn-core: v1.5.1-1 >> > yunikorn-k8shim: v1.5.1-1 >> > yunikorn-web: v1.5.1-1 >> > yunikorn-release: v1.5.1-1 >> > >> > Once the release is voted on and approved, all repos will be tagged >> > 1.5.1 for consistency. >> > >> > Please review and vote. The vote will be open for at least 96 hours >> > and closes on Tuesday 14 May 2024, 20:00:00 CEST. >> > >> > [ ] +1 Approve >> > [ ] +0 No opinion >> > [ ] -1 Disapprove (and the reason why) >> > >> > >> > Thank you, >> > Peter >> > >> >
[jira] [Created] (YUNIKORN-2629) Adding a node can result in a deadlock
Peter Bacsko created YUNIKORN-2629:
--------------------------------------

Summary: Adding a node can result in a deadlock
Key: YUNIKORN-2629
URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Adding a new node after Yunikorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)

api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode({
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait()   <--- shim gets stuck here
{noformat}

If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)   <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}

Since {{addNode()}} is holding a write lock, the event processing loop gets stuck.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
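The locking problem described in this issue can be illustrated with a small, self-contained Go sketch. It is not the shim code and not necessarily the fix that was applied for YUNIKORN-2629: `shimContext`, `nodeKnown`, `markAccepted`, `startDispatcher` and `addNodes` are hypothetical names made up for the illustration. The sketch shows one common way to avoid this class of deadlock, which is to release the write lock before blocking on the `sync.WaitGroup`, so that the single dispatcher goroutine can still run handlers that need the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// shimContext is a hypothetical stand-in for the shim's Context type.
type shimContext struct {
	lock  sync.RWMutex
	nodes map[string]bool // node name -> accepted by the core
}

// nodeKnown mimics a handler (such as a task lookup) that needs the read lock.
func (c *shimContext) nodeKnown(name string) bool {
	c.lock.RLock()
	defer c.lock.RUnlock()
	_, ok := c.nodes[name]
	return ok
}

// markAccepted mimics the handler for a "node accepted" event.
func (c *shimContext) markAccepted(name string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.nodes[name] = true
}

// startDispatcher runs queued events one at a time, like a single event loop.
func startDispatcher(events <-chan func()) {
	go func() {
		for ev := range events {
			ev()
		}
	}()
}

// addNodes registers nodes and waits for every acceptance event.
// The write lock is released before wg.Wait(), so handlers that need
// the lock can still run on the dispatcher goroutine.
func (c *shimContext) addNodes(names []string, events chan<- func()) int {
	c.lock.Lock()
	for _, name := range names {
		c.nodes[name] = false
	}
	c.lock.Unlock() // holding this lock across wg.Wait() would reproduce the hang

	var wg sync.WaitGroup
	wg.Add(len(names))
	for _, name := range names {
		name := name
		// An unrelated event that takes the read lock on the dispatcher goroutine.
		events <- func() { c.nodeKnown(name) }
		// The acceptance event this call is waiting for.
		events <- func() {
			c.markAccepted(name)
			wg.Done()
		}
	}
	wg.Wait()

	// Count the nodes that were accepted.
	c.lock.RLock()
	defer c.lock.RUnlock()
	accepted := 0
	for _, ok := range c.nodes {
		if ok {
			accepted++
		}
	}
	return accepted
}

func main() {
	events := make(chan func())
	startDispatcher(events)
	ctx := &shimContext{nodes: map[string]bool{}}
	fmt.Println("accepted nodes:", ctx.addNodes([]string{"node-1", "node-2"}, events))
}
```

If the `c.lock.Unlock()` call were deferred to the end of `addNodes` instead, the first read-locking event would block the dispatcher and the acceptance event would never be processed, which is the situation the issue describes.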
[jira] [Resolved] (YUNIKORN-2612) Tagging for 1.5.1
[ https://issues.apache.org/jira/browse/YUNIKORN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2612.
-----------------------------------
Fix Version/s: 1.5.1
Resolution: Fixed

> Tagging for 1.5.1
> -----------------
>
> Key: YUNIKORN-2612
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2612
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: release
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 1.5.1
>
> Tagging for updating dependencies (SI/core/k8shim).
> No branching is needed because we'll deliver the release from branch-1.5
> directly as we did with incubator minor releases.
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
+1 binding - Built images from source (amd64) on Ubuntu 22.04 - Run make test && make image - Run it on a local cluster - Checked some REST API endpoints - Ran sample jobs Thank you all for the voting on the RC1 for 1.5.1. Voting for the release has passed with: 5 binding +1 3 non binding +1 no 0 or -1 votes. As the next step, I'll publish the release, images and update the website. After that is done I will send an announcement email. Thank you, Peter On Wed, May 15, 2024 at 4:45 PM Manikandan R wrote: > +1 (Binding) > > - Built images from source on Mac M1 MacOS Monterey (arm64) with go 1.21.8 > - Verified the signatures > - Verified the licences and checksums > - Run the scheduler with a local kind cluster (version 1.29.0) > - Ran simple sleep jobs > - Verified REST APIs outputs, Web UI > > Thanks, > Mani > > On Tue, May 14, 2024 at 9:41 PM Desai, Mit > wrote: > > > +1 (non-binding) > > > > > > * Built release on MacOS Sonoma (arm64) > > * Installed locally on Kind Cluster (1.28) > > * Successfully ran make test > > * Ran sample sleep jobs > > > > Thank you, Peter, for your efforts in driving the release. > > > > - Mit Desai > > > > From: Peter Bacsko > > Date: Friday, May 10, 2024 at 1:41 AM > > To: dev@yunikorn.apache.org > > Subject: [VOTE] Release Apache YuniKorn 1.5.1 RC1 > > Hello everyone, > > > > I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1. > > This is a minor release which contains only bugfixes. 
> >
> > The release artefacts have been uploaded here:
> > https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/
> >
> > My public key is located in the KEYS file:
> > https://downloads.apache.org//yunikorn/KEYS
> >
> > JIRA issues that have been resolved in this release:
> > https://issues.apache.org/jira/issues/?filter=12353383
> >
> > The release solves a deadlock issue. If possible, test Yunikorn with
> > workloads that put Yunikorn under stress (i.e. thousands/tens of thousands
> > of pods).
> > > > Git tags for each component are as follows: > > yunikorn-scheduler-interface: v1.5.1-1 > > yunikorn-core: v1.5.1-1 > > yunikorn-k8shim: v1.5.1-1 > > yunikorn-web: v1.5.1-1 > > yunikorn-release: v1.5.1-1 > > > > Once the release is voted on and approved, all repos will be tagged > > 1.5.1 for consistency. > > > > Please review and vote. The vote will be open for at least 96 hours > > and closes on Tuesday 14 May 2024, 20:00:00 CEST. > > > > [ ] +1 Approve > > [ ] +0 No opinion > > [ ] -1 Disapprove (and the reason why) > > > > > > Thank you, > > Peter > > >
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Thanks everyone for the testing.

I'll extend the voting deadline a bit (+24hrs) because one of our community members (Jacob Salway) wants to test it on a larger cluster with thousands of pods.

Peter

On Tue, May 14, 2024 at 4:20 PM TingYao wrote:

> +1 (binding)
>
> - Verified signatures and checksums
> - Verified LICENSE and NOTICE files
> - Built release on Mac Sonoma (ARM64)
> - make image with go 1.21.8
> - Ran make test, all tests passed
> - Installed locally on Kind cluster (1.29.4)
> - Ran simple sleep jobs
>
> Wilfred Spiegelenburg wrote on Tue, May 14, 2024 at 12:25 PM:
>
> > +1 (binding)
> >
> > - Verified signatures and checksums
> > - Verified LICENSE and NOTICE files
> > - Verified release tarball structure
> > - Built release on Mac Sonoma (ARM64):
> > - make image with go 1.22 and 1.21
> > - Ran make test, all tests passed
> > - Installed locally on Kind cluster (1.29)
> >
> > - REST interface checks:
> > - verified the SHA references in the cluster detail
> > - verified the build date is set correctly
> > - checked REST endpoints and UI
> >
> > On Fri, 10 May 2024 at 18:40, Peter Bacsko wrote:
> > >
> > > Hello everyone,
> > >
> > > I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1.
> > > This is a minor release which contains only bugfixes.
> > >
> > > The release artefacts have been uploaded here:
> > > https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/
> > >
> > > My public key is located in the KEYS file:
> > > https://downloads.apache.org//yunikorn/KEYS
> > >
> > > JIRA issues that have been resolved in this release:
> > > https://issues.apache.org/jira/issues/?filter=12353383
> > >
> > > The release solves a deadlock issue. If possible, test Yunikorn with
> > > workloads that put Yunikorn under stress (i.e. thousands/tens of thousands
> > > of pods).
> > > > > > Git tags for each component are as follows: > > > yunikorn-scheduler-interface: v1.5.1-1 > > > yunikorn-core: v1.5.1-1 > > > yunikorn-k8shim: v1.5.1-1 > > > yunikorn-web: v1.5.1-1 > > > yunikorn-release: v1.5.1-1 > > > > > > Once the release is voted on and approved, all repos will be tagged > > > 1.5.1 for consistency. > > > > > > Please review and vote. The vote will be open for at least 96 hours > > > and closes on Tuesday 14 May 2024, 20:00:00 CEST. > > > > > > [ ] +1 Approve > > > [ ] +0 No opinion > > > [ ] -1 Disapprove (and the reason why) > > > > > > > > > Thank you, > > > Peter > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org > > For additional commands, e-mail: dev-h...@yunikorn.apache.org > > > > >
[jira] [Resolved] (YUNIKORN-2623) Create unit tests for Clients
[ https://issues.apache.org/jira/browse/YUNIKORN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2623.
-----------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed

> Create unit tests for Clients
> -----------------------------
>
> Key: YUNIKORN-2623
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
> Project: Apache YuniKorn
> Issue Type: Test
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> Follow-up on YUNIKORN-2621.
> Create proper coverage for {{clients.Clients}}. See PR comment
> https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Thanks everyone for testing.

So far we have 3 non-binding +1s. I'll probably extend the voting for another 24 hours to get some binding feedback as well.

Peter

On Mon, May 13, 2024 at 8:15 PM 陳昱霖 wrote:

> +1 (non-binding)
>
> - Verified signatures and checksums
> - Built on Ubuntu 23.04 (amd64) with go1.22.2 linux/amd64, deployed on Kind 1.29.1
> - E2E tests passed in standard mode.
> - Ran simple preemption test
> - Checked RESTful APIs
> - Ran smoke test 10 times in shim
>
> Yu-Lin Chen
[jira] [Created] (YUNIKORN-2623) Create unit test coverage for Clients
Peter Bacsko created YUNIKORN-2623:
--------------------------------------

Summary: Create unit test coverage for Clients
Key: YUNIKORN-2623
URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
Project: Apache YuniKorn
Issue Type: Test
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko

Follow-up on YUNIKORN-2621.

Create proper coverage for {{clients.Clients}}. See PR comment https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.
[jira] [Resolved] (YUNIKORN-2620) Remove redundant variable `errorExpected` from configvalidator_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2620.
-----------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> Remove redundant variable `errorExpected` from configvalidator_test.go
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-2620
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2620
> Project: Apache YuniKorn
> Issue Type: Improvement
> Reporter: Chia-Ping Tsai
> Assignee: Yun Sun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
> This is similar to YUNIKORN-2598. We can check the existence of `validateFunc`
> instead of having an extra boolean flag.
[VOTE] Release Apache YuniKorn 1.5.1 RC1
Hello everyone,

I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1.
This is a minor release which contains only bugfixes.

The release artefacts have been uploaded here:
https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/

My public key is located in the KEYS file:
https://downloads.apache.org//yunikorn/KEYS

JIRA issues that have been resolved in this release:
https://issues.apache.org/jira/issues/?filter=12353383

The release solves a deadlock issue. If possible, test Yunikorn with
workloads that put Yunikorn under stress (i.e. thousands/tens of thousands
of pods).

Git tags for each component are as follows:
yunikorn-scheduler-interface: v1.5.1-1
yunikorn-core: v1.5.1-1
yunikorn-k8shim: v1.5.1-1
yunikorn-web: v1.5.1-1
yunikorn-release: v1.5.1-1

Once the release is voted on and approved, all repos will be tagged
1.5.1 for consistency.

Please review and vote. The vote will be open for at least 96 hours
and closes on Tuesday 14 May 2024, 20:00:00 CEST.

[ ] +1 Approve
[ ] +0 No opinion
[ ] -1 Disapprove (and the reason why)

Thank you,
Peter
Re: [DISCUSSION] Yunikorn release 1.5.1
I'd like to start the release process. The following items will be delivered as part of 1.5.1: https://issues.apache.org/jira/issues/?filter=12353383 No features in this release, only bugfixes. No further items are considered (unless something critical is found). Planned schedule: RC1 out: 10th May Voting: from 10th May to early next week, 13th-14th May Release: 15-16th May Thanks, Peter On Thu, May 2, 2024 at 2:11 AM Shravan Achar wrote: > Have been helping Peter with YUNIKORN-2526, and it has been a tricky > problem to reproduce and resolve. It makes sense to continue to make > progress on it without blocking the 1.5.1 patch release as it has > considerable fixes already (re: deadlock) > > Shravan > > On 2024/04/29 15:20:27 Peter Bacsko wrote: > > Hey Wilfred, > > > > Yes, I'm taking the role of release manager. > > I cherry-picked YUNIKORN-2520 to branch-1.5. > > > > Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look > at > > YUNIKORN-2057 as he originally volunteered to solve it. I told him that > it > > was not urgent, but depending on how quickly he makes progress, we might > > re-consider our position later. > > > > Peter > > > > On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg > > wrote: > > > > > Peter, > > > > > > Thank you for starting this discussion. See inline for further > comments. > > > > > > > Hi all, > > > > > > > > Due to the number of problems that we have discovered since the > release > > > of > > > > 1.5.0, I believe it makes sense to create a new Yunikorn release > which > > > > consists of bug fixes only. If I'm not mistaken we haven't done this > > > before > > > > (at least since leaving the ASF incubator), so this would be the > first > > > > minor Yunikorn release. > > > > > > +1 > > > I am totally for releasing YuniKorn 1.5.1 with the lock fixes. > > > Looking at all the work you have done for this release: would you be > > > willing to also step up as a release manager for the 1.5.1 release? 
> > > > > > > There are a bunch of fixes that are already on branch-1.5: > > > > > > > >- YUNIKORN-2521 Scheduler deadlock (resolved indirectly by > > > YUNIKORN-2544) > > > >- YUNIKORN-2539 Add optional deadlock detection > > > >- YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues > > > > - YUNIKORN-2543 Fix locking in RMProxy > > > > - YUNIKORN-2545 Eliminate multiple lock calls from Queue > > > > - YUNIKORN-2548 Potential deadlock during concurrent > > > > bottom-up/top-down queue traversal > > > > - YUNIKORN-2550 Fix locking in PartitionContext > > > > - YUNIKORN-2552 Recursive locking when sending remove queue > event > > > > - YUNIKORN-2553 [core] Enable deadlock detection during unit > tests > > > > - YUNIKORN-2563 [shim] Enable deadlock detection during unit > tests > > > > - YUNIKORN-2574 totalPartitionResource should not be mutated > with > > > > AddTo/SubFrom > > > > - YUNIKORN-2562 Nil pointer panic in > > > Application.ReplaceAllocation() > > > > > > > > > > Yes for all the above. > > > > > > > The following is In Progress for 1.5.1: > > > > > > > >- YUNIKORN-2526 Discrepancy between shim cache and core app/task > list > > > >after scheduler restart > > > > > > This would be a good one to get in if we have some progress on this. > > > Do we understand what is going on yet? I looked at the jira and am not > > > sure if we understand the root cause. > > > > > > > Candidates: > > > > > > > >- YUNIKORN-2520 PVC errors in AssumePod() are not handled > properly - > > > >Resolved, only cherry-picking is needed > > > > > > Yes, this could be added. > > > > > > I also think we need to check if we have any CVE fixes that need to be > > > added. 
> > > Quick check shows these two:
> > > * golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
> > > * google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
> > > * build with golang 1.21.9
> > >
> > > To satisfy the scanners, although we are not affected:
> > > * K8s 1.29.4 (CVE-2024-3177)
[jira] [Created] (YUNIKORN-2614) Update website for 1.5.1
Peter Bacsko created YUNIKORN-2614:
--------------------------------------

Summary: Update website for 1.5.1
Key: YUNIKORN-2614
URL: https://issues.apache.org/jira/browse/YUNIKORN-2614
Project: Apache YuniKorn
Issue Type: Sub-task
Components: release
Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2613) Release notes for 1.5.1
Peter Bacsko created YUNIKORN-2613:
--------------------------------------

Summary: Release notes for 1.5.1
Key: YUNIKORN-2613
URL: https://issues.apache.org/jira/browse/YUNIKORN-2613
Project: Apache YuniKorn
Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Created] (YUNIKORN-2612) Tagging for 1.5.1
Peter Bacsko created YUNIKORN-2612:
--------------------------------------

Summary: Tagging for 1.5.1
Key: YUNIKORN-2612
URL: https://issues.apache.org/jira/browse/YUNIKORN-2612
Project: Apache YuniKorn
Issue Type: Sub-task
Reporter: Peter Bacsko
[jira] [Created] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts
Peter Bacsko created YUNIKORN-2611:
--------------------------------------

Summary: [UMBRELLA] YuniKorn 1.5.1 release efforts
Key: YUNIKORN-2611
URL: https://issues.apache.org/jira/browse/YUNIKORN-2611
Project: Apache YuniKorn
Issue Type: Task
Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko
[jira] [Resolved] (YUNIKORN-2600) Update K8s dependency to 1.29.4
[ https://issues.apache.org/jira/browse/YUNIKORN-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2600.
-----------------------------------
Fix Version/s: 1.6.0, 1.5.1
Resolution: Fixed

Merged to master and branch-1.5.

> Update K8s dependency to 1.29.4
> -------------------------------
>
> Key: YUNIKORN-2600
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2600
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
> A security vulnerability was fixed in 1.29.4. Update K8s dependency to this
> version.
[jira] [Created] (YUNIKORN-2602) Fix spelling/grammar in configvalidator
Peter Bacsko created YUNIKORN-2602:
--------------------------------------

Summary: Fix spelling/grammar in configvalidator
Key: YUNIKORN-2602
URL: https://issues.apache.org/jira/browse/YUNIKORN-2602
Project: Apache YuniKorn
Issue Type: Improvement
Components: core - common
Reporter: Peter Bacsko

Let's fix some minor grammar issues in configvalidator.

E.g.: "existed" -> "existing", but there could be other mistakes.
[jira] [Created] (YUNIKORN-2600) Update K8s dependency to 1.29.4
Peter Bacsko created YUNIKORN-2600:
--------------------------------------

Summary: Update K8s dependency to 1.29.4
Key: YUNIKORN-2600
URL: https://issues.apache.org/jira/browse/YUNIKORN-2600
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko

A security vulnerability was fixed in 1.29.4. Update K8s dependency to this version.
[jira] [Resolved] (YUNIKORN-2472) REST API returns subtree by default
[ https://issues.apache.org/jira/browse/YUNIKORN-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2472.
-----------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> REST API returns subtree by default
> -----------------------------------
>
> Key: YUNIKORN-2472
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2472
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - common, documentation
> Affects Versions: 1.5.0
> Reporter: Wilfred Spiegelenburg
> Assignee: Ted Lin
> Priority: Minor
> Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
> The subtree query parameter is interpreted as the opposite of what would be
> expected.
> If you call {{/ws/v1/partition/default/queue/root?subtree}} then you do not
> get the subtree. If you call {{/ws/v1/partition/default/queue/root}} you get
> the whole tree rooted at root.
> We have not documented the new API yet so before we add it to the docs we
> should fix the behaviour:
> * subtree given: return the whole tree
> * subtree missing: return one level
> The code fix is as simple as a ! in a single call and inverting the test
> cases to pass or not pass {{?subtree}}
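The intended semantics from this issue (parameter present: whole tree; parameter absent: one level) can be sketched in Go as follows. This is only an illustration under stated assumptions: `wantSubtree` is a hypothetical helper, not the actual YuniKorn REST handler, and it relies only on the standard library's `url.Values.Has`, which treats a bare `?subtree` with no value as present.

```go
package main

import (
	"fmt"
	"net/url"
)

// wantSubtree reports whether the request URL asks for the whole queue tree.
// Hypothetical helper: the mere presence of "subtree" is enough, no value is needed.
func wantSubtree(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return u.Query().Has("subtree"), nil
}

func main() {
	for _, raw := range []string{
		"/ws/v1/partition/default/queue/root?subtree",
		"/ws/v1/partition/default/queue/root",
	} {
		whole, err := wantSubtree(raw)
		if err != nil {
			fmt.Println("invalid URL:", err)
			continue
		}
		fmt.Printf("%s -> whole tree: %v\n", raw, whole)
	}
}
```

An inverted check (the missing `!` the issue mentions) would flip both results, which is why the test cases with and without `?subtree` have to be inverted together with the code.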
[jira] [Resolved] (YUNIKORN-2573) Flaky test TestUpdateNodeCapacityWithMultipleNodes
[ https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2573.
-----------------------------------
Fix Version/s: 1.6.0
Resolution: Fixed

Merged to master.

> Flaky test TestUpdateNodeCapacityWithMultipleNodes
> --------------------------------------------------
>
> Key: YUNIKORN-2573
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2573
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Arthur Wang
> Assignee: Arthur Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
> [github pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801]
> GitHub CI occasionally fails.
>
> Root cause:
> [https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/context.go#L665]
>
> {code:java}
> partition.updatePartitionResource(node.SetCapacity(resources.NewResourceFromProto(sr)))
> {code}
>
> We calculate the delta resources by updating the node capacity.
> Then we update the resources map in the partition.
> The test fails when the calls happen in the following order:
> node.SetCapacity() ->
> [waitForAvailableNodeResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L520]
> ->
> [partitionInfo.GetTotalPartitionResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L525]
> -> partition.updatePartitionResource()
[jira] [Created] (YUNIKORN-2599) AppStateChange/AppTaskCompleted event cannot be handled in many states
Peter Bacsko created YUNIKORN-2599:
--------------------------------------

Summary: AppStateChange/AppTaskCompleted event cannot be handled in many states
Key: YUNIKORN-2599
URL: https://issues.apache.org/jira/browse/YUNIKORN-2599
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - yarn
Reporter: Peter Bacsko

After YUNIKORN-2597 got merged, it became clear that we keep sending an {{AppStateChange}} event which cannot be handled by the state machine. There isn't any state in the FSM object which would actually be able to process this event.

{{AppTaskCompleted}} is very similar: it is only processed in the {{Resuming}} state.

If someone runs the test case TestApplicationScheduling, the following errors are displayed:
{noformat}
[...]
2024-05-02T18:08:14.856+0200 ERROR shim.context cache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
2024-05-02T18:08:14.856+0200 INFO core.scheduler.application [...]
2024-05-02T18:08:14.857+0200 INFO core.scheduler.partition scheduler/partition.go:928 scheduler allocation processed {"appID": "app0001", "allocationKey": "task0002", "allocatedResource": "map[memory:1000 pods:1 vcore:1]", "placeholder": false, "targetNode": "test.host.02"}
2024-05-02T18:08:14.857+0200 ERROR shim.context cache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:15.856+0200 INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "app0001", "task": "task0001", "taskAlias": "default/task0001", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
2024-05-02T18:08:15.856+0200 ERROR shim.context cache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:16.858+0200 INFO shim.fsm cache/task_state.go:380 Task state transition {"app": "app0001", "task": "task0002", "taskAlias": "default/task0002", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
2024-05-02T18:08:16.858+0200 ERROR shim.context cache/context.go:1316 application event cannot be handled in the current state {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1