Re: Improving documentation about observability
Please file Jiras for any of the issues mentioned, or a single Jira if they can all be handled in one. All your remarks make sense. You can even open a PR for the changes that you would like to make.

The documentation lives in the yunikorn-site repository. The sample deployment is located in the yunikorn-k8shim repository [1]. Contributions are always welcome. We should document and/or set sensible configuration values if we provide any.

Wilfred

[1] https://github.com/apache/yunikorn-k8shim/blob/master/deployments/scheduler/prometheus.yml#L18

On Mon, 13 May 2024 at 19:31, Wiard van Rij wrote:
>
> Hello everyone,
>
> I'm getting in touch through the mailing list since I haven't set up my Jira
> account yet.
>
> I'd like to discuss the content found at
> https://yunikorn.apache.org/docs/user_guide/prometheus/. It seems that out of
> the box, it doesn't offer sensible default values. Typically, Prometheus is
> deployed as a comprehensive solution, not just for a single service like
> YuniKorn. Thus, suggesting a configuration change that alters the global
> interval to 3 seconds might not be the most advisable approach. Instead,
> I'd argue that adjusting this interval isn't necessary, especially
> considering you're recommending adding another job to the static config.
>
> Specifically, I propose the following adjustments:
>
> * Eliminate the global block from the configuration.
> * If an evaluation_interval is suggested, ensure its value matches the
>   scrape interval.
> * Set the scrape_interval to either 15 seconds or 30 seconds. I lean
>   towards 15 seconds as it should be more than adequate.
> * Encourage users to avoid using overrides in the scrape_configs. Instead,
>   they could use annotations on the service or implement a ServiceMonitor
>   when using Prometheus Operator.
>   * This is honestly an easier solution that doesn't involve changing
>     Prometheus 'core' configuration.
>
> Thanks in advance,
>
> Wiard
>

To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
+1 (binding)

- Verified signatures and checksums
- Verified LICENSE and NOTICE files
- Verified release tarball structure
- Built release on Mac Sonoma (ARM64):
  - make image with go 1.22 and 1.21
- Ran make test, all tests passed
- Installed locally on Kind cluster (1.29)
- REST interface checks:
  - verified the SHA references in the cluster detail
  - verified the build date is set correctly
  - checked REST endpoints and UI

On Fri, 10 May 2024 at 18:40, Peter Bacsko wrote:
>
> Hello everyone,
>
> I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1.
> This is a minor release which contains only bugfixes.
>
> The release artefacts have been uploaded here:
> https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/
>
> My public key is located in the KEYS file:
> https://downloads.apache.org//yunikorn/KEYS
>
> JIRA issues that have been resolved in this release:
> https://issues.apache.org/jira/issues/?filter=12353383
>
> The release solves a deadlock issue. If possible, test YuniKorn with
> workloads that put YuniKorn under stress (i.e. thousands/tens of thousands
> of pods).
>
> Git tags for each component are as follows:
> yunikorn-scheduler-interface: v1.5.1-1
> yunikorn-core: v1.5.1-1
> yunikorn-k8shim: v1.5.1-1
> yunikorn-web: v1.5.1-1
> yunikorn-release: v1.5.1-1
>
> Once the release is voted on and approved, all repos will be tagged
> 1.5.1 for consistency.
>
> Please review and vote. The vote will be open for at least 96 hours
> and closes on Tuesday 14 May 2024, 20:00:00 CEST.
>
> [ ] +1 Approve
> [ ] +0 No opinion
> [ ] -1 Disapprove (and the reason why)
>
>
> Thank you,
> Peter
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Sent with the wrong address (sorry).

> On May 13, 2024, at 4:58 PM, Craig Condit wrote:
>
> +1 (binding).
>
> - Verified keys and hashes
> - Built and deployed to on-prem cluster
> - Ran some test workloads
>
>
> Craig Condit
Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1
Thanks everyone for testing.

So far we have 3 non-binding +1s. I'll probably extend the voting for another 24 hours to get some binding feedback as well.

Peter

On Mon, May 13, 2024 at 8:15 PM 陳昱霖 wrote:
> +1 (non-binding)
>
> - Verified signatures and checksums
> - Built on Ubuntu 23.04 (amd64) with go1.22.2 linux/amd64, deployed on Kind
>   1.29.1
> - E2E tests passed in standard mode
> - Ran a simple preemption test
> - Checked the RESTful APIs
> - Ran the smoke test 10 times in the shim
>
> Yu-Lin Chen
>
[VOTE] Release Apache YuniKorn 1.5.1 RC1
+1 (non-binding)

- Verified signatures and checksums
- Built on Ubuntu 23.04 (amd64) with go1.22.2 linux/amd64, deployed on Kind 1.29.1
- E2E tests passed in standard mode
- Ran a simple preemption test
- Checked the RESTful APIs
- Ran the smoke test 10 times in the shim

Yu-Lin Chen
[jira] [Resolved] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return
     [ https://issues.apache.org/jira/browse/YUNIKORN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chia-Ping Tsai resolved YUNIKORN-2578.
--------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

> Refactor SchedulerCache.GetPod() remove bool return
> ---------------------------------------------------
>
>                 Key: YUNIKORN-2578
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2578
>             Project: Apache YuniKorn
>          Issue Type: Task
>          Components: shim - kubernetes
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Hsien-Cheng(Ryan) Huang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} return two values:
> # *v1.Pod
> # bool
> The boolean value is redundant as it is false if the pod is not found and a
> nil is returned for the pod. The boolean is true if the pod has a value.
> Testing for a nil pod has the same result.
> We do not cache a nil pod in the cache for a pod UID.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
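
For readers following YUNIKORN-2578, here is a minimal Go sketch of the single-return-value pattern the issue describes. It is not the actual yunikorn-k8shim code: `Pod`, `NewSchedulerCache`, `UpdatePod` and the `podsMap` field are hypothetical stand-ins (the real methods return `*v1.Pod`), introduced only so the example compiles without the Kubernetes dependency.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical stand-in for *v1.Pod so the sketch has no external dependencies.
type Pod struct {
	UID  string
	Name string
}

// SchedulerCache is a simplified, hypothetical cache keyed by pod UID.
type SchedulerCache struct {
	lock    sync.RWMutex
	podsMap map[string]*Pod
}

// NewSchedulerCache returns an empty cache.
func NewSchedulerCache() *SchedulerCache {
	return &SchedulerCache{podsMap: make(map[string]*Pod)}
}

// UpdatePod stores a pod under its UID. Nil pods are never stored,
// so a nil result from GetPod can only mean "not found".
func (c *SchedulerCache) UpdatePod(pod *Pod) {
	if pod == nil {
		return
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	c.podsMap[pod.UID] = pod
}

// GetPod returns the cached pod, or nil if the UID is unknown.
// No separate bool is returned: the nil check carries the same information.
func (c *SchedulerCache) GetPod(uid string) *Pod {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.GetPodNoLock(uid)
}

// GetPodNoLock is the same lookup for callers that already hold the lock.
// Indexing a map with a missing key yields the zero value, which is nil here.
func (c *SchedulerCache) GetPodNoLock(uid string) *Pod {
	return c.podsMap[uid]
}

func main() {
	cache := NewSchedulerCache()
	cache.UpdatePod(&Pod{UID: "uid-1", Name: "sleep"})

	if pod := cache.GetPod("uid-1"); pod != nil {
		fmt.Println("found:", pod.Name)
	}
	if pod := cache.GetPod("uid-2"); pod == nil {
		fmt.Println("uid-2 not found")
	}
}
```

Because a nil pod is never stored, `pod == nil` tells the caller exactly what the extra bool did, and call sites lose one variable.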
[jira] [Resolved] (YUNIKORN-2617) Update kindest/node to v1.29.2 for Makefile
     [ https://issues.apache.org/jira/browse/YUNIKORN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chia-Ping Tsai resolved YUNIKORN-2617.
--------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

> Update kindest/node to v1.29.2 for Makefile
> -------------------------------------------
>
>                 Key: YUNIKORN-2617
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2617
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>            Reporter: Chia-Ping Tsai
>            Assignee: Hsien-Cheng(Ryan) Huang
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> see
> https://github.com/apache/yunikorn-k8shim/blob/d884f194b2cf60e574717f60fe648305781b56ef/Makefile#L68
[DISCUSSION] Preemption Hardening
Hi everyone,

Since the preemption feature was released, we have been facing some issues with its behaviour, especially with extra resource types, unnecessary victims being killed, etc. Hence, the Preemption Hardening umbrella Jira https://issues.apache.org/jira/browse/YUNIKORN-2493 has been filed.

I've been working on this for a while. I've written a doc https://docs.google.com/document/d/1nYtputEluP4Akf3CAu7DdGCfKW_WHJUD0rKjLVMJGj8/edit?usp=sharing to explain the problem background, the approach taken to solve these problems, etc. Some of the sub-tasks have been completed and the whole hardening exercise is nearing completion. There is an important PR https://github.com/apache/yunikorn-core/pull/830, pending review, that contains the core changes discussed in the doc.

Please read the doc, review the PR and share your feedback. Please also go through the other sub-tasks. If you have come across any issues or weird behaviour, I would like your extra attention on this whole exercise (by running your failed cases against the new code, sharing your problem statement, revisiting the related open bugs you filed, etc.) so that we can iterate further and make the feature more stable.

Thank you.
Mani
[jira] [Created] (YUNIKORN-2624) Enable hotlinking to YuniKorn
Denis Coric created YUNIKORN-2624:
-------------------------------------

             Summary: Enable hotlinking to YuniKorn
                 Key: YUNIKORN-2624
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2624
             Project: Apache YuniKorn
          Issue Type: Sub-task
          Components: webapp
            Reporter: Denis Coric
            Assignee: Denis Coric


Enable third-party apps to link to YuniKorn with query parameters that populate the partition and queue. The queue and partition should be preselected on the page and stored in the application storage using the existing functionality.
[jira] [Created] (YUNIKORN-2623) Create unit test coverage for Clients
Peter Bacsko created YUNIKORN-2623:
--------------------------------------

             Summary: Create unit test coverage for Clients
                 Key: YUNIKORN-2623
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
             Project: Apache YuniKorn
          Issue Type: Test
          Components: shim - kubernetes
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


Follow-up on YUNIKORN-2621. Create proper coverage for {{clients.Clients}}.

See PR comment https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.
Improving documentation about observability
Hello everyone,

I'm getting in touch through the mailing list since I haven't set up my Jira account yet.

I'd like to discuss the content found at https://yunikorn.apache.org/docs/user_guide/prometheus/. It seems that out of the box, it doesn't offer sensible default values. Typically, Prometheus is deployed as a comprehensive solution, not just for a single service like YuniKorn. Thus, suggesting a configuration change that alters the global interval to 3 seconds might not be the most advisable approach. Instead, I'd argue that adjusting this interval isn't necessary, especially considering you're recommending adding another job to the static config.

Specifically, I propose the following adjustments:

* Eliminate the global block from the configuration.
* If an evaluation_interval is suggested, ensure its value matches the scrape interval.
* Set the scrape_interval to either 15 seconds or 30 seconds. I lean towards 15 seconds as it should be more than adequate.
* Encourage users to avoid using overrides in the scrape_configs. Instead, they could use annotations on the service or implement a ServiceMonitor when using Prometheus Operator.
  * This is honestly an easier solution that doesn't involve changing Prometheus 'core' configuration.

Thanks in advance,

Wiard
[jira] [Resolved] (YUNIKORN-2570) Add test cases to break the current preemption flow
     [ https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikandan R resolved YUNIKORN-2570.
------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

> Add test cases to break the current preemption flow
> ---------------------------------------------------
>
>                 Key: YUNIKORN-2570
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - scheduler
>            Reporter: Manikandan R
>            Assignee: Manikandan R
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> Add various test cases to break the current preemption flow. These tests would
> fail now. The follow-up Jira
> [https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems
> in the current preemption flow so that these test cases pass.