Re: Improving documentation about observability

2024-05-13 Thread Wilfred Spiegelenburg
Please file jiras for the issues mentioned, or a single jira if they
can all be handled together.
All your remarks make sense.

You can even open a PR for the changes that you would like to make.
The documentation is in the yunikorn-site repository.
The sample deployment is located in the yunikorn-k8shim repository [1].
Contributions are always welcome. We should document and/or set
sensible configuration values if we provide any.

Wilfred

[1] https://github.com/apache/yunikorn-k8shim/blob/master/deployments/scheduler/prometheus.yml#L18

On Mon, 13 May 2024 at 19:31, Wiard van Rij wrote:
>
> Hello everyone,
>
> I'm getting in touch through the mailing list since I haven't set up my Jira 
> account yet.
>
> I'd like to discuss the content found at 
> https://yunikorn.apache.org/docs/user_guide/prometheus/. It seems that out of 
> the box, it doesn't offer sensible default values. Typically, Prometheus is 
> deployed as a comprehensive solution, not just for a single service like 
> YuniKorn. Thus, suggesting a configuration change that alters the global 
> interval rate to 3 seconds might not be the most advisable approach. Instead, 
> I'd argue that adjusting this interval isn't necessary, especially 
> considering you're recommending adding another job to the static config.
>
> Specifically, I propose the following adjustments:
>
>   *   Eliminate the global block from the configuration.
>   *   If an evaluation_interval is suggested, ensure its value matches the 
> scrape interval.
>   *   Set the scrape_interval to either 15 seconds or 30 seconds. I lean 
> towards 15 seconds as it should be more than adequate.
>   *   Encourage users to avoid using overrides in the scrape_configs. Instead, 
> they could utilize annotations on the service or implement a ServiceMonitor 
> when using Prometheus Operator.
>       *   This is honestly an easier solution that doesn't involve changing 
> the Prometheus 'core' configuration.
>
> Thanks in advance,
>
> Wiard
>

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1

2024-05-13 Thread Wilfred Spiegelenburg
+1 (binding)

- Verified signatures and checksums
- Verified LICENSE and NOTICE files
- Verified release tarball structure
- Built release on macOS Sonoma (ARM64):
  - make image with Go 1.22 and 1.21
- Ran make test, all tests passed
- Installed locally on Kind cluster (1.29)

- REST interface checks:
  - verified the SHA references in the cluster detail
  - verified the build date is set correctly
- checked REST endpoints and UI

On Fri, 10 May 2024 at 18:40, Peter Bacsko wrote:
>
> Hello everyone,
>
> I would like to call a vote for releasing Apache YuniKorn 1.5.1 RC1.
> This is a minor release which contains only bugfixes.
>
> The release artefacts have been uploaded here:
>   https://dist.apache.org/repos/dist/dev/yunikorn/1.5.1-RC1/
>
> My public key is located in the KEYS file:
>   https://downloads.apache.org//yunikorn/KEYS
>
> JIRA issues that have been resolved in this release:
>   https://issues.apache.org/jira/issues/?filter=12353383
>
> The release solves a deadlock issue. If possible, test YuniKorn with
> workloads that put YuniKorn under stress (i.e. thousands/tens of thousands
> of pods).
>
> Git tags for each component are as follows:
> yunikorn-scheduler-interface: v1.5.1-1
> yunikorn-core: v1.5.1-1
> yunikorn-k8shim: v1.5.1-1
> yunikorn-web: v1.5.1-1
> yunikorn-release: v1.5.1-1
>
> Once the release is voted on and approved, all repos will be tagged
> 1.5.1 for consistency.
>
> Please review and vote. The vote will be open for at least 96 hours
> and closes on Tuesday 14 May 2024, 20:00:00 CEST.
>
> [ ] +1 Approve
> [ ] +0 No opinion
> [ ] -1 Disapprove (and the reason why)
>
>
> Thank you,
> Peter




Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1

2024-05-13 Thread Craig Condit
Sent with the wrong address (sorry).

> On May 13, 2024, at 4:58 PM, Craig Condit wrote:
> 
> +1 (binding).
> 
> - Verified keys and hashes
> - Built and deployed to on-prem cluster
> - Ran some test workloads
> 
> 
> Craig Condit
> 
> 
> 





Re: [VOTE] Release Apache YuniKorn 1.5.1 RC1

2024-05-13 Thread Peter Bacsko
Thanks everyone for testing.

So far we have 3 non-binding +1s. I'll probably extend the voting for
another 24 hours to get some binding feedback as well.

Peter

On Mon, May 13, 2024 at 8:15 PM 陳昱霖 wrote:

> +1 (non-binding)
>
> - Verified signatures and checksums
> - Built on Ubuntu 23.04 (amd64) with go1.22.2 linux/amd64, deployed on Kind
> 1.29.1
> - E2E tests passed in standard mode.
> - Ran a simple preemption test
> - Checked the RESTful APIs
> - Ran the smoke tests 10 times in the shim
>
> Yu-Lin Chen
>


[VOTE] Release Apache YuniKorn 1.5.1 RC1

2024-05-13 Thread 陳昱霖
+1 (non-binding)

- Verified signatures and checksums
- Built on Ubuntu 23.04 (amd64) with go1.22.2 linux/amd64, deployed on Kind
1.29.1
- E2E tests passed in standard mode.
- Ran a simple preemption test
- Checked the RESTful APIs
- Ran the smoke tests 10 times in the shim

Yu-Lin Chen


[jira] [Resolved] (YUNIKORN-2578) Refactor SchedulerCache.GetPod() remove bool return

2024-05-13 Thread Chia-Ping Tsai (Jira)


[ https://issues.apache.org/jira/browse/YUNIKORN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chia-Ping Tsai resolved YUNIKORN-2578.
--
Fix Version/s: 1.6.0
   Resolution: Fixed

> Refactor SchedulerCache.GetPod() remove bool return
> ---
>
> Key: YUNIKORN-2578
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2578
> Project: Apache YuniKorn
> Issue Type: Task
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Hsien-Cheng(Ryan) Huang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> SchedulerCache {{GetPod()}} and {{GetPodNoLock()}} return two values:
> # *v1.Pod
> # bool
> The boolean value is redundant: it is false when the pod is not found, in 
> which case nil is returned for the pod, and true when the pod has a value. 
> Testing for a nil pod gives the same result.
> We never cache a nil pod for a pod UID.
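
The argument in the description can be illustrated with a small sketch. This is not the yunikorn-k8shim code: `Pod`, `podCache`, `getPodWithBool` and `getPod` are hypothetical stand-ins chosen so the example compiles without Kubernetes dependencies. It only shows that, as long as nil is never stored, a nil check carries the same information as the boolean.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for *v1.Pod so the sketch has no
// Kubernetes dependencies.
type Pod struct {
	UID  string
	Name string
}

// podCache is a hypothetical, simplified stand-in for SchedulerCache.
type podCache struct {
	pods map[string]*Pod
}

// getPodWithBool mirrors the old shape: a pod plus a "found" flag.
func (c *podCache) getPodWithBool(uid string) (*Pod, bool) {
	pod, ok := c.pods[uid]
	return pod, ok
}

// getPod mirrors the refactored shape: nil means "not found".
// This is only equivalent because nil pods are never stored in the map.
func (c *podCache) getPod(uid string) *Pod {
	return c.pods[uid]
}

func main() {
	cache := &podCache{pods: map[string]*Pod{
		"uid-1": {UID: "uid-1", Name: "sleep"},
	}}
	for _, uid := range []string{"uid-1", "uid-2"} {
		oldPod, found := cache.getPodWithBool(uid)
		newPod := cache.getPod(uid)
		// The flag and the nil check must always agree.
		agrees := found == (newPod != nil) && oldPod == newPod
		fmt.Printf("%s: found=%t agrees=%t\n", uid, found, agrees)
	}
}
```

Callers of the single-value form test `pod != nil` instead of reading the flag, which is the simplification the issue asks for.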



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Resolved] (YUNIKORN-2617) Update kindest/node to v1.29.2 for Makefile

2024-05-13 Thread Chia-Ping Tsai (Jira)


[ https://issues.apache.org/jira/browse/YUNIKORN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chia-Ping Tsai resolved YUNIKORN-2617.
--
Fix Version/s: 1.6.0
   Resolution: Fixed

> Update kindest/node to v1.29.2 for Makefile
> ---
>
> Key: YUNIKORN-2617
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2617
> Project: Apache YuniKorn
> Issue Type: Improvement
> Reporter: Chia-Ping Tsai
> Assignee: Hsien-Cheng(Ryan) Huang
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> See https://github.com/apache/yunikorn-k8shim/blob/d884f194b2cf60e574717f60fe648305781b56ef/Makefile#L68






[DISCUSSION] Preemption Hardening

2024-05-13 Thread Manikandan R
Hi Everyone,

Since the preemption feature was released, we have been facing some issues
with its behaviour, especially with extra resource types, unnecessary
victims being killed, etc.

Hence, the Preemption Hardening umbrella jira
https://issues.apache.org/jira/browse/YUNIKORN-2493 has been filed. I've
been working on this for a while. I've written a doc
https://docs.google.com/document/d/1nYtputEluP4Akf3CAu7DdGCfKW_WHJUD0rKjLVMJGj8/edit?usp=sharing
to explain the problem background, the approach taken to solve these
problems, etc. Some of the sub tasks have been completed and the whole
Hardening exercise is nearing completion. An important PR containing the
core changes discussed in the doc is pending review:
https://github.com/apache/yunikorn-core/pull/830

Please read the doc, review the PR and share your feedback. Please also go
through the other sub tasks. If you have come across any issues or weird
behaviour with preemption, I would like your extra attention on this whole
exercise (by running your failed cases against the new code, sharing your
problem statement, revisiting the related open bugs you filed, etc.) so
that we can iterate further and make the feature more stable.

Thank you.

Mani


[jira] [Created] (YUNIKORN-2624) Enable hotlinking to YuniKorn

2024-05-13 Thread Denis Coric (Jira)
Denis Coric created YUNIKORN-2624:
-

Summary: Enable hotlinking to YuniKorn
Key: YUNIKORN-2624
URL: https://issues.apache.org/jira/browse/YUNIKORN-2624
Project: Apache YuniKorn
Issue Type: Sub-task
Components: webapp
Reporter: Denis Coric
Assignee: Denis Coric


Enable third-party apps to link to YuniKorn with the partition and queue 
supplied as query parameters.

The queue and partition should be preselected on the page and stored in the 
application storage using the existing functionality.
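
As a rough sketch of the third-party side of this, the snippet below builds such a link. The issue does not name the query parameters or the target page, so the parameter names `partition` and `queue`, the helper `buildHotlink`, and the example base URL are all assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildHotlink returns a link to a YuniKorn web UI page with the partition
// and queue passed as query parameters. The parameter names "partition" and
// "queue" are assumptions; the issue does not specify them.
func buildHotlink(baseURL, partition, queue string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("partition", partition)
	q.Set("queue", queue)
	// Encode escapes the values and sorts the parameters by key.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Hypothetical host, port and path of a YuniKorn web UI page.
	link, err := buildHotlink("http://localhost:9889/applications", "default", "root.default")
	if err != nil {
		panic(err)
	}
	fmt.Println(link)
}
```

Building the query with `net/url` rather than string concatenation keeps queue names that contain reserved characters correctly escaped.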






[jira] [Created] (YUNIKORN-2623) Create unit test coverage for Clients

2024-05-13 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2623:
--

Summary: Create unit test coverage for Clients
Key: YUNIKORN-2623
URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
Project: Apache YuniKorn
Issue Type: Test
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Follow-up on YUNIKORN-2621.

Create proper coverage for {{clients.Clients}}. See PR comment 
https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.






Improving documentation about observability

2024-05-13 Thread Wiard van Rij
Hello everyone,

I'm getting in touch through the mailing list since I haven't set up my Jira 
account yet.

I'd like to discuss the content found at 
https://yunikorn.apache.org/docs/user_guide/prometheus/. It seems that out of 
the box, it doesn't offer sensible default values. Typically, Prometheus is 
deployed as a comprehensive solution, not just for a single service like 
YuniKorn. Thus, suggesting a configuration change that alters the global 
interval rate to 3 seconds might not be the most advisable approach. Instead, 
I'd argue that adjusting this interval isn't necessary, especially considering 
you're recommending adding another job to the static config.

Specifically, I propose the following adjustments:

  *   Eliminate the global block from the configuration.
  *   If an evaluation_interval is suggested, ensure its value matches the 
scrape interval.
  *   Set the scrape_interval to either 15 seconds or 30 seconds. I lean 
towards 15 seconds as it should be more than adequate.
  *   Encourage users to avoid using overrides in the scrape_configs. Instead, 
they could utilize annotations on the service or implement a ServiceMonitor 
when using Prometheus Operator.
      *   This is honestly an easier solution that doesn't involve changing 
the Prometheus 'core' configuration.

Thanks in advance,

Wiard
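
For readers unfamiliar with the Prometheus Operator option mentioned in the proposal, a ServiceMonitor could look roughly like the fragment below. This is a sketch under stated assumptions, not the project's shipped configuration: the namespace, the `app: yunikorn` label and the `yunikorn-core` port name are guesses about how the YuniKorn service is deployed and must be checked against the actual Service; the metrics path is the one used in the user guide under discussion.

```yaml
# Hypothetical ServiceMonitor for a Prometheus Operator installation.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: yunikorn
  namespace: yunikorn          # assumption: namespace YuniKorn is installed in
spec:
  selector:
    matchLabels:
      app: yunikorn            # assumption: label set on the YuniKorn Service
  endpoints:
    - port: yunikorn-core      # assumption: name of the Service port exposing REST and metrics
      path: /ws/v1/metrics
      # No interval is set here, so the Prometheus default applies and the
      # global configuration stays untouched.
```

Leaving the interval unset keeps the scrape rate a cluster-wide decision, which is the point of the proposal.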



[jira] [Resolved] (YUNIKORN-2570) Add test cases to break the current preemption flow

2024-05-13 Thread Manikandan R (Jira)


[ https://issues.apache.org/jira/browse/YUNIKORN-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikandan R resolved YUNIKORN-2570.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Add test cases to break the current preemption flow
> ---
>
> Key: YUNIKORN-2570
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2570
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Manikandan R
> Assignee: Manikandan R
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Add various test cases that break the current preemption flow. These tests 
> fail for now. The follow-up jira 
> [https://issues.apache.org/jira/browse/YUNIKORN-2500] should fix the problems 
> in the current preemption flow so that these test cases pass.


