[jira] [Resolved] (YUNIKORN-2927) Update MockScheduler test case with foreign pod resource update

2024-11-08 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2927.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Update MockScheduler test case with foreign pod resource update
> ---
>
> Key: YUNIKORN-2927
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2927
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2966) Not all tags are created for foreign allocations

2024-11-08 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2966.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Not all tags are created for foreign allocations
> 
>
> Key: YUNIKORN-2966
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2966
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> When we create an Allocation request to the core, we don't populate 
> allocation tags properly in {{CreateAllocationForForeignPod()}}. We miss 
> the call to {{CreateTagsForTask()}} which adds a number of useful tags such 
> as namespace, pod name, labels, etc.
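
To make the missing step concrete, here is a minimal Go sketch of the kind of tag map the description refers to. It is not the shim's code: the `Pod` type, the tag key constants, and `buildTags` are hypothetical stand-ins, and the real `CreateTagsForTask()` may produce different keys and more tags.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod object.
type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// Hypothetical tag keys, chosen for illustration only.
const (
	tagNamespace   = "example/namespace"
	tagPodName     = "example/pod-name"
	tagLabelPrefix = "example/label/"
)

// buildTags collects descriptive tags (namespace, pod name, labels) that an
// allocation for a foreign pod could carry when it is sent to the core.
func buildTags(pod Pod) map[string]string {
	tags := map[string]string{
		tagNamespace: pod.Namespace,
		tagPodName:   pod.Name,
	}
	for k, v := range pod.Labels {
		tags[tagLabelPrefix+k] = v
	}
	return tags
}

func main() {
	pod := Pod{
		Namespace: "kube-system",
		Name:      "coredns-0",
		Labels:    map[string]string{"app": "dns"},
	}
	tags := buildTags(pod)

	// Sort the keys so the printed order is stable.
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, tags[k])
	}
}
```

The point of the sketch is only that the tag map is built in one helper, so every code path that creates an allocation can call the same function and none of them ends up with a partial set of tags.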






[jira] [Closed] (YUNIKORN-2962) Governance clarification: guidance requested on extending Yunikorn core functionality

2024-11-06 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2962.
--
Resolution: Information Provided

> Governance clarification: guidance requested on extending Yunikorn core 
> functionality
> -
>
> Key: YUNIKORN-2962
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2962
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: David Gantenbein
>Priority: Major
>
> Hey,
>  
> If you’re not aware, G-Research Open Source Software (GR-OSS) has been 
> working in and around the YuniKorn ecosystem for the last several months. 
> We’ve actively contributed a number of enhancements to the project, and 
> several features related to our need for a persistent record of YuniKorn 
> events have been merged[0][1][2]. 
>  
> However, recently many of our contributions to Apache YuniKorn appear to have 
> been reverted unilaterally with minimal explanation. It’s unclear to us where 
> the open discussion about this removal, as required by the ASF Code of 
> Conduct, occurred for this – we’re interested and would like to participate 
> in those technical discussions in the future. We’ve tried to glean the 
> primary points of contention here:
>  
> First, our choice of name for the (formerly) YuniKorn-history-server project 
> was unwise given YuniKorn is a trademark of the Apache Software Foundation 
> (ASF). We’ve rectified this issue by renaming the project to 
> unicorn-history-server. The original name was driven by our hope that 
> unicorn-history-server may one day find its home as an official part of 
> YuniKorn. We hope our swift resolution of this concern is evidence of our 
> commitment to the same open philosophies held by the ASF.
>  
> Next, Craig Condit stated a concern[3] that our changes were geared towards 
> permitting proprietary extensions to YuniKorn. GR-OSS is an open source 
> policy office that does not write any proprietary code as part of our 
> mission. In fact, the unicorn-history-server is Apache 2.0 licensed, just 
> like YuniKorn. Our team made extensive efforts to devise the most minimally 
> intrusive changes possible after it was suggested to us that it was better to 
> be out of tree – we’d be thrilled if the solution to this problem would be 
> the unicorn-history-server being adopted as part of yunikorn-core; in lieu of 
> this, keeping a plugin mechanism is a base-level requirement for the 
> unicorn-history-server to function. We hope that our reputation as good 
> upstream citizens and operators can help you understand that we have no 
> hidden agendas – we aren’t even a product company.
>  
> Finally, it was suggested[4] that the getApplication API endpoint was 
> inappropriate due to its exposure of internal YuniKorn data structures. We’re 
> open to feedback regarding this feature and how to improve it, but again, 
> we’re confused as to where these discussions are happening and how to get 
> involved in them. In the original proposal of this feature[5], we added tests 
> and modified the implementation at the request of project maintainers – it’s 
> upsetting to have all that work and cooperation discarded without even a 
> conversation.
>  
> Our desire is to remain a part of the YuniKorn community, but we’re very 
> confused about the governance and technical design process for the project – 
> according to [https://yunikorn.apache.org/community/people/], the maintainers 
> reverting our patches are at the same leadership level as those who approved 
> the patches originally. Can we get some clarity on the reasoning for 
> reverting the patches and documentation of the open community collaboration, 
> as required by the ASF Code of Conduct, that precipitated this removal – this 
> appeared to have been mentioned in the October 30th meeting[6], but this date 
> is after the revert of the patches, so we assume there must have been another 
> discussion.
>  
> If there’s anything further we need to do in order to spawn the technical 
> dialogue needed to address your concerns with unicorn-history-server and the 
> supporting elements, please let us know. Our desire is to implement something 
> in an open way that meets not only the needs of G-Research, but also those of 
> the overall YuniKorn community.
>  
> Thanks for your time,
> Rich Scott, Open Source Developer
> Denis Coric, Open Source Developer
> Jay Faulkner, Open Source Developer
> Dave Gantenbein, Director of Software Development
> Alexander Scammon, Head of Open Source Development
> G-Research Open Source Software
>  
> 0: https://issues.apache.org/jira/browse/YUNIKORN-2606
> 1: https://issues.apache.org/jira/browse/YUNIKORN-2652  
> 2: [http://tiny.cc/ag5tzz]
> 3: https://issues.apache.org/jira/browse/YUNIKORN-29

[jira] [Commented] (YUNIKORN-2962) Governance clarification: guidance requested on extending Yunikorn core functionality

2024-11-06 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896047#comment-17896047
 ] 

Craig Condit commented on YUNIKORN-2962:


With regard to the ASF Code of Conduct, no violations (alleged or otherwise) 
have occurred. The ASF Code of Conduct says nothing whatsoever about "requiring 
open discussion" about code reverts. That said, in each case that you have made 
this assertion, we have, in fact, operated by your own standard (i.e. "open 
discussion"). The open discussion happened on the JIRA and GitHub PRs for the 
relevant issues under discussion here.

A point of correction: there have not been "several contributions reverted". 
There has only been *one* previously-committed PR that has been 
subsequently reverted: YUNIKORN-2606 (Modular sidebar with remote components), 
which was reverted by YUNIKORN-2954. This JIRA was not "unilaterally reverted" 
– in fact, YUNIKORN-2954 was submitted by a PMC member (myself) along with 
relevant documentation as to why, and approved by [~pbacsko], another PMC 
member. In between, YUNIKORN-2949 (Load external Scheduler Service using Module 
Federation), which you failed to mention here, was submitted, building upon 
YUNIKORN-2606 and if committed, would have wholesale replaced huge portions of 
the YuniKorn Web UI without any user-visible indication that what was being 
displayed on screen was not, in fact, a part of Apache YuniKorn. This was 
rejected, but its submission triggered further review of YUNIKORN-2606 (the 
implications of which had not yet been fully understood) and it became apparent 
that this was not a direction we wanted to pursue. I opened YUNIKORN-2954 and 
provided my justification *in the JIRA description and pull request*. 
Nothing was done arbitrarily or in secret as you allege.

Additionally, the assertion that the "patches" (only one in fact) were reverted 
by maintainers at the "same leadership level as those who approved the 
patch[es] originally" is false – YUNIKORN-2606 was approved by a single 
committer, and the reversion was submitted by a PMC member and approved by a 
second PMC member.

The only other potential revert that is on the table is YUNIKORN-2925 (Remove 
internal objects from application REST endpoint), which was created by 
[~wilfreds], another PMC member. I agree with this revert as well; we don't 
want to have internal objects in the REST API, nor historical information. That 
is what we have built the YuniKorn event system for. [~pbacsko], who approved 
the original PR, has also commented on the reversion with additional REST API 
endpoints that should probably be cleaned up as well. This would seem to 
indicate that he too has had a change of heart regarding the wisdom of keeping 
the original PR intact. The fact is, we don't revert commits arbitrarily or 
frequently, but sometimes things slip through and we need to course-correct.

Now for some less-technical points...

Project naming of G-Research History Server: Simply changing the spelling of 
"yunikorn" to "unicorn" is not sufficient differentiation under U.S. Trademark 
Law, as this would almost certainly run afoul of the [confusingly 
similar|https://www.law.cornell.edu/wex/confusingly_similar] test. Some 
possible suggestions: Use your company name (e.g. G-Research History Server), a 
generic identifier (Scheduling History Server) or pick a distinct project name, 
e.g. Monocerus, another mythical creature related to the unicorn. Trademark law 
also allows you to reference trademarked entities in your documentation. For 
example, this would be okay: "G-Research History Server is a history service 
designed to integrate with the Apache YuniKorn scheduler for Kubernetes". This 
makes clear that your project is independent, while also providing clarity as 
to its purpose. Ultimately, it's your project – name it whatever you want 
(while respecting trademark law).

Regarding the comment that "it was suggested to us that it was better [for the 
history server] to be out of tree", this was discussed during the initial 
proposal by G-Research of the history server during the May 1, 2024 YuniKorn 
community meeting. As I recall, there were significant concerns raised about 
the validity of augmenting the REST API with history information. We were by 
this point well underway with designs and development of the (now mature) event 
system for YuniKorn, where real-time events would be emitted to an external 
consumer. A very large motivator for that design was to ensure that a future 
history server (yes, you're not the first to think of one) would be able to 
scale well and not bog down YuniKorn itself with non-scheduler overhead. The 
G-Research approach was very much at odds with that (already agreed upon) 
direction. When these concerns were raised, it was suggested that perhaps the 
G-Research history server would be better d

[jira] [Commented] (YUNIKORN-2925) Remove internal objects from application REST response

2024-11-04 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895394#comment-17895394
 ] 

Craig Condit commented on YUNIKORN-2925:


Historical information has no place in the REST API at all. That's what the 
event system is for.

> Remove internal objects from application REST response
> --
>
> Key: YUNIKORN-2925
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2925
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: release-notes
>
> The REST API for application objects exposes an internal object type 
> (resource) directly without conversion. That means any internal 
> representation change will break REST compatibility. This should never have 
> happened and needs to be reversed ASAP. All other REST calls 
> The other problem with the exposed information is that it is only accurate 
> for the COMPLETING or COMPLETED state of an application. The data is 
> incomplete at any other state as it is only updated when an allocation 
> finishes. Running allocations are not included. 
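
The compatibility concern above can be illustrated with a small Go sketch. The names `internalResource`, `ResourceDAO`, and `toDAO` are hypothetical and are not YuniKorn types; the sketch only shows the general pattern of converting an internal object into a separate REST-facing type, so that the internal representation can change without altering the JSON that clients see.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// internalResource is a hypothetical internal type. Its fields are unexported
// and may change between releases.
type internalResource struct {
	quantities map[string]int64
}

// ResourceDAO is a hypothetical REST-facing type with a stable JSON shape.
type ResourceDAO map[string]int64

// toDAO copies the internal values into the REST-facing type, so the response
// never aliases or directly serializes the internal object.
func toDAO(r *internalResource) ResourceDAO {
	dao := make(ResourceDAO)
	if r == nil {
		return dao
	}
	for name, value := range r.quantities {
		dao[name] = value
	}
	return dao
}

func main() {
	res := &internalResource{quantities: map[string]int64{"memory": 2048, "vcore": 1000}}
	out, err := json.Marshal(toDAO(res))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With a conversion step like this, a change to `internalResource` only requires updating `toDAO`, while the serialized form stays the same.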






[jira] [Resolved] (YUNIKORN-2953) Placeholder release count incorrect

2024-11-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2953.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Placeholder release count incorrect
> ---
>
> Key: YUNIKORN-2953
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2953
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - scheduler
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Even after YUNIKORN-2926 we have not fully fixed the placeholder release 
> count issue. 
> The release of allocated placeholders is counted twice on timeout: first on 
> release, as part of the cleanup that is triggered, and then again when the 
> allocation is actually removed.






[jira] [Resolved] (YUNIKORN-2956) Fix layout break on Queues v2 page

2024-10-30 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2956.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Fix layout break on Queues v2 page
> --
>
> Key: YUNIKORN-2956
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2956
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Resolved] (YUNIKORN-2928) [core] Update foreign pod resource usage

2024-10-30 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2928.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> [core] Update foreign pod resource usage
> 
>
> Key: YUNIKORN-2928
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2928
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Resolved] (YUNIKORN-2951) Remove unnecessary locking from RequiredNodePreemptor

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2951.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Remove unnecessary locking from RequiredNodePreemptor
> -
>
> Key: YUNIKORN-2951
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2951
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.7.0
>
>
> RequiredNodePreemptor takes a lock in some places before reading and writing. 
> Based on the assessment, there is no reason to use locks, so they 
> should be removed.






[jira] [Reopened] (YUNIKORN-2606) Modular sidebar with remote components

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit reopened YUNIKORN-2606:


> Modular sidebar with remote components
> --
>
> Key: YUNIKORN-2606
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2606
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
> Attachments: image-2024-05-07-18-25-08-070.png
>
>
> -We need a link to the external application that will display logs and more 
> details about the application or the pod itself.- 
> -External URLs can be defined in the form of a string template that can be 
> set as an env variable.-
> -If the variable is present on build time, the Logs link will be visible on 
> the UI.-
> To minimize changes in YuniKorn itself and enable maximal customization 
> and easy connection with the YuniKorn History Server (YHS) that is being 
> developed, the easiest solution would be to add an externally loaded component 
> by using module federation. Components will be served by the YHS server 
> (changes on YHS endpoints would reflect in web components as well) and loaded 
> in YuniKorn web with Module Federation. 
> This ticket should add the required configuration for loading a custom module 
> that will be enabled through the env variables. If env is not set, YuniKorn 
> will work as usual (no changes to the default behavior)
> !image-2024-05-07-18-25-08-070.png|width=1240,height=647!






[jira] [Updated] (YUNIKORN-2951) Remove unnecessary locking from RequiredNodePreemptor

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2951:
---
Summary: Remove unnecessary locking from RequiredNodePreemptor  (was: 
RequiredNodePreemptor doesn't require lock)

> Remove unnecessary locking from RequiredNodePreemptor
> -
>
> Key: YUNIKORN-2951
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2951
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: newbie, pull-request-available
>
> RequiredNodePreemptor takes a lock in some places before reading and writing. 
> Based on the assessment, there is no reason to use locks, so they 
> should be removed.






[jira] [Resolved] (YUNIKORN-2609) Improve visual style of the Web UI

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2609.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

> Improve visual style of the Web UI
> --
>
> Key: YUNIKORN-2609
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2609
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: Denis Coric
>Assignee: JunHong Peng
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.7.0
>
>
> Implement required CSS changes to tweak the overall look and feel of the web 
> UI.
> The full design can be previewed on this link: [ 
> [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] 
> ]
> This should include:
>  * Fix padding/margin values
>  * Add rounding on elements to match the design (menu selection, dropdowns, 
> etc)
>  * Fix font weight on visual elements to match the design
> _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_






[jira] [Closed] (YUNIKORN-2606) Modular sidebar with remote components

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2606.
--
 Fix Version/s: (was: 1.7.0)
Target Version:   (was: 1.7.0)
Resolution: Won't Do

Removed in YUNIKORN-2954.

> Modular sidebar with remote components
> --
>
> Key: YUNIKORN-2606
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2606
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-07-18-25-08-070.png
>
>
> -We need a link to the external application that will display logs and more 
> details about the application or the pod itself.- 
> -External URLs can be defined in the form of a string template that can be 
> set as an env variable.-
> -If the variable is present on build time, the Logs link will be visible on 
> the UI.-
> To minimize changes in YuniKorn itself and enable maximal customization 
> and easy connection with the YuniKorn History Server (YHS) that is being 
> developed, the easiest solution would be to add an externally loaded component 
> by using module federation. Components will be served by the YHS server 
> (changes on YHS endpoints would reflect in web components as well) and loaded 
> in YuniKorn web with Module Federation. 
> This ticket should add the required configuration for loading a custom module 
> that will be enabled through the env variables. If env is not set, YuniKorn 
> will work as usual (no changes to the default behavior)
> !image-2024-05-07-18-25-08-070.png|width=1240,height=647!






[jira] [Updated] (YUNIKORN-2954) Remove so-called modular sidebar

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2954:
---
Target Version:   (was: 1.7.0)

> Remove so-called modular sidebar
> 
>
> Key: YUNIKORN-2954
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2954
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: webapp
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
>
> We need to revert YUNIKORN-2606, as it should never have been merged. It has 
> become clear that it exists only to provide invasive hooks for adding 
> proprietary and/or non-standard components to YuniKorn. It also opens up 
> YuniKorn to potential remote code execution vulnerabilities. This goes 
> against our open development philosophy. 






[jira] [Closed] (YUNIKORN-2954) Remove so-called modular sidebar

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2954.
--

> Remove so-called modular sidebar
> 
>
> Key: YUNIKORN-2954
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2954
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: webapp
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
>
> We need to revert YUNIKORN-2606, as it should never have been merged. It has 
> become clear that it exists only to provide invasive hooks for adding 
> proprietary and/or non-standard components to YuniKorn. It also opens up 
> YuniKorn to potential remote code execution vulnerabilities. This goes 
> against our open development philosophy. 






[jira] [Updated] (YUNIKORN-2954) Remove so-called modular sidebar

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2954:
---
Fix Version/s: (was: 1.7.0)

> Remove so-called modular sidebar
> 
>
> Key: YUNIKORN-2954
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2954
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: webapp
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
>
> We need to revert YUNIKORN-2606, as it should never have been merged. It has 
> become clear that it exists only to provide invasive hooks for adding 
> proprietary and/or non-standard components to YuniKorn. It also opens up 
> YuniKorn to potential remote code execution vulnerabilities. This goes 
> against our open development philosophy. 






[jira] [Resolved] (YUNIKORN-2954) Remove so-called modular sidebar

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2954.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Remove so-called modular sidebar
> 
>
> Key: YUNIKORN-2954
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2954
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: webapp
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> We need to revert YUNIKORN-2606, as it should never have been merged. It has 
> become clear that it exists only to provide invasive hooks for adding 
> proprietary and/or non-standard components to YuniKorn. It also opens up 
> YuniKorn to potential remote code execution vulnerabilities. This goes 
> against our open development philosophy. 






[jira] [Created] (YUNIKORN-2954) Remove so-called modular sidebar

2024-10-29 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2954:
--

 Summary: Remove so-called modular sidebar
 Key: YUNIKORN-2954
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2954
 Project: Apache YuniKorn
  Issue Type: Task
  Components: webapp
Reporter: Craig Condit
Assignee: Craig Condit


We need to revert YUNIKORN-2606, as it should never have been merged. It has 
become clear that it exists only to provide invasive hooks for adding 
proprietary and/or non-standard components to YuniKorn. It also opens up 
YuniKorn to potential remote code execution vulnerabilities. This goes against 
our open development philosophy. 






[jira] [Resolved] (YUNIKORN-2908) Remove associated metrics when queue is removed

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2908.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Remove associated metrics when queue is removed
> ---
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Hengzhe Guo
>Assignee: Hengzhe Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> 1. After a queue is removed, its metrics continue to be reported by 
> Prometheus. This is fine for metrics like allocated resource because they 
> will just be 0, but it doesn't make sense for guaranteed and max resources, 
> as it gives the wrong impression that resources are still given to the queue. I 
> propose to unregister all of the queue's metrics when it's removed.
> 2. If the queue is not removed but the guaranteed or max resource config is 
> removed, or just a resource type is removed from the config, the metrics are 
> also not cleaned up. These metrics are only updated when there's a new valid 
> value, not when the value is 'null'. I propose to always delete all existing 
> guaranteed and max resource metrics of the queue and then add back the new 
> values every time we apply the configs.
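
The two proposals above can be sketched in Go with a plain in-memory map standing in for the metrics registry. This is an illustration under stated assumptions, not the merged change: `limitMetrics`, `applyConfig`, and `removeQueue` are hypothetical names, and a real implementation would work against the Prometheus client rather than a map.

```go
package main

import "fmt"

// limitMetrics is a hypothetical in-memory stand-in for a metrics registry.
// Outer keys are queue names; inner keys look like "guaranteed/memory".
type limitMetrics struct {
	byQueue map[string]map[string]float64
}

func newLimitMetrics() *limitMetrics {
	return &limitMetrics{byQueue: make(map[string]map[string]float64)}
}

// applyConfig drops every existing guaranteed and max entry for the queue and
// re-adds only what the new config contains, so entries that were removed
// from the config do not linger with stale values.
func (m *limitMetrics) applyConfig(queue string, guaranteed, maxRes map[string]float64) {
	fresh := make(map[string]float64, len(guaranteed)+len(maxRes))
	for res, v := range guaranteed {
		fresh["guaranteed/"+res] = v
	}
	for res, v := range maxRes {
		fresh["max/"+res] = v
	}
	m.byQueue[queue] = fresh
}

// removeQueue discards all metrics of a queue when the queue itself is removed.
func (m *limitMetrics) removeQueue(queue string) {
	delete(m.byQueue, queue)
}

func main() {
	m := newLimitMetrics()
	m.applyConfig("root.batch",
		map[string]float64{"memory": 4096},
		map[string]float64{"memory": 8192, "vcore": 4000})

	// The guaranteed config and the vcore limit are removed from the config.
	m.applyConfig("root.batch", nil, map[string]float64{"memory": 8192})
	fmt.Println("entries after config change:", len(m.byQueue["root.batch"]))

	m.removeQueue("root.batch")
	fmt.Println("queues after removal:", len(m.byQueue))
}
```

Replacing the whole set on every config update trades a little extra work for not having to track which individual resource types disappeared between two configs.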






[jira] [Updated] (YUNIKORN-2908) Remove associated metrics when queue is removed

2024-10-29 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2908:
---
Summary: Remove associated metrics when queue is removed  (was: metrics not 
removed when a queue is removed)

> Remove associated metrics when queue is removed
> ---
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Hengzhe Guo
>Assignee: Hengzhe Guo
>Priority: Major
>  Labels: pull-request-available
>
> 1. After a queue is removed, its metrics continue to be reported by 
> Prometheus. This is fine for metrics like allocated resource because they 
> will just be 0, but it doesn't make sense for guaranteed and max resources, 
> as it gives the wrong impression that resources are still given to the queue. I 
> propose to unregister all of the queue's metrics when it's removed.
> 2. If the queue is not removed but the guaranteed or max resource config is 
> removed, or just a resource type is removed from the config, the metrics are 
> also not cleaned up. These metrics are only updated when there's a new valid 
> value, not when the value is 'null'. I propose to always delete all existing 
> guaranteed and max resource metrics of the queue and then add back the new 
> values every time we apply the configs.






[jira] [Resolved] (YUNIKORN-2948) Add MockScheduler test which verifies foreign pod tracking

2024-10-28 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2948.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Add MockScheduler test which verifies foreign pod tracking
> --
>
> Key: YUNIKORN-2948
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2948
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Based on the design docs, we should create a MockScheduler-based unit test in 
> the shim that validates foreign pod tracking.






[jira] [Updated] (YUNIKORN-2948) Add MockScheduler test which verifies foreign pod tracking

2024-10-28 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2948:
---
Summary: Add MockScheduler test which verifies foreign pod tracking  (was: 
[shim] Write MockScheduler test which verifies foreign pod tracking)

> Add MockScheduler test which verifies foreign pod tracking
> --
>
> Key: YUNIKORN-2948
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2948
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>
> Based on the design docs, we should create a MockScheduler-based unit test in 
> the shim that validates foreign pod tracking.






[jira] [Resolved] (YUNIKORN-2931) Create foreign pod e2e tests

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2931.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Create foreign pod e2e tests 
> -
>
> Key: YUNIKORN-2931
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2931
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Commented] (YUNIKORN-2949) Load external Scheduler Service using Module Federation

2024-10-25 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892835#comment-17892835
 ] 

Craig Condit commented on YUNIKORN-2949:


Also, the so-called "YuniKorn History Service" is appropriating an Apache 
trademark without permission. It cannot be called that, as it gives the 
impression it is an official Apache YuniKorn project.

> Load external Scheduler Service using Module Federation
> ---
>
> Key: YUNIKORN-2949
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2949
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
>
> Add an option to load external Scheduler Service in Applications View using 
> the Module Federation.
> This will only be enabled if the correct env variables are set. If not, the 
> application must behave as is.
>  






[jira] [Commented] (YUNIKORN-2949) Load external Scheduler Service using Module Federation

2024-10-25 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892834#comment-17892834
 ] 

Craig Condit commented on YUNIKORN-2949:


This has got to stop. We shouldn't be adding hooks for proprietary or 
unsupported third-party hooks into the YuniKorn codebase. If there's meant to 
be an official history service, it should be done under the Apache umbrella. 
I'm a firm -1 on this.

> Load external Scheduler Service using Module Federation
> ---
>
> Key: YUNIKORN-2949
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2949
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
>
> Add an option to load external Scheduler Service in Applications View using 
> the Module Federation.
> This will only be enabled if the correct env variables are set. If not, the 
> application must behave as is.
>  






[jira] [Closed] (YUNIKORN-2949) Load external Scheduler Service using Module Federation

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2949.
--
Resolution: Won't Do

> Load external Scheduler Service using Module Federation
> ---
>
> Key: YUNIKORN-2949
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2949
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
>
> Add an option to load external Scheduler Service in Applications View using 
> the Module Federation.
> This will only be enabled if the correct env variables are set. If not, the 
> application must behave as is.
>  






[jira] [Resolved] (YUNIKORN-2943) Fix typo in Prometheus monitoring guide

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2943.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Fix typo in Prometheus monitoring guide
> ---
>
> Key: YUNIKORN-2943
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2943
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Tzu-Hua Lan
>Assignee: Tzu-Hua Lan
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Fix a typo in the Prometheus and Grafana monitoring 
> [documentation|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#3-use-service-mointor-to-define-monitor-yunikorn-service-target].
> Change:
> - Before: "3. Use Service Mointor to Define monitor yunikorn service target"
> - After: "3. Use Service Monitor to Define monitor yunikorn service target"
> This fixes the misspelling of "Monitor" in the section heading.






[jira] [Updated] (YUNIKORN-2941) Remove plugin mode from the install section of Getting Started

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2941:
---
Summary: Remove plugin mode from the install section of Getting Started  
(was: Remove plugin mode from the install section of Get Started)

> Remove plugin mode from the install section of Getting Started
> --
>
> Key: YUNIKORN-2941
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2941
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Michael Chu
>Assignee: Michael Chu
>Priority: Minor
>  Labels: newbie, pull-request-available
>
> Since plugin mode is now deprecated and will be removed in a future release, 
> it would be better to add a deprecated tag to the plugin mode in the install 
> section to prevent any confusion.






[jira] [Updated] (YUNIKORN-2894) Update KubeRay operator documentation for YuniKorn integration

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2894:
---
Summary: Update KubeRay operator documentation for YuniKorn integration  
(was: [Docs][RayCluster]update KubeRay operator documentation for YuniKorn 
integration)

> Update KubeRay operator documentation for YuniKorn integration
> --
>
> Key: YUNIKORN-2894
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2894
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Hsien-Cheng(Ryan) Huang
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: pull-request-available
>
> KubeRay now supports gang scheduling via this PR: 
> https://github.com/ray-project/kuberay/pull/2396 
> and it has been available since the 1.2.0 release: 
> https://github.com/ray-project/kuberay/releases/tag/v1.2.0
> Proposed modifications:
> 1. specify version update to v1.2.2
> 2. document updates based on ray-docs: 
> https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/yunikorn.html






[jira] [Resolved] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboard

2024-10-25 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2825.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Fix the job name of pprof dashboard in Grafana dashboard
> 
>
> Key: YUNIKORN-2825
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2825
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: documentation
>Reporter: Yu-Lin Chen
>Assignee: Tzu-Hua Lan
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.7.0
>
> Attachments: image-2024-08-21-16-49-46-234.png, 
> image-2024-08-21-16-50-44-173.png
>
>
> After following the steps in "[Prometheus and 
> Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]"
>  to deploy Grafana, if you import the pprof dashboard through 
> "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]",
>  no metrics are displayed.
> The reason is that the job name in yunikorn-pprof.json doesn't match what we 
> get from the Prometheus operator.
> We should fix the job name in yunikorn-pprof.json.
> ex:
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}",
> {code}
> should change to
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}",
> {code}
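
One likely reason the original expression returns nothing is that Prometheus fully anchors regular-expression label matchers, so `job=~"yunikorn"` only matches a job named exactly `yunikorn`. The Go sketch below imitates that anchoring with the standard `regexp` package. The `matchesJob` helper is hypothetical and is not Prometheus code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesJob imitates a fully anchored regex label matcher: the pattern has
// to match the whole label value, not just a substring of it.
func matchesJob(pattern, job string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(job)
}

func main() {
	fmt.Println(matchesJob("yunikorn", "yunikorn-service"))   // false
	fmt.Println(matchesJob("yunikorn.*", "yunikorn-service")) // true
}
```

An exact match on `yunikorn-service`, as proposed in the issue, avoids the regex entirely.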






[jira] [Updated] (YUNIKORN-2941) Remove plugin mode from the install section of Get Started

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2941:
---
Summary: Remove plugin mode from the install section of Get Started  (was: 
Add a deprecated tag to plugin mode in the install section of Get Started)

> Remove plugin mode from the install section of Get Started
> --
>
> Key: YUNIKORN-2941
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2941
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Michael Chu
>Assignee: Michael Chu
>Priority: Minor
>  Labels: newbie, pull-request-available
>
> Since plugin mode is now deprecated and will be removed in a future release, 
> it would be better to add a deprecated tag to the plugin mode in the install 
> section to prevent any confusion.






[jira] [Updated] (YUNIKORN-2945) Add punctuation for better clarity in Scheduler Configuration

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2945:
---
Summary: Add punctuation for better clarity in Scheduler Configuration  
(was: Add punctuations for typos and better clarity in Scheduler Configuration)

> Add punctuation for better clarity in Scheduler Configuration
> -
>
> Key: YUNIKORN-2945
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2945
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: website
>Reporter: Michael Chu
>Assignee: Michael Chu
>Priority: Minor
>  Labels: pull-request-available
>
> Add punctuation to fix typos and improve clarity.
> Changes:
>  # "Pre emption setting" >> "Pre-emption setting"
>  # "user based" >> "user-based"
>  # "cluster-wide" >> "cluster wide"
>  # "In other words when the access control list of a queue does not allow 
> access the parent queue is checked." 
> >>
> "In other words, when the access control list of a queue does not allow 
> access, the parent queue is checked." 






[jira] [Updated] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboard

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2825:
---
Summary: Fix the job name of pprof dashboard in Grafana dashboard  (was: 
Fix the job name of pprof dashboard in Grafana dashboar)

> Fix the job name of pprof dashboard in Grafana dashboard
> 
>
> Key: YUNIKORN-2825
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2825
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: documentation
>Reporter: Yu-Lin Chen
>Assignee: Tzu-Hua Lan
>Priority: Major
>  Labels: newbie, pull-request-available
> Attachments: image-2024-08-21-16-49-46-234.png, 
> image-2024-08-21-16-50-44-173.png
>
>
> After following the steps in "[Prometheus and 
> Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]"
>  to deploy Grafana, if you import the pprof dashboard through 
> "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]",
>  no metrics are displayed.
> The reason is that the job name in yunikorn-pprof.json doesn't match what we 
> get from the Prometheus operator.
> We should fix the job name in yunikorn-pprof.json.
> ex:
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}",
> {code}
> should change to
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}",
> {code}






[jira] [Updated] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboar

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2825:
---
Summary: Fix the job name of pprof dashboard in Grafana dashboar  (was: Fix 
the job name of pprof dashboard in Grafana dashboard example)

> Fix the job name of pprof dashboard in Grafana dashboar
> ---
>
> Key: YUNIKORN-2825
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2825
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: documentation
>Reporter: Yu-Lin Chen
>Assignee: Tzu-Hua Lan
>Priority: Major
>  Labels: newbie, pull-request-available
> Attachments: image-2024-08-21-16-49-46-234.png, 
> image-2024-08-21-16-50-44-173.png
>
>
> After following the steps in "[Prometheus and 
> Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]"
>  to deploy Grafana, if you import the pprof dashboard through 
> "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]",
>  no metrics are displayed.
> The reason is that the job name in yunikorn-pprof.json doesn't match what we 
> get from the Prometheus operator.
> We should fix the job name in yunikorn-pprof.json.
> ex:
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}",
> {code}
> should change to
> {code:bash}
> "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}",
> {code}






[jira] [Resolved] (YUNIKORN-2894) Update KubeRay operator documentation for YuniKorn integration

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2894.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Update KubeRay operator documentation for YuniKorn integration
> --
>
> Key: YUNIKORN-2894
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2894
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Hsien-Cheng(Ryan) Huang
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> KubeRay now supports gang scheduling via this PR: 
> https://github.com/ray-project/kuberay/pull/2396 
> and it has been available since the 1.2.0 release: 
> https://github.com/ray-project/kuberay/releases/tag/v1.2.0
> Proposed modifications:
> 1. specify version update to v1.2.2
> 2. document updates based on ray-docs: 
> https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/yunikorn.html






[jira] [Resolved] (YUNIKORN-2848) Refactor preemption_queue_test.go

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2848.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Refactor preemption_queue_test.go
> -
>
> Key: YUNIKORN-2848
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2848
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Refactor the TestGetPreemptableResource test to use the variables and syntax 
> constructs used in TestGetRemainingGuaranteedResource, for example the 
> variables defined for test resources (res1, res2, etc.) and 
> t.Run({}).






[jira] [Resolved] (YUNIKORN-2824) Refactor preemption_test.go

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2824.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Refactor preemption_test.go
> ---
>
> Key: YUNIKORN-2824
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2824
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Manikandan R
>Assignee: Yun Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Of late, a lot of new tests have been added to preemption_test.go for several 
> use cases. There is room for improvement in simplifying the tests, 
> especially by avoiding duplication. For example,
> TestTryPreemption, TestTryPreemptionOnNode and 
> TestTryPreemption_NodeWithCapacityLesserThanAsk can be merged into a 
> single test that handles each case through the t.Run({}) construct.
> We need to analyse the other tests as well and see if they can be merged 
> for simplification.






[jira] [Resolved] (YUNIKORN-2926) Placeholder counters incorrect

2024-10-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2926.

Fix Version/s: 1.7.0
   1.6.1
   Resolution: Fixed

Merged to master and cherry-picked to branch-1.6.

> Placeholder counters incorrect
> --
>
> Key: YUNIKORN-2926
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: wangzhihui
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.1
>
> Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> desc:
>  The reason is that the real allocation is larger than all placeholders, so 
> all allocations are released, leaving all Pods in the Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:yaml}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: 
> "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "task-group-example",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "50M"
>               },
>               "nodeSelector": {},
>               "tolerations": [],
>               "affinity": {},
>               "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", ""]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M" {code}
> solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is 
> in Soft mode, it will transition to a Resuming state.
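
The mismatch in the job above can be spotted by comparing the container request with the task group's `minResource`: the pod asks for 200m CPU while the placeholder reserves 100m. Below is a minimal sketch of such a check, assuming resources are plain integer maps (CPU in millicores, memory in megabytes). The `exceedsPlaceholder` helper is hypothetical and is not part of YuniKorn.

```go
package main

import (
	"fmt"
	"sort"
)

// exceedsPlaceholder returns the resource names for which the real request
// is larger than what the placeholder reserved.
func exceedsPlaceholder(request, placeholder map[string]int64) []string {
	var over []string
	for name, want := range request {
		if want > placeholder[name] {
			over = append(over, name)
		}
	}
	sort.Strings(over)
	return over
}

func main() {
	placeholder := map[string]int64{"cpu": 100, "memory": 50} // minResource
	request := map[string]int64{"cpu": 200, "memory": 50}     // container requests
	fmt.Println(exceedsPlaceholder(request, placeholder))
}
```

With the values from the job spec, only `cpu` is reported, which is the case the issue describes.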






[jira] [Resolved] (YUNIKORN-2827) Remove unused columns in Nodes view

2024-10-22 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2827.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Remove unused columns in Nodes view
> ---
>
> Key: YUNIKORN-2827
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2827
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: Craig Condit
>Assignee: Tzu-Hua Lan
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.7.0
>
>
> The nodes view currently has two attribute columns which are unused, as well 
> as Rack Name and Host Name, which are always n/a. We should remove these from 
> the view as they take almost half the horizontal space.






[jira] [Updated] (YUNIKORN-2926) Placeholder counters incorrect

2024-10-22 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2926:
---
Summary: Placeholder counters incorrect  (was: The Pod using gang 
scheduling is stuck in the Pending state)

> Placeholder counters incorrect
> --
>
> Key: YUNIKORN-2926
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: wangzhihui
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> desc:
>  The reason is that the real allocation is larger than all placeholders, so 
> all allocations are released, leaving all Pods in the Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:yaml}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: 
> "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "task-group-example",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "50M"
>               },
>               "nodeSelector": {},
>               "tolerations": [],
>               "affinity": {},
>               "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", ""]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M" {code}
> solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is 
> in Soft mode, it will transition to a Resuming state.






[jira] [Resolved] (YUNIKORN-2913) Fix contrast issue in Applications view

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2913.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Fix contrast issue in Applications view
> ---
>
> Key: YUNIKORN-2913
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2913
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> There is a small bug on the applications page: when an application is 
> selected, resources that are printed using the mat-chip component are not 
> visible due to low contrast.
> The issue is hard to notice as the sidebar covers most of that table, but at 
> some unusual resolutions (or window sizes) it can be noticed.
> The solution is to set the font color to white on selected-row.






[jira] [Updated] (YUNIKORN-2913) Fix contrast issue in Applications view

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2913:
---
Priority: Minor  (was: Major)
 Summary: Fix contrast issue in Applications view  (was: FFix contrast 
issue in Applications view)

> Fix contrast issue in Applications view
> ---
>
> Key: YUNIKORN-2913
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2913
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Minor
>  Labels: pull-request-available
>
> There is a small bug on the applications page: when an application is 
> selected, resources that are printed using the mat-chip component are not 
> visible due to low contrast.
> The issue is hard to notice as the sidebar covers most of that table, but at 
> some unusual resolutions (or window sizes) it can be noticed.
> The solution is to set the font color to white on selected-row.






[jira] [Updated] (YUNIKORN-2913) FFix contrast issue in Applications view

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2913:
---
Summary: FFix contrast issue in Applications view  (was: Applications view 
CSS issue)

> FFix contrast issue in Applications view
> 
>
> Key: YUNIKORN-2913
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2913
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Denis Coric
>Assignee: Denis Coric
>Priority: Major
>  Labels: pull-request-available
>
> There is a small bug on the applications page: when an application is 
> selected, resources that are rendered using the mat-chip component are not 
> visible due to low contrast.
> The issue is hard to notice because the sidebar covers most of that table, but 
> at some unusual resolutions (or window sizes) it becomes visible.
> The solution is to set the font color to white on selected-row.






[jira] [Updated] (YUNIKORN-2923) Fix invalid routerLink setting in header breadcrumb

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2923:
---
Priority: Trivial  (was: Major)

> Fix invalid routerLink setting in header breadcrumb
> ---
>
> Key: YUNIKORN-2923
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2923
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Clicking the header breadcrumb currently navigates to '#/crumb.url'.






[jira] [Updated] (YUNIKORN-2349) Allow changing the orientation of the queue graph

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2349:
---
Summary: Allow changing the orientation of the queue graph  (was: Change 
the orientation of the queue SVG)

> Allow changing the orientation of the queue graph
> -
>
> Key: YUNIKORN-2349
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2349
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Dong-Lin Hsieh
>Assignee: Dong-Lin Hsieh
>Priority: Major
>  Labels: pull-request-available
>
> Change the orientation of the queue SVG






[jira] [Resolved] (YUNIKORN-2349) Allow changing the orientation of the queue graph

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2349.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Allow changing the orientation of the queue graph
> -
>
> Key: YUNIKORN-2349
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2349
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: webapp
>Reporter: Dong-Lin Hsieh
>Assignee: Dong-Lin Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Change the orientation of the queue SVG






[jira] [Commented] (YUNIKORN-2791) [Umbrella] Tracking non-Yunikorn allocations in the core

2024-10-17 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890545#comment-17890545
 ] 

Craig Condit commented on YUNIKORN-2791:


This is all shaping up nicely. One suggestion: I've been looking at the full 
state dump now that the feature is active, and it seems to me that we should 
populate allocationTags for foreign allocations as well. This makes things like 
pod name and other metadata available. Especially once we do the web UI changes 
to expose this information, we're going to want it.

> [Umbrella] Tracking non-Yunikorn allocations in the core
> 
>
> Key: YUNIKORN-2791
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2791
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - scheduler, scheduler-interface, shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: release-notes
>
> Currently, we don't know which non-YK pods are assigned to a particular node 
> in the core. We only track the total amount of allocations as the 
> {{occupiedResources}} object inside the {{objects.Node}} type. If the 
> tracking somehow gets out of sync with the actual cluster state, it's very 
> difficult to know what went wrong, because these allocations are not shown in 
> the state dump.
> To improve supportability, we want to track all non-YK pods per node 
> on the core side.






[jira] [Resolved] (YUNIKORN-2896) [shim] Remove occupiedResource handling logic

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2896.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> [shim] Remove occupiedResource handling logic
> -
>
> Key: YUNIKORN-2896
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2896
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Resolved] (YUNIKORN-2924) [core] Remove occupiedResource handling logic

2024-10-17 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2924.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Resolving as this was merged to master.

> [core] Remove occupiedResource handling logic
> -
>
> Key: YUNIKORN-2924
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2924
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Comment Edited] (YUNIKORN-2929) Implement Skip Allocation Check for Unsuccessful Pods

2024-10-15 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889961#comment-17889961
 ] 

Craig Condit edited comment on YUNIKORN-2929 at 10/16/24 6:39 AM:
--

I don’t think this is wise. It’s tempting to look at this from a Spark-centric 
perspective, but this pattern could be detrimental to other application types. 
There are also potentially wide-ranging side effects from aborting a scheduling 
cycle and restarting too quickly. I’m not in favor of this change.


was (Author: ccondit):
I don’t think this is wise. It’s tempting to look at this from a Spark-centric 
perspective, but this pattern could be detrimental to other application types. 
There are also potentially wide-ranging side effects from shorting a scheduling 
cycle and restarting too quickly. I’m not in favor of this change.

>  Implement Skip Allocation Check for Unsuccessful Pods
> --
>
> Key: YUNIKORN-2929
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2929
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - scheduler
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Major
>
> Skip allocation attempts for subsequent pods in an application if previous 
> pods have failed to allocate.
> When running Spark applications, if an executor pod fails to find a suitable 
> node, it is likely that subsequent executor pods will also fail to find 
> nodes. This is particularly problematic when the application has a toleration 
> for a specific taint and there are limited nodes with that taint. The 
> scheduler spends excessive time attempting to allocate pods, ultimately 
> resulting in no pods being bound to nodes.
> To optimize scheduling, we should:
>  # Implement a check to determine if previous pods in the same application 
> were successfully allocated.
>  # Skip processing other pods in the application if previous pods failed to 
> allocate.
>  # Generalize this by:
>  ** Adding an immediate action for Spark applications.
>  ** Introducing a threshold ('n' number of pods) after which the scheduler 
> will stop trying and restart the scheduling cycle.
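
The threshold idea in the last bullet can be sketched roughly as follows. This is only an illustration of the proposal as described, not YuniKorn code: the function name `allocateWithThreshold`, the `try` callback, and the pod names are all hypothetical.

```go
package main

import "fmt"

// allocateWithThreshold is a hypothetical sketch of the proposal: it tries
// each pod of an application in order and stops once n attempts have failed.
// It returns the pods that were placed and the number of pods that were
// skipped without an attempt.
func allocateWithThreshold(pods []string, n int, try func(pod string) bool) (placed []string, skipped int) {
	failures := 0
	for i, pod := range pods {
		if failures >= n {
			// Threshold reached: give up on the remaining pods.
			return placed, len(pods) - i
		}
		if try(pod) {
			placed = append(placed, pod)
		} else {
			failures++
		}
	}
	return placed, 0
}

func main() {
	pods := []string{"exec-1", "exec-2", "exec-3", "exec-4", "exec-5"}
	// Pretend only the first executor can find a node.
	try := func(pod string) bool { return pod == "exec-1" }

	placed, skipped := allocateWithThreshold(pods, 2, try)
	fmt.Println(placed, skipped)
}
```

Note that the skipped pods are never evaluated in this sketch, even if a node would have fit them, which is the kind of side effect on other application types that the comment above is concerned about.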






[jira] [Commented] (YUNIKORN-2929) Implement Skip Allocation Check for Unsuccessful Pods

2024-10-15 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889961#comment-17889961
 ] 

Craig Condit commented on YUNIKORN-2929:


I don’t think this is wise. It’s tempting to look at this from a Spark-centric 
perspective, but this pattern could be detrimental to other application types. 
There are also potentially wide-ranging side effects from shorting a scheduling 
cycle and restarting too quickly. I’m not in favor of this change.

>  Implement Skip Allocation Check for Unsuccessful Pods
> --
>
> Key: YUNIKORN-2929
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2929
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: core - scheduler
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Major
>
> Skip allocation attempts for subsequent pods in an application if previous 
> pods have failed to allocate.
> When running Spark applications, if an executor pod fails to find a suitable 
> node, it is likely that subsequent executor pods will also fail to find 
> nodes. This is particularly problematic when the application has a toleration 
> for a specific taint and there are limited nodes with that taint. The 
> scheduler spends excessive time attempting to allocate pods, ultimately 
> resulting in no pods being bound to nodes.
> To optimize scheduling, we should:
>  # Implement a check to determine if previous pods in the same application 
> were successfully allocated.
>  # Skip processing other pods in the application if previous pods failed to 
> allocate.
>  # Generalize this by:
>  ** Adding an immediate action for Spark applications.
>  ** Introducing a threshold ('n' number of pods) after which the scheduler 
> will stop trying and restart the scheduling cycle.






[jira] [Commented] (YUNIKORN-2910) Data corruption due to insufficient shim context locking

2024-10-11 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888761#comment-17888761
 ] 

Craig Condit commented on YUNIKORN-2910:


I've started doing some log analysis of these. I haven't narrowed down the root 
cause yet, but this is interesting:
{quote}2024-10-10T22:36:37.882Z        INFO    shim.cache.external     
external/scheduler_cache.go:311 Adding occupied resources to node       
{"nodeID": "amp-dp-prod-spark-exec-yk-1-node-group-b74a85d-h77rt", 
"namespace": "spark-system", "podName": 
"spark-history-server-deployment-078u1pfr-579dbd4b6d-6p6fz", "occupied": 
"resources:{key:\"ephemeral-storage\" value:{value:5368709120}} 
resources:{key:\"memory\" value:{value:75161927680}} resources:{key:\"pods\" 
value:{value:1}} resources:{key:\"vcore\" value:{value:2000}}"}
2024-10-10T22:36:37.882Z        WARN    core.scheduler.node     
objects/node.go:216     Node update triggered over allocated node       
{"available": "map[ephemeral-storage:1386189349332 memory:-60014637056 
pods:724 vcore:14200 vpc.amazonaws.com/pod-eni:107]", "total": 
"map[ephemeral-storage:1448466375124 hugepages-1Gi:0 hugepages-2Mi:0 
memory:523482255360 pods:737 vcore:63770 vpc.amazonaws.com/pod-eni:107]", 
"occupied": "map[ephemeral-storage:5368709120 memory:75214356480 pods:6 
vcore:2100]", "allocated": "map[ephemeral-storage:56908316672 
memory:508282535936 pods:7 vcore:47470]"}
{quote}
This would seem to indicate a bug on our end, but in fact it's correct. We 
receive an occupied resource update (for a non-YuniKorn pod) which blows past 
the node limits and over-allocates memory on the node by ~60 GB. Just prior to 
receiving that, we schedule a bunch of Spark executors on that node. Because 
the Spark history server is scheduled by a non-YuniKorn scheduler, we have a 
case where two schedulers both try to claim resources on the same node, and we 
over-allocate. There's no avoiding this due to the async nature of 
communication with the API server. What's interesting is that this situation 
never gets resolved. My guess is that KWOK's fake nodes don't reject placements 
with OutOfMemory or OutOfCPU like normal nodes. We don't see the allocations go 
away until the node is decommissioned later. In a real cluster, the pod 
rejections come back almost immediately.
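
As a quick check of the numbers in the WARN line above: the logged values are consistent with available = total - occupied - allocated for every resource shown. The snippet below is just that arithmetic on the memory values; the helper name `available` is made up for illustration and is not taken from the YuniKorn code.

```go
package main

import "fmt"

// available returns what is left on a node after subtracting the resources
// occupied by pods from other schedulers and the resources allocated by
// YuniKorn. A negative result means the node is over-allocated.
func available(total, occupied, allocated int64) int64 {
	return total - occupied - allocated
}

func main() {
	// Memory values in bytes, copied from the WARN log line above.
	const (
		total     int64 = 523482255360
		occupied  int64 = 75214356480
		allocated int64 = 508282535936
	)
	fmt.Println(available(total, occupied, allocated)) // -60014637056
}
```

The same subtraction reproduces the logged pods (737 - 6 - 7 = 724) and vcore (63770 - 2100 - 47470 = 14200) values.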

> Data corruption due to insufficient shim context locking
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.1
>
> Attachments: logs-1.6.0+2910, logs-1.6.0+2910+scale-down, 
> state-dump-1.6-context-locking-after-2.json, state-dump-after-1.5.2.json, 
> state-dump-after-1.6.0+2910.json
>
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.
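
A minimal sketch of the idea, assuming a single lock shared by all event handlers. The `Context` type and its methods here are hypothetical stand-ins and not the actual shim code; the point is only that pod and node handlers take the same lock, so they cannot run at the same time.

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a hypothetical stand-in for the shim context.
type Context struct {
	lock  sync.Mutex
	nodes map[string]int // node name -> number of pods assigned
}

func NewContext() *Context {
	return &Context{nodes: make(map[string]int)}
}

// AddNode handles a node event.
func (c *Context) AddNode(name string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.nodes[name]; !ok {
		c.nodes[name] = 0
	}
}

// AddPod handles a pod event. It takes the same lock as AddNode, so the two
// event types are serialized instead of interleaving.
func (c *Context) AddPod(node string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.nodes[node]++
}

// PodCount returns the number of pods recorded for a node.
func (c *Context) PodCount(node string) int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return c.nodes[node]
}

func main() {
	ctx := NewContext()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); ctx.AddNode("node-1") }()
		go func() { defer wg.Done(); ctx.AddPod("node-1") }()
	}
	wg.Wait()
	fmt.Println(ctx.PodCount("node-1"))
}
```

Without the lock, the concurrent map writes in this sketch would be a data race, which is a small-scale version of the corruption described above.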






[jira] [Resolved] (YUNIKORN-2917) Add additional buckets for latency histograms

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2917.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Add additional buckets for latency histograms
> -
>
> Key: YUNIKORN-2917
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2917
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The scheduling latency histograms are defined with a starting bucket of 
> 0.0001 (0.1 ms), a multiplier of 10, and a total of 6 buckets. This gives 
> upper bounds of [0.0001, 0.001, 0.01, 0.1, 1, 10, +Inf]. We should extend 
> this to 8 buckets so that latencies in the 100 and 1000 second ranges can be 
> discerned.






[jira] [Resolved] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2916.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Fix inconsistent resource bucketing in metrics
> --
>
> Key: YUNIKORN-2916
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2916
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The metrics reporting for resource usage has histogram buckets for each 10% 
> window (0-10%, 10-20%, etc.). However, the 10-20% bucket has inconsistent 
> formatting, leading to potential confusion:
>  
> {code:java}
> var resourceUsageRangeBuckets = []string{
>     "[0,10%]",
>     "(10%, 20%]", // extra space here
>     "(20%,30%]",
>     // ...
> }
> {code}
>  
>  






[jira] [Updated] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2916:
---
Priority: Minor  (was: Major)

> Fix inconsistent resource bucketing in metrics
> --
>
> Key: YUNIKORN-2916
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2916
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The metrics reporting for resource usage has histogram buckets for each 10% 
> window (0-10%, 10-20%, etc.). However, the 10-20% bucket has inconsistent 
> formatting, leading to potential confusion:
>  
> {code:java}
> var resourceUsageRangeBuckets = []string{
>     "[0,10%]",
>     "(10%, 20%]", // extra space here
>     "(20%,30%]",
>     // ...
> }
> {code}
>  
>  






[jira] [Resolved] (YUNIKORN-2915) Scheduling latency metric should update even if no scheduling occurs

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2915.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Scheduling latency metric should update even if no scheduling occurs
> 
>
> Key: YUNIKORN-2915
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2915
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The scheduler metric scheduling_latency_milliseconds is currently only 
> updated if an allocation actually occurs. This is not particularly useful: a 
> scheduling pass can take a long time, but if no allocation was possible 
> after traversing all queues, nothing is reported, so the visible latency is 
> 0. This makes the metric difficult to use for analysis, because in a busy 
> cluster scheduling can be very slow and yet not show up on the histogram at all.
> We should move the reporting of scheduling latency outside the check for an 
> allocation and report even scheduling runs with no result.






[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-10 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888460#comment-17888460
 ] 

Craig Condit commented on YUNIKORN-2895:


Reopened to keep the discussion of the issue going. I'm not sure that the 
description or possible causes are accurate at this point.

> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
>
> While revisiting the new update-allocation logic, I found that an allocation 
> can be added to a node twice if it has already been allocated: we try to add 
> the allocation to the node again and don't revert it.






[jira] [Reopened] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit reopened YUNIKORN-2895:


> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
>
> While revisiting the new update-allocation logic, I found that an allocation 
> can be added to a node twice if it has already been allocated: we try to add 
> the allocation to the node again and don't revert it.






[jira] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-10 Thread Craig Condit (Jira)


[ https://issues.apache.org/jira/browse/YUNIKORN-2895 ]


Craig Condit deleted comment on YUNIKORN-2895:


was (Author: ccondit):
Closing this as it's not an issue that can occur in practice.

> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
>
> While revisiting the new update-allocation logic, I found that an allocation 
> can be added to a node twice if it has already been allocated: we try to add 
> the allocation to the node again and don't revert it.






[jira] [Commented] (YUNIKORN-2910) Data corruption due to insufficient shim context locking

2024-10-10 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888458#comment-17888458
 ] 

Craig Condit commented on YUNIKORN-2910:


[~shravan-achar] I think we have a second issue in play then. Do you happen to 
have a log dump from 1.6 with this patch?

> Data corruption due to insufficient shim context locking
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.1
>
> Attachments: state-dump-1.6-context-locking-after-2.json, 
> state-dump-after-1.5.2.json
>
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.






[jira] [Created] (YUNIKORN-2917) Add additional buckets for latency histograms

2024-10-10 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2917:
--

 Summary: Add additional buckets for latency histograms
 Key: YUNIKORN-2917
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2917
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Craig Condit
Assignee: Craig Condit


The scheduling latency histograms are defined with a starting bucket of 0.0001 
(0.1 ms), a multiplier of 10, and a total of 6 buckets. This gives upper bounds 
of [0.0001, 0.001, 0.01, 0.1, 1, 10, +Inf]. We should extend this to 8 
buckets so that latencies in the 100 and 1000 second ranges can be discerned.
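
To make the bucket arithmetic concrete, here is a small stand-alone sketch that generates the upper bounds for 6 and for 8 buckets. The helper `exponentialBuckets` is written here for illustration only; it takes the same kind of arguments as `prometheus.ExponentialBuckets(start, factor, count)` from client_golang, but whether the scheduler defines its buckets through that call is an assumption.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step. The implicit +Inf bucket is not
// included in the result.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	next := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, next)
		next *= factor
	}
	return buckets
}

func main() {
	// 6 buckets top out at 10; 8 buckets add the 100 and 1000 bounds.
	for _, count := range []int{6, 8} {
		fmt.Println(count, exponentialBuckets(0.0001, 10, count))
	}
}
```

With 6 buckets, anything slower than 10 lands in +Inf; with 8, the last two finite bounds are 100 and 1000 (up to floating-point rounding).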






[jira] [Created] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics

2024-10-10 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2916:
--

 Summary: Fix inconsistent resource bucketing in metrics
 Key: YUNIKORN-2916
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2916
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Craig Condit
Assignee: Craig Condit


The metrics reporting for resource usage has histogram buckets for each 10% 
window (0-10%, 10-20%, etc.). However, the 10-20% bucket has inconsistent 
formatting, leading to potential confusion:

 
{code:java}
var resourceUsageRangeBuckets = []string{
    "[0,10%]",
    "(10%, 20%]", // extra space here
    "(20%,30%]",
    // ...
}
{code}
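
One way to rule out this kind of inconsistency is to generate the labels instead of typing them by hand. The sketch below is only an illustration: the function name `usageBucketLabels` is made up, and the assumption that the elided entries continue in 10% steps up to 100% is mine, since the snippet above only shows the first three.

```go
package main

import "fmt"

// usageBucketLabels builds labels for ten 10%-wide windows. The first window
// is closed on both ends; the rest are half-open. No label has a space after
// the comma, so every entry follows the same pattern.
func usageBucketLabels() []string {
	labels := make([]string, 0, 10)
	labels = append(labels, "[0,10%]")
	for lower := 10; lower < 100; lower += 10 {
		labels = append(labels, fmt.Sprintf("(%d%%,%d%%]", lower, lower+10))
	}
	return labels
}

func main() {
	for _, label := range usageBucketLabels() {
		fmt.Println(label)
	}
}
```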
 

 






[jira] [Created] (YUNIKORN-2915) Scheduling latency metric should update even if no scheduling occurs

2024-10-10 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2915:
--

 Summary: Scheduling latency metric should update even if no 
scheduling occurs
 Key: YUNIKORN-2915
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2915
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Craig Condit
Assignee: Craig Condit


The scheduler metric scheduling_latency_milliseconds is currently only updated 
if an allocation actually occurs. This is not particularly useful: a scheduling 
pass can take a long time, but if no allocation was possible after traversing 
all queues, nothing is reported, so the visible latency is 0. This makes the 
metric difficult to use for analysis, because in a busy cluster scheduling can 
be very slow and yet not show up on the histogram at all.

We should move the reporting of scheduling latency outside the check for an 
allocation and report even scheduling runs with no result.
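
A minimal sketch of the intended change, under the assumption that a scheduling pass is a single function returning whether it allocated anything. The `latencyRecorder` type and `scheduleOnce` function are hypothetical; a slice stands in for the real histogram so the example stays self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// latencyRecorder is a hypothetical stand-in for the latency histogram.
type latencyRecorder struct {
	observations []time.Duration
}

func (r *latencyRecorder) observe(d time.Duration) {
	r.observations = append(r.observations, d)
}

// scheduleOnce runs one scheduling pass. tryAllocate reports whether an
// allocation was made during the pass.
func scheduleOnce(r *latencyRecorder, tryAllocate func() bool) bool {
	start := time.Now()
	allocated := tryAllocate()
	// The latency is recorded unconditionally, not inside an
	// "if allocated" check, so passes with no result are visible too.
	r.observe(time.Since(start))
	return allocated
}

func main() {
	r := &latencyRecorder{}
	scheduleOnce(r, func() bool { return true })  // pass with an allocation
	scheduleOnce(r, func() bool { return false }) // pass with no result
	fmt.Println(len(r.observations))
}
```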






[jira] [Commented] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed

2024-10-10 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888422#comment-17888422
 ] 

Craig Condit commented on YUNIKORN-2908:


YUNIKORN-2855 had an incomplete fix. As we've looked at it further, it is 
subtly broken – it doesn't take into account the {{state}} parameter when 
calculating the already-seen resources. We should probably rebuild that 
functionality to use the built-in {{Describe()}} method to iterate all the 
existing values and remove those where the state matches but we don't have a 
new value. This is not a simple change.
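
To illustrate the cleanup rule described here (remove series where the state matches but no new value exists), the following sketch uses a plain map in place of a real metrics vector. Everything in it is hypothetical: the `metricKey` type, the `setQueueResources` function, and the example state and resource names are not taken from the YuniKorn code.

```go
package main

import "fmt"

// metricKey identifies one series by its state and resource name.
type metricKey struct {
	state    string // for example "guaranteed" or "max"
	resource string // for example "memory" or "vcore"
}

// setQueueResources replaces all series for the given state with newValues.
// Series for the same state that have no new value are removed; series for
// other states are left untouched.
func setQueueResources(series map[metricKey]float64, state string, newValues map[string]float64) {
	for key := range series {
		if key.state != state {
			continue // a different state is not ours to clean up
		}
		if _, ok := newValues[key.resource]; !ok {
			delete(series, key) // stale: state matches, no new value
		}
	}
	for resource, value := range newValues {
		series[metricKey{state: state, resource: resource}] = value
	}
}

func main() {
	series := map[metricKey]float64{
		{state: "guaranteed", resource: "memory"}: 100,
		{state: "guaranteed", resource: "vcore"}:  10,
		{state: "max", resource: "memory"}:        200,
	}
	// The new config only keeps a guaranteed memory value.
	setQueueResources(series, "guaranteed", map[string]float64{"memory": 50})
	fmt.Println(len(series), series[metricKey{state: "guaranteed", resource: "memory"}])
}
```

Keying on both state and resource is what keeps the "max" series alive when only the "guaranteed" values change.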

> metrics not removed when queue or queue's guaranteed/max resource config is 
> removed
> ---
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Hengzhe Guo
>Assignee: Hengzhe Guo
>Priority: Major
>
> 1. After a queue is removed, its metrics continue to be reported by 
> Prometheus. This is fine for metrics like allocated resources because they 
> will just be 0, but it doesn't make sense for guaranteed and max resources, 
> as it gives the wrong impression that resources are still assigned to the 
> queue. I propose unregistering all of the queue's metrics when it is removed.
> 2. If the queue is not removed but its guaranteed or max resource config is 
> removed, or just a resource type is removed from the config, the metrics are 
> also not cleaned up. These metrics are only updated when there is a new valid 
> value, not when the value is 'null'. I propose always deleting all existing 
> guaranteed and max resource metrics of the queue and then adding back the new 
> values every time we apply the configs.






[jira] [Closed] (YUNIKORN-2914) Update deployment documentation for extra description for hot reload

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2914.
--
Target Version:   (was: 1.7.0)
Resolution: Not A Problem

Closing as the current documentation is correct.

> Update deployment documentation for extra description for hot reload
> 
>
> Key: YUNIKORN-2914
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2914
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yao
>Assignee: Yao
>Priority: Minor
>  Labels: pull-request-available
>
> I just saw a user asking about ConfigMap hot reload in the Slack channel; 
> see 
> [[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1728327140557209]].
> I also checked the hot refresh part of the YuniKorn docs, and I think not 
> everyone knows that if the ConfigMap is mounted using a subPath, hot 
> reload will not be triggered.
> Therefore, I want to add an extra description of this behavior.






[jira] [Closed] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2895.
--
Target Version:   (was: 1.7.0)
Resolution: Not A Problem

Closing this as it's not an issue that can occur in practice.

> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
>
> While revisiting the new update allocation logic, I found that a 
> duplicate allocation can be added to a node if the allocation is already 
> allocated: we try to add the allocation to the node again and don't 
> revert it.






[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-10 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888355#comment-17888355
 ] 

Craig Condit commented on YUNIKORN-2895:


{quote}I think the issue is located in the maintenance of the 
{{sortedRequests}} on the application. That list used to be rebuilt each cycle 
but now we insert/delete from the slice. During recovery I think we broke 
things. Recovery is using the same path as a node addition so this *could* 
happen on any node add or maybe even on a simple add of a new ask.
{quote}
This is all controlled from the shim by ensuring that nodes are added first, 
and then referenced pods are added afterwards. The regression fixed in 
YUNIKORN-2910 should address this. With that fix, it's not possible for these 
to be handled out-of-order.

Additionally, the {{sortedRequests}} slice is only ever updated in lockstep 
with the requests map (I have verified this in the code).
{quote}{*}First issue{*}: if the old ask _IS_ allocated we will still replace 
that allocation with the new one in the requests map. We skip adjusting the 
pending resources using the already registered ask. This is where it breaks 
down: the requests list should never contain already allocated objects. It 
means we have a reference leak, and thus a memory leak. Long after the 
allocation is removed a reference will be kept in requests that will not get 
removed until we clean up the application. The GC will thus not remove it. For 
long running applications with lots of requests this can become significant.
{quote}
This is false. We never replace the allocation; we check for an existing one 
and update as necessary. The new allocation passed in (from the SI) is only 
ever read from. It is *never* stored in requests (or allocatedRequests) unless 
that allocationKey had never been seen before.

The request list maintains all requests, whether satisfied or not, as does 
sortedRequests. This allows us to check for an allocation in only one place, 
and increases memory only by the size of a pointer for each one. After 1.6.0, 
asks and allocations are no longer distinct objects, so we are simply 
storing either 2 or 3 references to the same object: 2 in the case of 
not-yet-allocated (requests and sortedRequests), and 3 in the case of 
allocated (requests, sortedRequests, allocatedRequests). There is no memory 
leak. When an allocation goes away (if it goes into a terminal state), it is 
removed {*}from all three places{*}. This doesn't only happen on application 
termination. We have to keep the allocations around for the lifetime of their 
associated pods anyway due to things like mutable pod resources coming in the 
near future – an allocation's size can change after it has been scheduled.
{quote}{*}Second issue{*}: Caused by the replacement also. The new object is 
not marked allocated which causes a big problem as we will try and schedule it. 
We now could have an unallocated and an allocated object with the same key one 
in requests and one in allocations. After we schedule the second one the 
allocations list will be updated and we lose the original info.
{quote}
This is also false. We never *replace* an allocation object if one already 
exists. We examine the state and may update resource requests or transition to 
allocated state based on the deltas between the existing and new objects. This 
is what allows the shim to notify us that an allocation has been satisfied 
outside YuniKorn.
{quote}{*}Third issue{*}: independent of the state we proceed to add the ask to 
the requests. The requests are stored in a map based on the allocation key, 
which means we are always only tracking a single ask. Never any duplicates. The 
sorted requests however is a sorted slice of references to objects. There are no 
checks in the add into the sorted request slice to replace the existing entry. 
We will happily add a second one to the slice. Two objects with the same key are 
both considered when scheduling, which means we can easily cause issues there.
{quote}
Also false. We check the state very carefully and only add to requests (and 
sortedRequests) if we *have never seen that allocationKey.* The only place 
sortedRequests is added to is when requests didn't contain that object. The 
checks are not needed in sortedRequests because the pre-checks before updating 
in the requests map already ensure that duplicates don't happen.
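
The bookkeeping described in this comment can be illustrated with a small Go 
sketch. It is not the yunikorn-core implementation: the type and method names 
(Allocation, App, Update, Remove) are hypothetical and chosen only to show the 
two invariants claimed above. An incoming object is stored only when its 
allocation key has never been seen, and a removal clears all three places.

```go
package main

import (
	"fmt"
	"sort"
)

// Allocation is a simplified, hypothetical stand-in for the core's
// allocation object (asks and allocations are one type after 1.6.0).
type Allocation struct {
	Key       string
	Priority  int
	Allocated bool
}

// App tracks every request by key, in priority order, and by allocated state.
type App struct {
	requests          map[string]*Allocation
	sortedRequests    []*Allocation
	allocatedRequests map[string]*Allocation
}

func NewApp() *App {
	return &App{
		requests:          map[string]*Allocation{},
		allocatedRequests: map[string]*Allocation{},
	}
}

// Update stores the incoming object only if its key has never been seen.
// Otherwise the existing object is updated in place and the incoming one
// is only read from, so sortedRequests can never receive a duplicate.
func (a *App) Update(in *Allocation) {
	if existing, ok := a.requests[in.Key]; ok {
		if in.Allocated && !existing.Allocated {
			existing.Allocated = true
			a.allocatedRequests[existing.Key] = existing
		}
		return
	}
	a.requests[in.Key] = in
	// Insert into the slice, keeping it sorted by descending priority.
	i := sort.Search(len(a.sortedRequests), func(i int) bool {
		return a.sortedRequests[i].Priority < in.Priority
	})
	a.sortedRequests = append(a.sortedRequests, nil)
	copy(a.sortedRequests[i+1:], a.sortedRequests[i:])
	a.sortedRequests[i] = in
	if in.Allocated {
		a.allocatedRequests[in.Key] = in
	}
}

// Remove drops the allocation from all three places at once.
func (a *App) Remove(key string) {
	if _, ok := a.requests[key]; !ok {
		return
	}
	delete(a.requests, key)
	delete(a.allocatedRequests, key)
	for i, r := range a.sortedRequests {
		if r.Key == key {
			a.sortedRequests = append(a.sortedRequests[:i], a.sortedRequests[i+1:]...)
			break
		}
	}
}

func main() {
	app := NewApp()
	app.Update(&Allocation{Key: "pod-1", Priority: 1})
	// Same key again, now reported as allocated: no second entry is created.
	app.Update(&Allocation{Key: "pod-1", Priority: 1, Allocated: true})
	fmt.Println(len(app.requests), len(app.sortedRequests), len(app.allocatedRequests))
	app.Remove("pod-1")
	fmt.Println(len(app.requests), len(app.sortedRequests), len(app.allocatedRequests))
}
```

Because the map lookup guards the only insertion into the slice, the slice 
itself needs no duplicate check, which is the point made in the last paragraph 
of the comment.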

 

> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>
> When i try to revisit the new u

[jira] [Resolved] (YUNIKORN-2910) Data corruption due to insufficient shim context locking

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2910.

Fix Version/s: 1.7.0
   1.6.1
   Resolution: Fixed

Merged to master and cherry-picked to branch-1.6.

> Data corruption due to insufficient shim context locking
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.1
>
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.






[jira] [Updated] (YUNIKORN-2910) Data corruption due to insufficient shim context locking

2024-10-10 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2910:
---
Target Version: 1.7.0, 1.6.1  (was: 1.7.0)

> Data corruption due to insufficient shim context locking
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>  Labels: pull-request-available
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.






[jira] [Commented] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed

2024-10-09 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888096#comment-17888096
 ] 

Craig Condit commented on YUNIKORN-2908:


This is actually much more complex than it initially appears. We should split 
this Jira up into separate tasks for queue deletion and guaranteed / pending / 
max changing. The queue removal can simply be removal of the entire metrics 
object. The dynamic updates for the other metrics are more complex. I'd prefer 
to take that task myself.

> metrics not removed when queue or queue's guaranteed/max resource config is 
> removed
> ---
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Hengzhe Guo
>Assignee: Hengzhe Guo
>Priority: Major
>
> 1. After a queue is removed, its metrics continue to be reported by 
> Prometheus. This is fine for metrics like allocated resources because they 
> will just be 0, but it doesn't make sense for guaranteed and max resources, 
> as it gives the wrong impression that resources are still given to the queue. 
> I propose to unregister all of the queue's metrics when it's removed.
> 2. If the queue is not removed but its guaranteed or max resource config is 
> removed, or just a resource type is removed from the config, the metrics are 
> also not cleaned up. These metrics are only updated when there's a new valid 
> value, not when the value is 'null'. I propose to always delete all existing 
> guaranteed and max resource metrics of the queue and then add back the new 
> values every time we apply the configs.
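
The second proposal in the description (delete, then re-add) could look roughly 
like the Go sketch below. It is only an illustration of the idea under 
discussion, not the fix that was eventually merged: a plain map stands in for 
the Prometheus gauges, and QueueMetrics and ApplyConfig are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// QueueMetrics is a hypothetical stand-in for a per-queue gauge set,
// keyed by "<state>/<resource>", for example "guaranteed/memory".
type QueueMetrics struct {
	gauges map[string]float64
}

func NewQueueMetrics() *QueueMetrics {
	return &QueueMetrics{gauges: map[string]float64{}}
}

// ApplyConfig replaces all gauges of one state ("guaranteed" or "max")
// with the values from the new config. Resource types that are no longer
// configured disappear instead of keeping a stale value.
func (m *QueueMetrics) ApplyConfig(state string, resources map[string]float64) {
	prefix := state + "/"
	for key := range m.gauges {
		if strings.HasPrefix(key, prefix) {
			delete(m.gauges, key) // deleting while ranging over a map is safe in Go
		}
	}
	for name, value := range resources {
		m.gauges[prefix+name] = value
	}
}

func main() {
	m := NewQueueMetrics()
	m.ApplyConfig("max", map[string]float64{"memory": 1024, "vcore": 8})
	// The memory limit is dropped from the config on the next reload.
	m.ApplyConfig("max", map[string]float64{"vcore": 4})
	_, stale := m.gauges["max/memory"]
	fmt.Println("stale memory gauge:", stale, "vcore:", m.gauges["max/vcore"])
}
```

Passing a nil or empty map for a state clears every gauge of that state, which 
covers the case where the whole guaranteed or max section is removed.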






[jira] [Comment Edited] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed

2024-10-09 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888096#comment-17888096
 ] 

Craig Condit edited comment on YUNIKORN-2908 at 10/9/24 11:56 PM:
--

This is actually much more complex than it initially appears. We should split 
this Jira up into separate tasks for queue deletion and guaranteed / pending / 
max changing. The queue removal can simply be removal of the entire metrics 
object. The dynamic updates for the other metrics are more complex. I'd prefer 
to take that task myself.

 

[~hguo25] can you split this out please?


was (Author: ccondit):
This is actually much more complex than it initially appears. We should split 
this Jira up into separate tasks for queue deletion and guaranteed / pending / 
max changing. The queue removal can simply be removal of the entire metrics 
object. The dynamic updates for the other metrics are more complex. I'd prefer 
to take that task myself.

> metrics not removed when queue or queue's guaranteed/max resource config is 
> removed
> ---
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Hengzhe Guo
>Assignee: Hengzhe Guo
>Priority: Major
>
> 1. After a queue is removed, its metrics continue to be reported by 
> Prometheus. This is fine for metrics like allocated resources because they 
> will just be 0, but it doesn't make sense for guaranteed and max resources, 
> as it gives the wrong impression that resources are still given to the queue. 
> I propose to unregister all of the queue's metrics when it's removed.
> 2. If the queue is not removed but its guaranteed or max resource config is 
> removed, or just a resource type is removed from the config, the metrics are 
> also not cleaned up. These metrics are only updated when there's a new valid 
> value, not when the value is 'null'. I propose to always delete all existing 
> guaranteed and max resource metrics of the queue and then add back the new 
> values every time we apply the configs.






[jira] [Updated] (YUNIKORN-2911) Add kind-e2e makefile target

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2911:
---
Priority: Minor  (was: Major)

> Add kind-e2e makefile target
> 
>
> Key: YUNIKORN-2911
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2911
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Minor
>  Labels: pull-request-available
>
> Add a simple kind-e2e Makefile target to yunikorn-k8shim. This would spin up 
> a kind cluster (on the latest version), run the e2e tests, then tear down the 
> cluster. This would be easier (especially for new users) than running the 
> scripts/run-e2e-tests.sh script directly.






[jira] [Resolved] (YUNIKORN-2754) Update grafana UI in yunikorn-metric docs

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2754.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Update grafana UI in yunikorn-metric docs 
> --
>
> Key: YUNIKORN-2754
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2754
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chen Yu Teng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>







[jira] [Updated] (YUNIKORN-2754) Update grafana UI in yunikorn-metric docs

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2754:
---
Summary: Update grafana UI in yunikorn-metric docs   (was: Update doc 
grafana UI of yunikorn-metric)

> Update grafana UI in yunikorn-metric docs 
> --
>
> Key: YUNIKORN-2754
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2754
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chen Yu Teng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (YUNIKORN-2723) Wordwrap queuename in QueuesV2 page

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2723:
---
Summary: Wordwrap queuename in QueuesV2 page  (was: Wordwrap queuename in 
QueuesV2 (Beta) page)

> Wordwrap queuename in QueuesV2 page
> ---
>
> Key: YUNIKORN-2723
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2723
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Manikandan R
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Please see attached image (captured from Mac M1 chrome)






[jira] [Resolved] (YUNIKORN-2723) Wordwrap queuename in QueuesV2 (Beta) page

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2723.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Wordwrap queuename in QueuesV2 (Beta) page
> --
>
> Key: YUNIKORN-2723
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2723
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: webapp
>Reporter: Manikandan R
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Please see attached image (captured from Mac M1 chrome)






[jira] [Updated] (YUNIKORN-2910) Data corruption due to insufficient shim context locking

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2910:
---
Summary: Data corruption due to insufficient shim context locking  (was: 
Multiple events may be processed by shim context)

> Data corruption due to insufficient shim context locking
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.






[jira] [Updated] (YUNIKORN-2910) Multiple events may be processed by shim context

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2910:
---
Priority: Blocker  (was: Major)

> Multiple events may be processed by shim context
> 
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.6.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Blocker
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.






[jira] [Created] (YUNIKORN-2910) Multiple events may be processed by shim context

2024-10-09 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2910:
--

 Summary: Multiple events may be processed by shim context
 Key: YUNIKORN-2910
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Affects Versions: 1.6.0
Reporter: Craig Condit
Assignee: Craig Condit


We need to restore the context locking that was removed in YUNIKORN-2629. 
Without it, multiple K8s events of different types may be processed in 
parallel. Specifically, pod and node events being processed simultaneously is 
not safe, and results in data corruption.
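
As a rough illustration of why a shared lock matters here, the Go sketch below 
routes pod and node handlers through one mutex. It is an assumption-laden 
sketch, not the actual shim code: Context, AddNode and AddPod are simplified, 
hypothetical stand-ins, and the real context holds far more state.

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a hypothetical, simplified stand-in for the shim context.
type Context struct {
	lock  sync.Mutex
	nodes map[string]bool
	pods  map[string]string // pod name -> node name
}

func NewContext() *Context {
	return &Context{nodes: map[string]bool{}, pods: map[string]string{}}
}

// AddNode and AddPod take the same lock, so a pod event can never observe
// a half-applied node event, whichever goroutine delivers it.
func (c *Context) AddNode(name string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.nodes[name] = true
}

// AddPod records the pod only if its node is already known.
func (c *Context) AddPod(pod, node string) bool {
	c.lock.Lock()
	defer c.lock.Unlock()
	if !c.nodes[node] {
		return false
	}
	c.pods[pod] = node
	return true
}

func main() {
	ctx := NewContext()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		name := fmt.Sprintf("node-%d", i)
		wg.Add(2)
		// Events of different types arrive on different goroutines.
		go func() { defer wg.Done(); ctx.AddNode(name) }()
		go func() { defer wg.Done(); ctx.AddPod("pod-on-"+name, name) }()
	}
	wg.Wait()
	fmt.Println("nodes:", len(ctx.nodes), "pods tracked:", len(ctx.pods))
}
```

Without the mutex, the two handlers would write to the maps concurrently, which 
is a data race in Go. The number of pods tracked varies from run to run because 
the lock serializes handlers but does not order them; ordering nodes before 
pods is a separate concern.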






[jira] [Resolved] (YUNIKORN-2753) Update yunikorn-metrics grafana dashboard

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2753.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Update yunikorn-metrics grafana dashboard
> -
>
> Key: YUNIKORN-2753
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2753
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chen Yu Teng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard






[jira] [Updated] (YUNIKORN-2753) Update yunikorn-metrics grafana dashboard

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2753:
---
Summary: Update yunikorn-metrics grafana dashboard  (was: Update grafana 
context and json)

> Update yunikorn-metrics grafana dashboard
> -
>
> Key: YUNIKORN-2753
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2753
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chen Yu Teng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard






[jira] [Resolved] (YUNIKORN-2844) Inject event recorder externally

2024-10-09 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2844.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Inject event recorder externally
> 
>
> Key: YUNIKORN-2844
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2844
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The current implementation creates an event recorder like this:
> {noformat}
> func GetRecorder() events.EventRecorder {
>   lock.Lock()
>   defer lock.Unlock()
>   once.Do(func() {
>     // note, the initiation of the event recorder requires on a workable Kubernetes client,
>     // in test mode we should skip this and just use a fake recorder instead.
>     configs := conf.GetSchedulerConf()
>     if !configs.IsTestMode() {
>       k8sClient := client.NewKubeClient(configs.KubeConfig)
>       eventBroadcaster := events.NewBroadcaster(&events.EventSinkImpl{
>         Interface: k8sClient.GetClientSet().EventsV1()})
>       eventBroadcaster.StartRecordingToSink(make(<-chan struct{}))
>       eventRecorder = eventBroadcaster.NewRecorder(scheme.Scheme, constants.SchedulerName)
>     }
>   })
>   return eventRecorder
> }
> {noformat}
> The problem with this approach is that we need to indicate "test mode" in the 
> config, which just complicates things. 
> We can simplify this code if the recorder is set during YuniKorn 
> initialization, e.g. in {{NewShimScheduler()}}. The plugin code already does 
> this in {{NewSchedulerPlugin()}}, which calls 
> {{events.SetRecorder(handle.EventRecorder())}}.
> We should also get rid of the default fake recorder. It uses a buffered 
> channel with a size of 1024. This isn't a problem now, but if a new test 
> somehow ends up generating a lot of events, message sending will block, and it 
> might not be obvious why a new or existing unit test suddenly starts to 
> block.
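
A minimal Go sketch of the injection approach described above follows. It is 
not the code that was merged: Recorder is a reduced, hypothetical interface 
rather than the Kubernetes events.EventRecorder, and sliceRecorder is an 
assumed test double used only to show that an injected recorder needs neither a 
"test mode" flag nor a bounded channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Recorder is a hypothetical, reduced version of an event recorder interface.
type Recorder interface {
	Eventf(reason, format string, args ...interface{})
}

var (
	lock     sync.RWMutex
	recorder Recorder
)

// SetRecorder injects the recorder once at startup, or from a test.
func SetRecorder(r Recorder) {
	lock.Lock()
	defer lock.Unlock()
	recorder = r
}

// GetRecorder returns whatever was injected; no config flag is consulted.
func GetRecorder() Recorder {
	lock.RLock()
	defer lock.RUnlock()
	return recorder
}

// sliceRecorder is a test double that appends events to a slice, so it
// cannot block the way a full buffered channel would.
type sliceRecorder struct {
	mu     sync.Mutex
	events []string
}

func (s *sliceRecorder) Eventf(reason, format string, args ...interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, reason+": "+fmt.Sprintf(format, args...))
}

func main() {
	rec := &sliceRecorder{}
	SetRecorder(rec)
	// More events than a 1024-slot channel could hold without a reader.
	for i := 0; i < 5000; i++ {
		GetRecorder().Eventf("Scheduled", "pod-%d", i)
	}
	fmt.Println("recorded events:", len(rec.events))
}
```

In production code the same SetRecorder call would receive the real recorder 
during scheduler initialization, so tests and production differ only in what 
they inject.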






[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails

2024-10-09 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887898#comment-17887898
 ] 

Craig Condit commented on YUNIKORN-2895:


I suspect all of these issues trace back to this PR: 
[https://github.com/apache/yunikorn-k8shim/pull/859]

It would be very helpful if anyone who is currently seeing this could rebuild 
1.6.0 with the PR reverted and report back. 

> Don't add duplicated allocation to node when the allocation ask fails
> -
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>
> While revisiting the new update allocation logic, I found that a 
> duplicate allocation can be added to a node if the allocation is already 
> allocated: we try to add the allocation to the node again and don't 
> revert it.






[jira] [Updated] (YUNIKORN-2905) Update deployment documentation for make image

2024-10-08 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2905:
---
Summary: Update deployment documentation for make image  (was: Update 
deployment documentation for outdated make image command description)

> Update deployment documentation for make image
> --
>
> Key: YUNIKORN-2905
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2905
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yao
>Assignee: Yao
>Priority: Minor
>  Labels: pull-request-available
>
> When I tried to follow the deployment section, I found that the 
> description of the make image command was outdated.
> For example, the IMAGE_TAG variable no longer exists in the Makefile.
> Therefore, I referred to yunikorn-k8shim's README.md and its Makefile, and 
> rewrote the documentation to make it clearer.






[jira] [Resolved] (YUNIKORN-2905) Update deployment documentation for make image

2024-10-08 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2905.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Update deployment documentation for make image
> --
>
> Key: YUNIKORN-2905
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2905
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yao
>Assignee: Yao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> When I tried to follow the deployment section, I found that the 
> description of the make image command was outdated.
> For example, the IMAGE_TAG variable no longer exists in the Makefile.
> Therefore, I referred to yunikorn-k8shim's README.md and its Makefile, and 
> rewrote the documentation to make it clearer.






[jira] [Resolved] (YUNIKORN-2904) Add Helm download and usage links

2024-10-07 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2904.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Add Helm download and usage links
> -
>
> Key: YUNIKORN-2904
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2904
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yao
>Assignee: Yao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I am a newbie who just started using YuniKorn. Although I have some experience 
> with K8s, as far as I know, not everyone uses Helm charts to manage their 
> applications. 
> Therefore, I added a link and a short description of Helm and its 
> download page for those who don't know what Helm is.






[jira] [Updated] (YUNIKORN-2904) Add Helm download and usage links

2024-10-07 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2904:
---
Summary: Add Helm download and usage links  (was: Update get_started 
documentation for Helm and Helm's download link)

> Add Helm download and usage links
> -
>
> Key: YUNIKORN-2904
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2904
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yao
>Assignee: Yao
>Priority: Minor
>  Labels: pull-request-available
>
> I am a newbie who just started using YuniKorn. Although I have some experience 
> with K8s, as far as I know, not everyone uses Helm charts to manage their 
> applications. 
> Therefore, I added a link and a short description of Helm and its 
> download page for those who don't know what Helm is.






[jira] [Resolved] (YUNIKORN-2899) Update node version to 20.17 and update packages

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2899.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Update node version to 20.17 and update packages
> 
>
> Key: YUNIKORN-2899
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2899
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Hsien-Cheng(Ryan) Huang
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Minor
> Fix For: 1.7.0
>
>
> 1. Browserslist: caniuse-lite is outdated. Please run:
>   npx update-browserslist-db@latest
>   Why you should do it regularly: 
> https://github.com/browserslist/update-db#readme
> 2. update Docusaurus to v3.5.2
> 3. resolve security warnings
> 4. update node version to latest LTS version
> 5. add a `pnpm run serve` script for serving the local build.






[jira] [Updated] (YUNIKORN-2899) Update node version to 20.17 and update packages

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2899:
---
Summary: Update node version to 20.17 and update packages  (was: chore: 
update node version to LTS, update package and solve warnings)

> Update node version to 20.17 and update packages
> 
>
> Key: YUNIKORN-2899
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2899
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Hsien-Cheng(Ryan) Huang
>Assignee: Hsien-Cheng(Ryan) Huang
>Priority: Minor
>
> 1. Browserslist: caniuse-lite is outdated. Please run:
>   npx update-browserslist-db@latest
>   Why you should do it regularly: 
> https://github.com/browserslist/update-db#readme
> 2. update Docusaurus to v3.5.2
> 3. resolve security warnings
> 4. update node version to latest LTS version
> 5. add a `pnpm run serve` script for serving the local build.






[jira] [Resolved] (YUNIKORN-2885) Fix security vulnerabilities in dependencies

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2885.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Fix security vulnerabilities in dependencies
> 
>
> Key: YUNIKORN-2885
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2885
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: webapp
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> {{pnpm audit}} report: 
> [audit-report.md|https://github.com/user-attachments/files/17089735/audit-report.md]
> 26 vulnerabilities found
> Severity: 12 moderate | 14 high
> Audit report after upgrading to Angular v18 (#YUNIKORN-2861): 
> [audit-report.md|https://github.com/user-attachments/files/17164041/audit-report.md]
> 8 vulnerabilities found
> Severity: 3 moderate | 5 high






[jira] [Resolved] (YUNIKORN-2792) Create design doc

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2792.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Create design doc
> -
>
> Key: YUNIKORN-2792
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2792
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
> Attachments: YUNIKORN-2791.pdf
>
>







[jira] [Updated] (YUNIKORN-2892) Log correct termination type when releasing task in shim

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2892:
---
Summary:  Log correct termination type when releasing task in shim  (was: 
Log correct log for terminate type when releasing task from shim side)

>  Log correct termination type when releasing task in shim
> -
>
> Key: YUNIKORN-2892
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2892
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>
> Currently we log an empty termination type when releasing a task from the shim 
> side. We should log the real termination type instead.
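
A minimal Go sketch of the idea in the description above, under stated assumptions: `TerminationType`, its constants, and `releaseLogLine` are hypothetical names made up for illustration and are not the shim's real types. Only the pairing of the numeric value 1 with the label "STOPPED_BY_RM" is taken from log output quoted elsewhere in this archive; the fallback label is an assumption.

```go
package main

import "fmt"

// TerminationType is a hypothetical stand-in for the release reason carried
// in a release request. It is not the real YuniKorn type.
type TerminationType int

const (
	Unknown     TerminationType = iota // 0: no reason recorded (assumed name)
	StoppedByRM                        // 1: logged by the core as "STOPPED_BY_RM"
)

// String returns a non-empty label for every value, so a log field built
// from it can never be the empty string.
func (t TerminationType) String() string {
	switch t {
	case StoppedByRM:
		return "STOPPED_BY_RM"
	default:
		return "UNKNOWN_TERMINATION_TYPE" // assumed fallback label
	}
}

// releaseLogLine formats the fields logged before a release request is sent.
func releaseLogLine(taskID string, t TerminationType) string {
	return fmt.Sprintf("prepare to send release request taskID=%s terminationType=%s", taskID, t)
}

func main() {
	fmt.Println(releaseLogLine("task-1", StoppedByRM))
}
```

Deriving the log field from the same value that is placed in the request keeps the shim-side log consistent with what the core later reports.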






[jira] [Resolved] (YUNIKORN-2892) Log correct termination type when releasing task in shim

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2892.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

>  Log correct termination type when releasing task in shim
> -
>
> Key: YUNIKORN-2892
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2892
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
> Fix For: 1.7.0
>
>
> Currently we log an empty termination type when releasing a task from the shim 
> side. We should log the real termination type instead.






[jira] [Resolved] (YUNIKORN-2893) Apply "cancel-in-progress" feature for each PR in Github Action

2024-10-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2893.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged all PRs to master branches.

> Apply "cancel-in-progress" feature for each PR in Github Action
> ---
>
> Key: YUNIKORN-2893
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2893
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: build
>Reporter: Yu-Lin Chen
>Assignee: Tzu-Hua Lan
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently, when a newer commit is pushed to the same PR, the previous build 
> is not canceled. To save cost, we should apply the "cancel-in-progress" 
> setting to automatically cancel the previous build.
> REF:
> [https://docs.github.com/en/enterprise-cloud@latest/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs#using-concurrency-in-different-scenarios]
>  






[jira] [Commented] (YUNIKORN-2884) Task fail with post allocated but the pod will keep pending

2024-09-27 Thread Craig Condit (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885474#comment-17885474
 ] 

Craig Condit commented on YUNIKORN-2884:


I'm not sure terminating the task is how we should handle this, as that makes 
debugging (via events) difficult. We should be looking into how to recover from 
this and re-schedule the pod.
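
For orientation, the two task transitions visible in the shim log quoted below can be written as a small lookup table. This Go sketch is an illustration only: the state and event names are copied from that log, while `transition`, `next`, and `apply` are hypothetical names and the table is not the shim's full state machine. A recovery path of the kind suggested here would need a further transition that the log does not show, so none is included.

```go
package main

import "fmt"

// transition is keyed by (current state, event). The entries mirror only the
// two transitions visible in the quoted shim log, not the full task FSM.
type transition struct{ state, event string }

var next = map[transition]string{
	{"Scheduling", "TaskAllocated"}: "Allocated",
	{"Allocated", "TaskFail"}:       "Failed",
}

// apply returns the destination state, or the unchanged state and an error
// when the (state, event) pair is not in the table.
func apply(state, event string) (string, error) {
	dst, ok := next[transition{state, event}]
	if !ok {
		return state, fmt.Errorf("no transition from %q on %q", state, event)
	}
	return dst, nil
}

func main() {
	s := "Scheduling"
	for _, ev := range []string{"TaskAllocated", "TaskFail"} {
		s, _ = apply(s, ev)
		fmt.Println(ev, "->", s)
	}
}
```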

 

> Task fail with post allocated but the pod will keep pending
> ---
>
> Key: YUNIKORN-2884
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2884
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>
> We fail the task after allocation, but we don't update the pod to a terminal 
> state.
> For example, if binding the pod volumes fails after allocation, the task fails 
> but the pod does not move to a terminal state:
> Pod events:
> {code:java}
> Events:
>   Type     Reason                 Age   From      Message
>   ----     ------                 ----  ----      -------
>   Normal   Scheduling             30s   yunikorn  dev-nnjxy/pod-btv0y is queued and waiting for allocation
>   Normal   Scheduled              30s   yunikorn  Successfully assigned dev-nnjxy/pod-btv0y to node yktest-worker
>   Warning  PodVolumesBindFailure  20s   yunikorn  bind volumes to pod failed, name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded
>   Normal   TaskFailed             20s   yunikorn  Task dev-nnjxy/pod-btv0y is failed{code}
> The pod stays pending and does not reach a terminal state:
> {code:java}
> 2024-09-20T11:22:27.601Z    INFO    shim.fsm    cache/task_state.go:381    
> Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
> 2024-09-20T11:22:37.606Z    DEBUG    shim.cache.task    cache/task.go:499    
> prepare to send release request    {"applicationID": 
> "yunikorn-dev-03c96-autogen", "taskID": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated", 
> "terminationType": ""}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler    
> scheduler/scheduler.go:117    enqueued event    {"eventType": 
> "*rmevent.RMUpdateAllocationEvent", "event": 
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task
>  
> completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-09-20T11:22:37.606Z    ERROR    shim.cache.task    cache/task.go:475    
> task failed    {"appID": "yunikorn-dev-03c96-autogen", "taskID": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod 
> failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline 
> exceeded"}
> 2024-09-20T11:22:37.606Z    INFO    shim.fsm    cache/task_state.go:381    
> Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", 
> "source": "Allocated", "destination": "Failed", "event": "TaskFail"}
> 2024-09-20T11:22:37.606Z    INFO    core.scheduler.partition    
> scheduler/partition.go:1359    removing allocation from application    
> {"appID": "yunikorn-dev-03c96-autogen", "allocationKey": 
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:132 
>    Decreasing resource usage    {"user": "kubernetes-admin", "queue path": 
> "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": 
> "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:152 
>    Decreasing resource usage for user    {"user": "kubernetes-admin", "queue 
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", 
> "group": "", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": 
> "root", "hierarchy": ["root", "dev-03c96"], "application": 
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z    DEBUG    core.scheduler.ugm    
> ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": 
> "root.dev-03c96", "hierarchy": ["dev-03c96"], "application": 
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11

[jira] [Updated] (YUNIKORN-2832) [core] Add non-YuniKorn allocation tracking logic

2024-09-27 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2832:
---
Summary: [core] Add non-YuniKorn allocation tracking logic  (was: [core] 
Add non-Yunikorn allocation tracking logic)

> [core] Add non-YuniKorn allocation tracking logic
> -
>
> Key: YUNIKORN-2832
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2832
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (YUNIKORN-2354) Add detailed queue information to new queue view

2024-09-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2354:
---
Summary: Add detailed queue information to new queue view   (was: Visualize 
the current queue that YuniKorn is using)

> Add detailed queue information to new queue view 
> -
>
> Key: YUNIKORN-2354
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2354
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Dong-Lin Hsieh
>Assignee: Dong-Lin Hsieh
>Priority: Major
>  Labels: pull-request-available
>
> # another tab page
>  # additional queue info (running applications)






[jira] [Resolved] (YUNIKORN-2354) Add detailed queue information to new queue view

2024-09-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2354.

Fix Version/s: 1.7.0
   Resolution: Fixed

Merged to master.

> Add detailed queue information to new queue view 
> -
>
> Key: YUNIKORN-2354
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2354
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Dong-Lin Hsieh
>Assignee: Dong-Lin Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> # another tab page
>  # additional queue info (running applications)






[jira] [Resolved] (YUNIKORN-2254) Display MaxRunningApps and RunningApps on Queue View

2024-09-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2254.

 Fix Version/s: 1.7.0
Target Version: 1.7.0
Resolution: Fixed

Merged to master.

> Display MaxRunningApps and RunningApps on Queue View
> 
>
> Key: YUNIKORN-2254
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2254
> Project: Apache YuniKorn
>  Issue Type: Wish
>  Components: webapp
>Reporter: Chia-Ping Tsai
>Assignee: Yun Sun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> The queue view already shows resource-related information, but it lacks 
> application-related information.






[jira] [Updated] (YUNIKORN-2254) Display MaxRunningApps and RunningApps on Queue View

2024-09-24 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit updated YUNIKORN-2254:
---
Summary: Display MaxRunningApps and RunningApps on Queue View  (was: queue 
view should display MaxRunningApps and RunningApps)

> Display MaxRunningApps and RunningApps on Queue View
> 
>
> Key: YUNIKORN-2254
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2254
> Project: Apache YuniKorn
>  Issue Type: Wish
>  Components: webapp
>Reporter: Chia-Ping Tsai
>Assignee: Yun Sun
>Priority: Minor
>
> The queue view already shows resource-related information, but it lacks 
> application-related information.





