[jira] [Resolved] (YUNIKORN-2927) Update MockScheduler test case with foreign pod resource update
[ https://issues.apache.org/jira/browse/YUNIKORN-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2927. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Update MockScheduler test case with foreign pod resource update > --- > > Key: YUNIKORN-2927 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2927 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2966) Not all tags are created for foreign allocations
[ https://issues.apache.org/jira/browse/YUNIKORN-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2966. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Not all tags are created for foreign allocations > > > Key: YUNIKORN-2966 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2966 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > When we create an Allocation request to the core, we don't populate > allocation tags properly in {{CreateAllocationForForeignPod()}}. We miss > the call to {{CreateTagsForTask()}}, which adds a number of useful tags such > as namespace, pod name, labels, etc.
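The missing step described in YUNIKORN-2966 can be pictured with a small sketch. It is not the shim's code: `podInfo`, `buildForeignPodTags` and the `example/...` tag keys are hypothetical names invented here, and the real `CreateTagsForTask()` may produce different keys. It only shows the idea of deriving allocation tags from namespace, pod name and labels.

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical, trimmed-down stand-in for the pod metadata the
// shim reads. It is not the Kubernetes v1.Pod type.
type podInfo struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// Hypothetical tag keys, chosen for this illustration only.
const (
	tagNamespace   = "example/namespace"
	tagPodName     = "example/pod-name"
	tagLabelPrefix = "example/label/"
)

// buildForeignPodTags collects namespace, pod name and labels into one tag
// map, so that an allocation created for a foreign pod carries the same kind
// of metadata as a regular task.
func buildForeignPodTags(pod podInfo) map[string]string {
	tags := make(map[string]string, len(pod.Labels)+2)
	tags[tagNamespace] = pod.Namespace
	tags[tagPodName] = pod.Name
	for k, v := range pod.Labels {
		tags[tagLabelPrefix+k] = v
	}
	return tags
}

func main() {
	tags := buildForeignPodTags(podInfo{
		Namespace: "kube-system",
		Name:      "coredns-0",
		Labels:    map[string]string{"app": "dns"},
	})
	// Print in sorted key order so the output is stable.
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, tags[k])
	}
}
```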
[jira] [Closed] (YUNIKORN-2962) Governance clarification: guidance requested on extending Yunikorn core functionality
[ https://issues.apache.org/jira/browse/YUNIKORN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2962. -- Resolution: Information Provided > Governance clarification: guidance requested on extending Yunikorn core > functionality > - > > Key: YUNIKORN-2962 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2962 > Project: Apache YuniKorn > Issue Type: Task >Reporter: David Gantenbein >Priority: Major > > Hey, > > If you’re not aware, G-Research Open Source Software (GR-OSS), has been > working in and around the YuniKorn ecosystem for the last several months. > We’ve actively contributed a number of enhancements to the project, along > with several features related to our need for a persistent record of YuniKorn > events merged[0][1][2]. > > However, recently many of our contributions to Apache YuniKorn appear to have > been reverted unilaterally with minimal explanation. It’s unclear to us where > the open discussion about this removal, as required by the ASF Code of > Conduct, occurred for this – we’re interested and would like to participate > in those technical discussions in the future. We’ve tried to glean the > primary points of contention here: > > First, our choice of name for the (formerly) YuniKorn-history-server project > was unwise given YuniKorn is a trademark of the Apache Software Foundation > (ASF). We’ve rectified this issue by renaming the project to > unicorn-history-server. The original name was driven by our hope that > unicorn-history-server may one day find its home as an official part of the > YuniKorn. We hope our swift resolution of this concern is evidence of our > commitment to the same open philosophies held by the ASF. > > Next, Craig Condit stated a concern[3] that our changes were geared towards > permitting proprietary extensions to YuniKorn. GR-OSS is an open source > policy office that does not write any proprietary code as part of our > mission. 
In fact, the unicorn-history-server is Apache 2.0 licensed, just > like YuniKorn. Our team made extensive efforts to devise the most minimally > intrusive changes possible after it was suggested to us that it was better to > be out of tree – we’d be thrilled if the solution to this problem would be > the unicorn-history-server being adopted as part of yunikorn-core; in lieu of > this, keeping a plugin mechanism is a base-level requirement for the > unicorn-history-server to function. We hope that our reputation as good > upstream citizens and operators can help you understand that we have no > hidden agendas – we aren’t even a product company. > > Finally, it was suggested[4] that the getApplication API endpoint was > inappropriate due to its exposure of internal YuniKorn data structures. We’re > open to feedback regarding this feature and how to improve it, but again, > we’re confused as to where these discussions are happening and how to get > involved in them. In the original proposal of this feature[5], we added tests > and modified the implementation at the request of project maintainers – it’s > upsetting to have all that work and cooperation discarded without even a > conversation. > > Our desire is to remain a part of the YuniKorn community, but we’re very > confused about the governance and technical design process for the project – > according to [https://yunikorn.apache.org/community/people/], the maintainers > reverting our patches are at the same leadership level as those who approved > the patches originally. Can we get some clarity on the reasoning for > reverting the patches and documentation of the open community collaboration, > as required by the ASF Code of Conduct, that precipitated this removal – this > appeared to have been mentioned in the October 30th meeting[6], but this date > is after the revert of the patches, so we assume there must have been another > discussion. 
> > If there’s anything further we need to do in order to spawn the technical > dialogue needed to address your concerns with unicorn-history-server and the > supporting elements, please let us know. Our desire is to implement something > in an open way that meets not only the needs of G-Research, but also those of > the overall YuniKorn community. > > Thanks for your time, > Rich Scott, Open Source Developer > Denis Coric, Open Source Developer > Jay Faulkner, Open Source Developer > Dave Gantenbein, Director of Software Development > Alexander Scammon, Head of Open Source Development > G-Research Open Source Software > > 0: https://issues.apache.org/jira/browse/YUNIKORN-2606 > 1: https://issues.apache.org/jira/browse/YUNIKORN-2652 > 2: [http://tiny.cc/ag5tzz] > 3: https://issues.apache.org/jira/browse/YUNIKORN-29
[jira] [Commented] (YUNIKORN-2962) Governance clarification: guidance requested on extending Yunikorn core functionality
[ https://issues.apache.org/jira/browse/YUNIKORN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896047#comment-17896047 ] Craig Condit commented on YUNIKORN-2962: With regard to the ASF Code of Conduct, no violations (alleged or otherwise) have occurred. The ASF Code of Conduct says nothing whatsoever about "requiring open discussion" about code reverts. That said, in each case that you have made this assertion, we have, in fact, operated by your own standard (i.e. "open discussion"). The open discussion happened on the JIRA and GitHub PRs for the relevant issues under discussion here. A point of correction: there have not been "several contributions reverted". There has only been *one* previously-committed PR that has been subsequently reverted: YUNIKORN-2606 (Modular sidebar with remote components), which was reverted by YUNIKORN-2954. This JIRA was not "unilaterally reverted" – in fact, YUNIKORN-2954 was submitted by a PMC member (myself) along with relevant documentation as to why, and approved by [~pbacsko], another PMC member. In between, YUNIKORN-2949 (Load external Scheduler Service using Module Federation), which you failed to mention here, was submitted, building upon YUNIKORN-2606 and if committed, would have wholesale replaced huge portions of the YuniKorn Web UI without any user-visible indication that what was being displayed on screen was not, in fact, a part of Apache YuniKorn. This was rejected, but its submission triggered further review of YUNIKORN-2606 (the implications of which had not yet been fully understood) and it became apparent that this was not a direction we wanted to pursue. I opened YUNIKORN-2954 and provided my justification *in the JIRA description and pull request*. Nothing was done arbitrarily or in secret as you allege.
Additionally, the assertion that the "patches" (only one in fact) were reverted by maintainers at the "same leadership level as those who approved the patch[es] originally" is false – YUNIKORN-2606 was approved by a single committer, and the reversion was submitted by a PMC member and approved by a second PMC member. The only other potential revert that is on the table is YUNIKORN-2925 (Remove internal objects from application REST endpoint), which was created by [~wilfreds], another PMC member. I agree with this revert as well; we don't want to have internal objects in the REST API, nor historical information. That is what we have built the YuniKorn event system for. [~pbacsko], who approved the original PR, has also commented on the reversion with additional REST API endpoints that should probably be cleaned up as well. This would seem to indicate that he too has had a change of heart regarding the wisdom of keeping the original PR intact. The fact is, we don't revert commits arbitrarily or frequently, but sometimes things slip through and we need to course-correct. Now for some less-technical points... Project naming of G-Research History Server: Simply changing the spelling of "yunikorn" to "unicorn" is not sufficient differentiation under U.S. Trademark Law, as this would almost certainly run afoul of the [confusingly similar|https://www.law.cornell.edu/wex/confusingly_similar] test. Some possible suggestions: Use your company name (i.e. G-Research History Server), a generic identifier (Scheduling History Server) or pick a distinct project name, i.e. Monocerus, another mythical creature related to the unicorn. Trademark law also allows you to reference trademarked entities in your documentation. For example, this would be okay: "G-Research History Server is a history service designed to integrate with the Apache YuniKorn scheduler for Kubernetes". This makes clear that your project is independent, while also providing clarity as to its purpose. 
Ultimately, it's your project – name it whatever you want (while respecting trademark law). Regarding the comment that "it was suggested to us that it was better [for the history server] to be out of tree", this was discussed during the initial proposal by G-Research of the history server during the May 1, 2024 YuniKorn community meeting. As I recall, there were significant concerns raised about the validity of augmenting the REST API with history information. We were by this point well underway with designs and development of the (now mature) event system for YuniKorn, where real-time events would be emitted to an external consumer. A very large motivator for that design was to ensure that a future history server (yes, you're not the first to think of one) would be able to scale well and not bog down YuniKorn itself with non-scheduler overhead. The G-Research approach was very much at odds with that (already agreed upon) direction. When these concerns were raised, it was suggested that perhaps the G-Research history server would be better d
[jira] [Commented] (YUNIKORN-2925) Remove internal objects from application REST response
[ https://issues.apache.org/jira/browse/YUNIKORN-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895394#comment-17895394 ] Craig Condit commented on YUNIKORN-2925: Historical information has no place in the REST API at all. That's what the event system is for. > Remove internal objects from application REST response > -- > > Key: YUNIKORN-2925 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2925 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - common >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: release-notes > > The REST API for application objects exposes an internal object type > (resource) directly without conversion. That means any internal > representation change will break REST compatibility. This should never have > happened and needs to be reversed ASAP. All other REST calls > The other problem with the exposed information is that it is only accurate > for the COMPLETING or COMPLETED state of an application. The data is > incomplete at any other state as it is only updated when an allocation > finishes. Running allocations are not included.
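The compatibility concern in YUNIKORN-2925 (an internal type serialized directly, without conversion) can be sketched as follows. All names here (`internalResource`, `resourceDAO`, `applicationDAO`, `toDAO`) are hypothetical and do not mirror YuniKorn's actual types; the sketch only assumes the usual pattern of copying internal state into a dedicated, stable response type before encoding it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// internalResource is a hypothetical internal type. Its layout is free to
// change between releases, so it should not be serialized directly.
type internalResource struct {
	quantities map[string]int64
}

// resourceDAO is the hypothetical, stable REST representation.
type resourceDAO map[string]int64

// toDAO copies the internal values into the REST type. Callers get their own
// map, so later changes to the internal object do not leak into a response.
func toDAO(r *internalResource) resourceDAO {
	dao := make(resourceDAO)
	if r == nil {
		return dao
	}
	for name, value := range r.quantities {
		dao[name] = value
	}
	return dao
}

// applicationDAO is the hypothetical response body for one application.
type applicationDAO struct {
	ApplicationID string      `json:"applicationID"`
	UsedResource  resourceDAO `json:"usedResource"`
}

func main() {
	internal := &internalResource{quantities: map[string]int64{"memory": 2048, "vcore": 500}}
	body, err := json.Marshal(applicationDAO{
		ApplicationID: "app-1",
		UsedResource:  toDAO(internal),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With a conversion layer like this, the internal representation can be reworked and only `toDAO` needs to follow, while the JSON shape seen by REST clients stays the same.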
[jira] [Resolved] (YUNIKORN-2953) Placeholder release count incorrect
[ https://issues.apache.org/jira/browse/YUNIKORN-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2953. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Placeholder release count incorrect > --- > > Key: YUNIKORN-2953 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2953 > Project: Apache YuniKorn > Issue Type: Task > Components: core - scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Even after YUNIKORN-2926 we have not fully fixed the placeholder release > count issue. > The release of allocated placeholders is counted twice on timeout: first on > release, as part of the cleanup that is triggered, and then again when the > allocation is actually removed.
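The double counting described in YUNIKORN-2953 happens because two code paths report the same placeholder. One generic way to avoid that is to make the counter update idempotent per allocation, as in the sketch below. This is an assumption made for illustration, not the fix that was merged; `placeholderTracker` and `markReleased` are hypothetical names.

```go
package main

import "fmt"

// placeholderTracker is a hypothetical model of the release counter. It is
// not YuniKorn's data structure.
type placeholderTracker struct {
	released map[string]bool // allocation keys already counted
	timedOut int             // placeholders counted as released on timeout
}

func newPlaceholderTracker() *placeholderTracker {
	return &placeholderTracker{released: make(map[string]bool)}
}

// markReleased counts a placeholder at most once, no matter how many code
// paths report it. It returns true only when the counter was changed.
func (p *placeholderTracker) markReleased(allocationKey string) bool {
	if p.released[allocationKey] {
		return false
	}
	p.released[allocationKey] = true
	p.timedOut++
	return true
}

func main() {
	tracker := newPlaceholderTracker()
	// First report: the timeout cleanup releases the placeholder.
	tracker.markReleased("ph-1")
	// Second report: the allocation is actually removed later on.
	tracker.markReleased("ph-1")
	fmt.Println("timed out placeholders:", tracker.timedOut)
}
```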
[jira] [Resolved] (YUNIKORN-2956) Fix layout break on Queues v2 page
[ https://issues.apache.org/jira/browse/YUNIKORN-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2956. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Fix layout break on Queues v2 page > -- > > Key: YUNIKORN-2956 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2956 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0
[jira] [Resolved] (YUNIKORN-2928) [core] Update foreign pod resource usage
[ https://issues.apache.org/jira/browse/YUNIKORN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2928. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > [core] Update foreign pod resource usage > > > Key: YUNIKORN-2928 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2928 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0
[jira] [Resolved] (YUNIKORN-2951) Remove unnecessary locking from RequiredNodePreemptor
[ https://issues.apache.org/jira/browse/YUNIKORN-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2951. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Remove unnecessary locking from RequiredNodePreemptor > - > > Key: YUNIKORN-2951 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2951 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Manikandan R >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: newbie, pull-request-available > Fix For: 1.7.0 > > > RequiredNodePreemptor uses a lock in some places before doing reads and > writes. Based on the assessment, there is no reason to use locks, so they > should be removed.
[jira] [Reopened] (YUNIKORN-2606) Modular sidebar with remote components
[ https://issues.apache.org/jira/browse/YUNIKORN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit reopened YUNIKORN-2606: > Modular sidebar with remote components > -- > > Key: YUNIKORN-2606 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2606 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Attachments: image-2024-05-07-18-25-08-070.png > > > -We need a link to the external application that will display logs and more > details about the application or the pod itself.- > -External URLs can be defined in the form of a string template that can be > set as an env variable.- > -If the variable is present on build time, the Logs link will be visible on > the UI.- > To minimize changes in YuniKorn itself and enable maximal customization > and easy connection with the YuniKorn History Server (YHS) that is being > developed, the easiest solution would be to add an externally loaded component > by using module federation. Components will be served by the YHS server > (changes on YHS endpoints would reflect in web components as well) and loaded > in YuniKorn web with Module Federation. > This ticket should add the required configuration for loading a custom module > that will be enabled through the env variables. If env is not set, YuniKorn > will work as usual (no changes to the default behavior) > !image-2024-05-07-18-25-08-070.png|width=1240,height=647!
[jira] [Updated] (YUNIKORN-2951) Remove unnecessary locking from RequiredNodePreemptor
[ https://issues.apache.org/jira/browse/YUNIKORN-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2951: --- Summary: Remove unnecessary locking from RequiredNodePreemptor (was: RequiredNodePreemptor doesn't require lock) > Remove unnecessary locking from RequiredNodePreemptor > - > > Key: YUNIKORN-2951 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2951 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Manikandan R >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: newbie, pull-request-available > > RequiredNodePreemptor uses a lock in some places before doing reads and > writes. Based on the assessment, there is no reason to use locks, so they > should be removed.
[jira] [Resolved] (YUNIKORN-2609) Improve visual style of the Web UI
[ https://issues.apache.org/jira/browse/YUNIKORN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2609. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed > Improve visual style of the Web UI > -- > > Key: YUNIKORN-2609 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2609 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: Denis Coric >Assignee: JunHong Peng >Priority: Major > Labels: newbie, pull-request-available > Fix For: 1.7.0 > > > Implement required CSS changes to tweak the overall look and feel of the web > UI. > The full design can be previewed on this link: [ > [DESIGN|https://xd.adobe.com/view/1d84899f-72a8-472f-b03f-de40451b0956-48d7/] > ] > This should include: > * Fix padding/margin values > * Add rounding on elements to match the design (menu selection, dropdowns, > etc) > * Fix font weight on visual elements to match the design > _Note: Queues page can be skipped as it is being redesigned in YUNIKORN-2341_
[jira] [Closed] (YUNIKORN-2606) Modular sidebar with remote components
[ https://issues.apache.org/jira/browse/YUNIKORN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2606. -- Fix Version/s: (was: 1.7.0) Target Version: (was: 1.7.0) Resolution: Won't Do Removed in YUNIKORN-2954. > Modular sidebar with remote components > -- > > Key: YUNIKORN-2606 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2606 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > Attachments: image-2024-05-07-18-25-08-070.png > > > -We need a link to the external application that will display logs and more > details about the application or the pod itself.- > -External URLs can be defined in the form of a string template that can be > set as an env variable.- > -If the variable is present on build time, the Logs link will be visible on > the UI.- > To minimize changes in YuniKorn itself and enable maximal customization > and easy connection with the YuniKorn History Server (YHS) that is being > developed, the easiest solution would be to add an externally loaded component > by using module federation. Components will be served by the YHS server > (changes on YHS endpoints would reflect in web components as well) and loaded > in YuniKorn web with Module Federation. > This ticket should add the required configuration for loading a custom module > that will be enabled through the env variables. If env is not set, YuniKorn > will work as usual (no changes to the default behavior) > !image-2024-05-07-18-25-08-070.png|width=1240,height=647!
[jira] [Updated] (YUNIKORN-2954) Remove so-called modular sidebar
[ https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2954: --- Target Version: (was: 1.7.0) > Remove so-called modular sidebar > > > Key: YUNIKORN-2954 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2954 > Project: Apache YuniKorn > Issue Type: Task > Components: webapp >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > > We need to revert YUNIKORN-2606, as it should never have been merged. It has > become clear that it exists only to provide invasive hooks for adding > proprietary and/or non-standard components to YuniKorn. It also opens up > YuniKorn to potential remote code execution vulnerabilities. This goes > against our open development philosophy.
[jira] [Closed] (YUNIKORN-2954) Remove so-called modular sidebar
[ https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2954. -- > Remove so-called modular sidebar > > > Key: YUNIKORN-2954 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2954 > Project: Apache YuniKorn > Issue Type: Task > Components: webapp >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > > We need to revert YUNIKORN-2606, as it should never have been merged. It has > become clear that it exists only to provide invasive hooks for adding > proprietary and/or non-standard components to YuniKorn. It also opens up > YuniKorn to potential remote code execution vulnerabilities. This goes > against our open development philosophy.
[jira] [Updated] (YUNIKORN-2954) Remove so-called modular sidebar
[ https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2954: --- Fix Version/s: (was: 1.7.0) > Remove so-called modular sidebar > > > Key: YUNIKORN-2954 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2954 > Project: Apache YuniKorn > Issue Type: Task > Components: webapp >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > > We need to revert YUNIKORN-2606, as it should never have been merged. It has > become clear that it exists only to provide invasive hooks for adding > proprietary and/or non-standard components to YuniKorn. It also opens up > YuniKorn to potential remote code execution vulnerabilities. This goes > against our open development philosophy.
[jira] [Resolved] (YUNIKORN-2954) Remove so-called modular sidebar
[ https://issues.apache.org/jira/browse/YUNIKORN-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2954. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Remove so-called modular sidebar > > > Key: YUNIKORN-2954 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2954 > Project: Apache YuniKorn > Issue Type: Task > Components: webapp >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > We need to revert YUNIKORN-2606, as it should never have been merged. It has > become clear that it exists only to provide invasive hooks for adding > proprietary and/or non-standard components to YuniKorn. It also opens up > YuniKorn to potential remote code execution vulnerabilities. This goes > against our open development philosophy.
[jira] [Created] (YUNIKORN-2954) Remove so-called modular sidebar
Craig Condit created YUNIKORN-2954: -- Summary: Remove so-called modular sidebar Key: YUNIKORN-2954 URL: https://issues.apache.org/jira/browse/YUNIKORN-2954 Project: Apache YuniKorn Issue Type: Task Components: webapp Reporter: Craig Condit Assignee: Craig Condit We need to revert YUNIKORN-2606, as it should never have been merged. It has become clear that it exists only to provide invasive hooks for adding proprietary and/or non-standard components to YuniKorn. It also opens up YuniKorn to potential remote code execution vulnerabilities. This goes against our open development philosophy.
[jira] [Resolved] (YUNIKORN-2908) Remove associated metrics when queue is removed
[ https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2908. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Remove associated metrics when queue is removed > --- > > Key: YUNIKORN-2908 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2908 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Hengzhe Guo >Assignee: Hengzhe Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > 1. After a queue is removed, its metrics continue to be reported by > Prometheus. This is fine for metrics like allocated resource because they > will just be 0, but it does not make sense for guaranteed and max resources, > as it gives the wrong impression that resources are still given to the queue. I > propose to unregister all of this queue's metrics when it is removed. > 2. If the queue is not removed but the guaranteed or max resource config is removed, > or just a resource type is removed from the config, the metrics are also not > cleaned up. These metrics are only updated when there is a new valid value, > not for a 'null' value. I propose to always delete all existing guaranteed and > max resource metrics of the queue, then add back the new values, every time > we apply the configs.
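The two proposals in YUNIKORN-2908 (drop everything when the queue goes away, and "delete all, then add back" on every config update) can be sketched with a plain in-memory model. This is a hedged illustration, not the merged change and not the Prometheus client API: `queueGauges`, `applyConfig` and `removeQueue` are hypothetical names, and nested maps stand in for labelled gauges.

```go
package main

import "fmt"

// queueGauges is a hypothetical in-memory stand-in for two labelled gauge
// families, keyed by queue name and then by resource type.
type queueGauges struct {
	guaranteed map[string]map[string]float64
	max        map[string]map[string]float64
}

func newQueueGauges() *queueGauges {
	return &queueGauges{
		guaranteed: make(map[string]map[string]float64),
		max:        make(map[string]map[string]float64),
	}
}

// cloneValues copies a value map so the caller's config is not aliased.
func cloneValues(in map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// removeQueue drops every guaranteed and max gauge that belongs to the queue.
func (g *queueGauges) removeQueue(queue string) {
	delete(g.guaranteed, queue)
	delete(g.max, queue)
}

// applyConfig follows the "delete all, then add back" proposal: stale
// resource types disappear even when the new config simply omits them.
func (g *queueGauges) applyConfig(queue string, guaranteed, maxRes map[string]float64) {
	g.removeQueue(queue)
	if len(guaranteed) > 0 {
		g.guaranteed[queue] = cloneValues(guaranteed)
	}
	if len(maxRes) > 0 {
		g.max[queue] = cloneValues(maxRes)
	}
}

func main() {
	g := newQueueGauges()
	g.applyConfig("root.batch",
		map[string]float64{"memory": 4096, "vcore": 2000},
		map[string]float64{"memory": 8192})
	// The new config drops vcore from guaranteed and removes max entirely.
	g.applyConfig("root.batch", map[string]float64{"memory": 4096}, nil)
	fmt.Println(g.guaranteed["root.batch"], g.max["root.batch"])
	// Removing the queue leaves nothing behind to report.
	g.removeQueue("root.batch")
	fmt.Println(len(g.guaranteed), len(g.max))
}
```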
[jira] [Updated] (YUNIKORN-2908) Remove associated metrics when queue is removed
[ https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2908: --- Summary: Remove associated metrics when queue is removed (was: metrics not removed when a queue is removed) > Remove associated metrics when queue is removed > --- > > Key: YUNIKORN-2908 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2908 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Hengzhe Guo >Assignee: Hengzhe Guo >Priority: Major > Labels: pull-request-available > > 1. After a queue is removed, its metrics continue to be reported by > Prometheus. This is fine for metrics like allocated resource because they > will just be 0, but it does not make sense for guaranteed and max resources, > as it gives the wrong impression that resources are still given to the queue. I > propose to unregister all of this queue's metrics when it is removed. > 2. If the queue is not removed but the guaranteed or max resource config is removed, > or just a resource type is removed from the config, the metrics are also not > cleaned up. These metrics are only updated when there is a new valid value, > not for a 'null' value. I propose to always delete all existing guaranteed and > max resource metrics of the queue, then add back the new values, every time > we apply the configs.
[jira] [Resolved] (YUNIKORN-2948) Add MockScheduler test which verifies foreign pod tracking
[ https://issues.apache.org/jira/browse/YUNIKORN-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2948. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Add MockScheduler test which verifies foreign pod tracking > -- > > Key: YUNIKORN-2948 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2948 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Based on the design docs, we should create a MockScheduler-based unit test in > the shim that validates foreign pod tracking.
[jira] [Updated] (YUNIKORN-2948) Add MockScheduler test which verifies foreign pod tracking
[ https://issues.apache.org/jira/browse/YUNIKORN-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2948: --- Summary: Add MockScheduler test which verifies foreign pod tracking (was: [shim] Write MockScheduler test which verifies foreign pod tracking) > Add MockScheduler test which verifies foreign pod tracking > -- > > Key: YUNIKORN-2948 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2948 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > > Based on the design docs, we should create a MockScheduler-based unit test in > the shim that validates foreign pod tracking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2931) Create foreign pod e2e tests
[ https://issues.apache.org/jira/browse/YUNIKORN-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2931. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Create foreign pod e2e tests > - > > Key: YUNIKORN-2931 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2931 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2949) Load external Scheduler Service using Module Federation
[ https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892835#comment-17892835 ] Craig Condit commented on YUNIKORN-2949: Also, the so-called "YuniKorn History Service" is appropriating an Apache trademark without permission. It cannot be called that, as it gives the impression it is an official Apache YuniKorn project. > Load external Scheduler Service using Module Federation > --- > > Key: YUNIKORN-2949 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2949 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > > Add an option to load external Scheduler Service in Applications View using > the Module Federation. > This will only be enabled if the correct env variables are set. If not, the > application must behave as is. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2949) Load external Scheduler Service using Module Federation
[ https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892834#comment-17892834 ] Craig Condit commented on YUNIKORN-2949: This has got to stop. We shouldn't be adding hooks for proprietary or unsupported third-party hooks into the YuniKorn codebase. If there's meant to be an official history service, it should be done under the Apache umbrella. I'm a firm -1 on this. > Load external Scheduler Service using Module Federation > --- > > Key: YUNIKORN-2949 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2949 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > > Add an option to load external Scheduler Service in Applications View using > the Module Federation. > This will only be enabled if the correct env variables are set. If not, the > application must behave as is. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Closed] (YUNIKORN-2949) Load external Scheduler Service using Module Federation
[ https://issues.apache.org/jira/browse/YUNIKORN-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2949. -- Resolution: Won't Do > Load external Scheduler Service using Module Federation > --- > > Key: YUNIKORN-2949 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2949 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > > Add an option to load external Scheduler Service in Applications View using > the Module Federation. > This will only be enabled if the correct env variables are set. If not, the > application must behave as is. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2943) Fix typo in Prometheus monitoring guide
[ https://issues.apache.org/jira/browse/YUNIKORN-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2943. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Fix typo in Prometheus monitoring guide > --- > > Key: YUNIKORN-2943 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2943 > Project: Apache YuniKorn > Issue Type: Improvement > Components: documentation >Reporter: Tzu-Hua Lan >Assignee: Tzu-Hua Lan >Priority: Trivial > Labels: pull-request-available > Fix For: 1.7.0 > > > Fix a typo in the Prometheus and Grafana monitoring > [documentation|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#3-use-service-mointor-to-define-monitor-yunikorn-service-target]. > Change: > - Before: "3. Use Service Mointor to Define monitor yunikorn service target" > - After: "3. Use Service Monitor to Define monitor yunikorn service target" > This fixes the misspelling of "Monitor" in the section heading. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2941) Remove plugin mode from the install section of Getting Started
[ https://issues.apache.org/jira/browse/YUNIKORN-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2941: --- Summary: Remove plugin mode from the install section of Getting Started (was: Remove plugin mode from the install section of Get Started) > Remove plugin mode from the install section of Getting Started > -- > > Key: YUNIKORN-2941 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2941 > Project: Apache YuniKorn > Issue Type: Improvement > Components: website >Reporter: Michael Chu >Assignee: Michael Chu >Priority: Minor > Labels: newbie, pull-request-available > > Since plugin mode is now deprecated and will be removed in a future release, > it would be better to add a deprecated tag to the plugin mode in the install > section to prevent any confusion. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2894) Update KubeRay operator documentation for YuniKorn integration
[ https://issues.apache.org/jira/browse/YUNIKORN-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2894: --- Summary: Update KubeRay operator documentation for YuniKorn integration (was: [Docs][RayCluster]update KubeRay operator documentation for YuniKorn integration) > Update KubeRay operator documentation for YuniKorn integration > -- > > Key: YUNIKORN-2894 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2894 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Hsien-Cheng(Ryan) Huang >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: pull-request-available > > KubeRay now supports gang scheduling via this PR: https://github.com/ray-project/kuberay/pull/2396 > It has been available since the 1.2.0 release: https://github.com/ray-project/kuberay/releases/tag/v1.2.0 > Proposed modifications: > 1. Specify the version update to v1.2.2. > 2. Update the documentation based on the Ray docs: https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/yunikorn.html
[jira] [Resolved] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboard
[ https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2825. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Fix the job name of pprof dashboard in Grafana dashboard > > > Key: YUNIKORN-2825 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2825 > Project: Apache YuniKorn > Issue Type: Bug > Components: documentation >Reporter: Yu-Lin Chen >Assignee: Tzu-Hua Lan >Priority: Major > Labels: newbie, pull-request-available > Fix For: 1.7.0 > > Attachments: image-2024-08-21-16-49-46-234.png, image-2024-08-21-16-50-44-173.png > > > After following the steps in "[Prometheus and Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]" to deploy Grafana, if you import the pprof dashboard through "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]", no metrics are displayed. > The reason is that the job name in yunikorn-pprof.json doesn't match the one we get from the Prometheus operator. > We should fix the job name in yunikorn-pprof.json. > For example: > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}", > {code} > should change to > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}", > {code}
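Why the original expression in YUNIKORN-2825 finds nothing can be illustrated with a small sketch. Prometheus regex matchers (`=~`) are fully anchored, so `job=~"yunikorn"` only matches a job named exactly `yunikorn`. The helper `label_matches` below is hypothetical, and Python's `re` module only approximates the RE2 syntax Prometheus uses; for the plain patterns shown here the two behave the same.

```python
import re


def label_matches(op, pattern, value):
    """Approximate a Prometheus label matcher for the '=' and '=~' operators."""
    if op == "=":
        # Equality matcher: the label value must be identical.
        return value == pattern
    if op == "=~":
        # Regex matcher: anchored at both ends, hence fullmatch.
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unsupported operator: {op}")


# Job name taken from the issue text (what the Prometheus operator reports).
job = "yunikorn-service"
old_matches = label_matches("=~", "yunikorn", job)
new_matches = label_matches("=", "yunikorn-service", job)
```

Here `old_matches` is false and `new_matches` is true, which corresponds to the empty dashboard before the fix and the populated one after it.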
[jira] [Updated] (YUNIKORN-2941) Remove plugin mode from the install section of Get Started
[ https://issues.apache.org/jira/browse/YUNIKORN-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2941: --- Summary: Remove plugin mode from the install section of Get Started (was: Add a deprecated tag to plugin mode in the install section of Get Started) > Remove plugin mode from the install section of Get Started > -- > > Key: YUNIKORN-2941 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2941 > Project: Apache YuniKorn > Issue Type: Improvement > Components: website >Reporter: Michael Chu >Assignee: Michael Chu >Priority: Minor > Labels: newbie, pull-request-available > > Since plugin mode is now deprecated and will be removed in a future release, > it would be better to add a deprecated tag to the plugin mode in the install > section to prevent any confusion. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2945) Add punctuation for better clarity in Scheduler Configuration
[ https://issues.apache.org/jira/browse/YUNIKORN-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2945: --- Summary: Add punctuation for better clarity in Scheduler Configuration (was: Add punctuations for typos and better clarity in Scheduler Configuration) > Add punctuation for better clarity in Scheduler Configuration > - > > Key: YUNIKORN-2945 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2945 > Project: Apache YuniKorn > Issue Type: Improvement > Components: website >Reporter: Michael Chu >Assignee: Michael Chu >Priority: Minor > Labels: pull-request-available > > Add punctuation to fix typos and improve clarity. > Changes: > # "Pre emption setting" >> "Pre-emption setting" > # "user based" >> "user-based" > # "cluster-wide" >> "cluster wide" > # "In other words when the access control list of a queue does not allow access the parent queue is checked." >> "In other words, when the access control list of a queue does not allow access, the parent queue is checked."
[jira] [Updated] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboard
[ https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2825: --- Summary: Fix the job name of pprof dashboard in Grafana dashboard (was: Fix the job name of pprof dashboard in Grafana dashboar) > Fix the job name of pprof dashboard in Grafana dashboard > > > Key: YUNIKORN-2825 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2825 > Project: Apache YuniKorn > Issue Type: Bug > Components: documentation >Reporter: Yu-Lin Chen >Assignee: Tzu-Hua Lan >Priority: Major > Labels: newbie, pull-request-available > Attachments: image-2024-08-21-16-49-46-234.png, image-2024-08-21-16-50-44-173.png > > > After following the steps in "[Prometheus and Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]" to deploy Grafana, if you import the pprof dashboard through "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]", no metrics are displayed. > The reason is that the job name in yunikorn-pprof.json doesn't match the one we get from the Prometheus operator. > We should fix the job name in yunikorn-pprof.json. > For example: > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}", > {code} > should change to > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}", > {code}
[jira] [Updated] (YUNIKORN-2825) Fix the job name of pprof dashboard in Grafana dashboar
[ https://issues.apache.org/jira/browse/YUNIKORN-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2825: --- Summary: Fix the job name of pprof dashboard in Grafana dashboar (was: Fix the job name of pprof dashboard in Grafana dashboard example) > Fix the job name of pprof dashboard in Grafana dashboar > --- > > Key: YUNIKORN-2825 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2825 > Project: Apache YuniKorn > Issue Type: Bug > Components: documentation >Reporter: Yu-Lin Chen >Assignee: Tzu-Hua Lan >Priority: Major > Labels: newbie, pull-request-available > Attachments: image-2024-08-21-16-49-46-234.png, image-2024-08-21-16-50-44-173.png > > > After following the steps in "[Prometheus and Grafana|https://yunikorn.apache.org/docs/next/user_guide/observability/prometheus#deploy-prometheus-and-grafana-in-a-cluster]" to deploy Grafana, if you import the pprof dashboard through "[yunikorn-pprof.json|https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard]", no metrics are displayed. > The reason is that the job name in yunikorn-pprof.json doesn't match the one we get from the Prometheus operator. > We should fix the job name in yunikorn-pprof.json. > For example: > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=~\"yunikorn\"}", > {code} > should change to > {code:bash} > "expr": "go_memstats_heap_inuse_bytes{job=\"yunikorn-service\"}", > {code}
[jira] [Resolved] (YUNIKORN-2894) Update KubeRay operator documentation for YuniKorn integration
[ https://issues.apache.org/jira/browse/YUNIKORN-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2894. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Update KubeRay operator documentation for YuniKorn integration > -- > > Key: YUNIKORN-2894 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2894 > Project: Apache YuniKorn > Issue Type: Improvement >Reporter: Hsien-Cheng(Ryan) Huang >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > KubeRay now supports gang scheduling via this PR: https://github.com/ray-project/kuberay/pull/2396 > It has been available since the 1.2.0 release: https://github.com/ray-project/kuberay/releases/tag/v1.2.0 > Proposed modifications: > 1. Specify the version update to v1.2.2. > 2. Update the documentation based on the Ray docs: https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/yunikorn.html
[jira] [Resolved] (YUNIKORN-2848) Refactor preemption_queue_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2848. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Refactor preemption_queue_test.go > - > > Key: YUNIKORN-2848 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2848 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Manikandan R >Assignee: Hsien-Cheng(Ryan) Huang >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Refactor TestGetPreemptableResource test based on variables and syntax > constructs used in TestGetRemainingGuaranteedResource. > For example, variables defined for test resources like res1, res2 etc and > t.Run({}) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2824) Refactor preemption_test.go
[ https://issues.apache.org/jira/browse/YUNIKORN-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2824. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Refactor preemption_test.go > --- > > Key: YUNIKORN-2824 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2824 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Manikandan R >Assignee: Yun Sun >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Of late, a lot of new tests have been added to preemption_test.go for several use cases. There is room for improvement: the tests can be simplified, especially by avoiding duplication. For example, TestTryPreemption, TestTryPreemptionOnNode and TestTryPreemption_NodeWithCapacityLesserThanAsk can be merged into a single test that handles each case through the t.Run({}) construct. > We need to analyse the other tests as well and see whether they can be merged for simplification.
[jira] [Resolved] (YUNIKORN-2926) Placeholder counters incorrect
[ https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2926. Fix Version/s: 1.7.0 1.6.1 Resolution: Fixed Merged to master and cherry-picked to branch-1.6. > Placeholder counters incorrect > -- > > Key: YUNIKORN-2926 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2926 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: wangzhihui >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Fix For: 1.7.0, 1.6.1 > > Attachments: image-2024-10-15-11-54-33-458.png, image.png > > > desc: > Because the real allocation is larger than the placeholders, all allocations are released, which leaves all Pods in the Pending state. > !image-2024-10-15-11-54-33-458.png! > !image.png!
> {code:yaml}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "task-group-example",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "50M"
>             },
>             "nodeSelector": {},
>             "tolerations": [],
>             "affinity": {},
>             "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", ""]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M"
> {code}
> solution: > If the app is in Hard mode, it will transition to a Failing state. If it is in Soft mode, it will transition to a Resuming state.
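The mismatch that triggers YUNIKORN-2926 is visible in the manifest: the task group's minResource asks for 100m CPU per placeholder, while the real container requests 200m. A minimal sketch of a check for that condition follows. The helper names `parse_quantity` and `oversized_resources` are hypothetical, and the parser covers only a small subset of Kubernetes quantity suffixes; it is not how YuniKorn itself compares resources.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes, enough for this illustration.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
}


def parse_quantity(text):
    """Convert strings such as '100m' or '50M' into a Decimal base value."""
    # Try two-letter suffixes first so 'Mi' is not mistaken for a bare number.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def oversized_resources(placeholder, real):
    """Names of resources where the real request exceeds the placeholder."""
    return sorted(
        name for name, qty in real.items()
        if parse_quantity(qty) > parse_quantity(placeholder.get(name, "0"))
    )


# Values copied from the Job manifest in the issue.
placeholder = {"cpu": "100m", "memory": "50M"}
real_request = {"cpu": "200m", "memory": "50M"}
too_large = oversized_resources(placeholder, real_request)
```

For the manifest above, only the CPU request is flagged; the memory values are equal and pass the check.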
[jira] [Resolved] (YUNIKORN-2827) Remove unused columns in Nodes view
[ https://issues.apache.org/jira/browse/YUNIKORN-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2827. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Remove unused columns in Nodes view > --- > > Key: YUNIKORN-2827 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2827 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: Craig Condit >Assignee: Tzu-Hua Lan >Priority: Major > Labels: newbie, pull-request-available > Fix For: 1.7.0 > > > The nodes view currently has two attributes columns which are unused, as well as Rack Name and Host Name, which are always n/a. We should remove these from the view as they take almost half the horizontal space.
[jira] [Updated] (YUNIKORN-2926) Placeholder counters incorrect
[ https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2926: --- Summary: Placeholder counters incorrect (was: The Pod using gang scheduling is stuck in the Pending state) > Placeholder counters incorrect > -- > > Key: YUNIKORN-2926 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2926 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: wangzhihui >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: pull-request-available > Attachments: image-2024-10-15-11-54-33-458.png, image.png > > > desc: > Because the real allocation is larger than the placeholders, all allocations are released, which leaves all Pods in the Pending state. > !image-2024-10-15-11-54-33-458.png! > !image.png!
> {code:yaml}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "task-group-example",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "50M"
>             },
>             "nodeSelector": {},
>             "tolerations": [],
>             "affinity": {},
>             "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", ""]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M"
> {code}
> solution: > If the app is in Hard mode, it will transition to a Failing state. If it is in Soft mode, it will transition to a Resuming state.
[jira] [Resolved] (YUNIKORN-2913) Fix contrast issue in Applications view
[ https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2913. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Fix contrast issue in Applications view > --- > > Key: YUNIKORN-2913 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2913 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Minor > Labels: pull-request-available > Fix For: 1.7.0 > > > There is a small bug on the applications page: when an application is selected, resources that are printed using the mat-chip component are not visible due to low contrast. > The issue is hard to notice because the sidebar covers most of that table, but it can be seen at some unusual resolutions (or window sizes). > The solution is to set the font color to white on selected-row.
[jira] [Updated] (YUNIKORN-2913) Fix contrast issue in Applications view
[ https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2913: --- Priority: Minor (was: Major) Summary: Fix contrast issue in Applications view (was: FFix contrast issue in Applications view) > Fix contrast issue in Applications view > --- > > Key: YUNIKORN-2913 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2913 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Minor > Labels: pull-request-available > > There is a small bug on the applications page: when an application is selected, resources that are printed using the mat-chip component are not visible due to low contrast. > The issue is hard to notice because the sidebar covers most of that table, but it can be seen at some unusual resolutions (or window sizes). > The solution is to set the font color to white on selected-row.
[jira] [Updated] (YUNIKORN-2913) FFix contrast issue in Applications view
[ https://issues.apache.org/jira/browse/YUNIKORN-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2913: --- Summary: FFix contrast issue in Applications view (was: Applications view CSS issue) > FFix contrast issue in Applications view > > > Key: YUNIKORN-2913 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2913 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: Denis Coric >Assignee: Denis Coric >Priority: Major > Labels: pull-request-available > > There is a small bug on the applications page: when an application is selected, resources that are printed using the mat-chip component are not visible due to low contrast. > The issue is hard to notice because the sidebar covers most of that table, but it can be seen at some unusual resolutions (or window sizes). > The solution is to set the font color to white on selected-row.
[jira] [Updated] (YUNIKORN-2923) Fix invalid routerLink setting in header breadcrumb
[ https://issues.apache.org/jira/browse/YUNIKORN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2923: --- Priority: Trivial (was: Major) > Fix invalid routerLink setting in header breadcrumb > --- > > Key: YUNIKORN-2923 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2923 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Trivial > Labels: pull-request-available > Fix For: 1.7.0 > > > Currently, clicking a header breadcrumb navigates to '#/crumb.url'.
[jira] [Updated] (YUNIKORN-2349) Allow changing the orientation of the queue graph
[ https://issues.apache.org/jira/browse/YUNIKORN-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2349: --- Summary: Allow changing the orientation of the queue graph (was: Change the orientation of the queue SVG) > Allow changing the orientation of the queue graph > - > > Key: YUNIKORN-2349 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2349 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Dong-Lin Hsieh >Assignee: Dong-Lin Hsieh >Priority: Major > Labels: pull-request-available > > Change the orientation of the queue SVG -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2349) Allow changing the orientation of the queue graph
[ https://issues.apache.org/jira/browse/YUNIKORN-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2349. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Allow changing the orientation of the queue graph > - > > Key: YUNIKORN-2349 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2349 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: webapp >Reporter: Dong-Lin Hsieh >Assignee: Dong-Lin Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Change the orientation of the queue SVG -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2791) [Umbrella] Tracking non-Yunikorn allocations in the core
[ https://issues.apache.org/jira/browse/YUNIKORN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890545#comment-17890545 ] Craig Condit commented on YUNIKORN-2791: This is all shaping up nicely. One suggestion: I've been looking at the full state dump now that the feature is active, and it seems to me that we should populate allocationTags for foreign allocations as well. This makes things like pod name and other metadata available. Especially once we do the web UI changes to expose this information, we're going to want it. > [Umbrella] Tracking non-Yunikorn allocations in the core > > > Key: YUNIKORN-2791 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2791 > Project: Apache YuniKorn > Issue Type: New Feature > Components: core - scheduler, scheduler-interface, shim - kubernetes >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: release-notes > > Currently, we don't know what non-YK pods are assigned to a particular node > in the core. We only track the total amount of allocations as > {{occupiedResources}} object inside the {{objects.Node}} type. If the > tracking somehow becomes out of sync with the actual cluster state, it's very > difficult to know what went wrong, because these allocations are not shown in > the state dump. > In order to enhance supportability, we want to track all non-YK pods per node > on the core side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2896) [shim] Remove occupiedResource handling logic
[ https://issues.apache.org/jira/browse/YUNIKORN-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2896. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > [shim] Remove occupiedResource handling logic > - > > Key: YUNIKORN-2896 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2896 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2924) [core] Remove occupiedResource handling logic
[ https://issues.apache.org/jira/browse/YUNIKORN-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2924. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Resolving as this was merged to master. > [core] Remove occupiedResource handling logic > - > > Key: YUNIKORN-2924 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2924 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Comment Edited] (YUNIKORN-2929) Implement Skip Allocation Check for Unsuccessful Pods
[ https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889961#comment-17889961 ] Craig Condit edited comment on YUNIKORN-2929 at 10/16/24 6:39 AM: -- I don’t think this is wise. It’s tempting to look at this from a Spark-centric perspective, but this pattern could be detrimental to other application types. There’s also potentially wide-ranging side effects from aborting a scheduling cycle and restarting too quickly. I’m not in favor of this change. was (Author: ccondit): I don’t think this is wise. It’s tempting to look at this from a Spark-centric perspective, but this pattern could be detrimental to other application types. There’s also potentially wide-ranging side effects from shorting a scheduling cycle and restarting too quickly. I’m not in favor of this change. > Implement Skip Allocation Check for Unsuccessful Pods > -- > > Key: YUNIKORN-2929 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2929 > Project: Apache YuniKorn > Issue Type: Task > Components: core - scheduler >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Major > > Skip allocation attempts for subsequent pods in an application if previous > pods have failed to allocate. > When running Spark applications, if an executor pod fails to find a suitable > node, it is likely that subsequent executor pods will also fail to find > nodes. This is particularly problematic when the application has a toleration > for a specific taint and there are limited nodes with that taint. The > scheduler spends excessive time attempting to allocate pods, ultimately > resulting in no pods being bound to nodes. > To optimize scheduling, we should: > # Implement a check to determine if previous pods in the same application > were successfully allocated. > # Skip processing other pods in the application if previous pods failed to > allocate. > # Generalize this by: > ** Adding an immediate action for Spark applications. 
> ** Introducing a threshold ('n' number of pods) after which the scheduler > will stop trying and restart the scheduling cycle.
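The threshold idea in the YUNIKORN-2929 description can be made concrete with a short sketch. This illustrates the proposal only and is not existing YuniKorn behavior; `tryApplication`, `maxFailures`, and the `fits` callback are hypothetical names, and pods are reduced to plain strings.

```go
package main

import "fmt"

// tryApplication walks the pending asks of one application in order and stops
// once maxFailures asks have failed to find a node. It returns the asks that
// were placed and the number of asks that were never attempted.
// Hypothetical helper: real scheduling involves queues, nodes and predicates.
func tryApplication(asks []string, maxFailures int, fits func(string) bool) (placed []string, skipped int) {
	failures := 0
	for i, ask := range asks {
		if failures >= maxFailures {
			skipped = len(asks) - i // everything from here on is skipped
			break
		}
		if fits(ask) {
			placed = append(placed, ask)
		} else {
			failures++
		}
	}
	return placed, skipped
}

func main() {
	asks := []string{"exec-1", "exec-2", "exec-3", "exec-4", "exec-5"}
	// No node tolerates the taint in this example, so every attempt fails.
	placed, skipped := tryApplication(asks, 2, func(string) bool { return false })
	fmt.Println("placed:", len(placed), "skipped:", skipped)
}
```

The sketch also shows the trade-off raised in the comment: once the threshold is hit, later asks that might have fit elsewhere are never examined in that cycle.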
[jira] [Commented] (YUNIKORN-2929) Implement Skip Allocation Check for Unsuccessful Pods
[ https://issues.apache.org/jira/browse/YUNIKORN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889961#comment-17889961 ] Craig Condit commented on YUNIKORN-2929: I don’t think this is wise. It’s tempting to look at this from a Spark-centric perspective, but this pattern could be detrimental to other application types. There’s also potentially wide-ranging side effects from shorting a scheduling cycle and restarting too quickly. I’m not in favor of this change. > Implement Skip Allocation Check for Unsuccessful Pods > -- > > Key: YUNIKORN-2929 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2929 > Project: Apache YuniKorn > Issue Type: Task > Components: core - scheduler >Reporter: Mit Desai >Assignee: Mit Desai >Priority: Major > > Skip allocation attempts for subsequent pods in an application if previous > pods have failed to allocate. > When running Spark applications, if an executor pod fails to find a suitable > node, it is likely that subsequent executor pods will also fail to find > nodes. This is particularly problematic when the application has a toleration > for a specific taint and there are limited nodes with that taint. The > scheduler spends excessive time attempting to allocate pods, ultimately > resulting in no pods being bound to nodes. > To optimize scheduling, we should: > # Implement a check to determine if previous pods in the same application > were successfully allocated. > # Skip processing other pods in the application if previous pods failed to > allocate. > # Generalize this by: > ** Adding an immediate action for Spark applications. > ** Introducing a threshold ('n' number of pods) after which the scheduler > will stop trying and restart the scheduling cycle. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2910) Data corruption due to insufficient shim context locking
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888761#comment-17888761 ] Craig Condit commented on YUNIKORN-2910: I've started doing some log analysis of these. I haven't narrowed down the root cause yet, but this is interesting: {quote}2024-10-10T22:36:37.882Z INFO shim.cache.external external/scheduler_cache.go:311 Adding occupied resources to node {"nodeID": "amp-dp-prod-spark-exec-yk-1-node-group-b74a85d-h77rt", "namespace": "spark-system", "podName": "spark-history-server-deployment-078u1pfr-579dbd4b6d-6p6fz", "occupied": "resources:{key:\"ephemeral-storage\" value:{value:5368709120}} resources:{key:\"memory\" value:{value:75161927680}} resources:{key:\"pods\" value:{value:1}} resources:{key:\"vcore\" value:{value:2000}}"} 2024-10-10T22:36:37.882Z WARN core.scheduler.node objects/node.go:216 Node update triggered over allocated node {"available": "map[ephemeral-storage:1386189349332 memory:-60014637056 pods:724 vcore:14200 vpc.amazonaws.com/pod-eni:107]", "total": "map[ephemeral-storage:1448466375124 hugepages-1Gi:0 hugepages-2Mi:0 memory:523482255360 pods:737 vcore:63770 vpc.amazonaws.com/pod-eni:107]", "occupied": "map[ephemeral-storage:5368709120 memory:75214356480 pods:6 vcore:2100]", "allocated": "map[ephemeral-storage:56908316672 memory:508282535936 pods:7 vcore:47470]"} {quote} This would seem to indicate a bug on our end, but in fact it's correct. We receive an occupied resource update (for a non-YuniKorn pod) which blows past the node limits and overallocates memory on the node by ~60 GB (available memory is reported as -60014637056 bytes). Just prior to receiving that, we schedule a bunch of Spark executors on that node. Because the Spark history server is scheduled by a non-YuniKorn scheduler, we have a case where two schedulers both try to claim resources on the same node, and we over-allocate. There's no avoiding this due to the async nature of communication with the API server. 
What's interesting is that this situation never gets resolved. My guess is that KWOK's fake nodes don't reject placements with OutOfMemory or OutOfCPU like normal nodes. We don't see the allocations go away until the node is decommissioned later. In a real cluster, the pod rejections come back almost immediately. > Data corruption due to insufficient shim context locking > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.7.0, 1.6.1 > > Attachments: logs-1.6.0+2910, logs-1.6.0+2910+scale-down, > state-dump-1.6-context-locking-after-2.json, state-dump-after-1.5.2.json, > state-dump-after-1.6.0+2910.json > > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption.
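The figures in the WARN log line of this YUNIKORN-2910 comment can be checked by hand: for each resource type they are consistent with available = total - occupied - allocated. The sketch below redoes that arithmetic with the values copied from the log. The `available` and `logValues` helpers and the plain `map[string]int64` representation are assumptions made for illustration, not YuniKorn's resource types.

```go
package main

import "fmt"

// available computes total - occupied - allocated for each resource type in
// total. Hypothetical helper for illustration; missing map entries count as 0.
func available(total, occupied, allocated map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(total))
	for name, t := range total {
		out[name] = t - occupied[name] - allocated[name]
	}
	return out
}

// logValues returns the total, occupied and allocated quantities copied from
// the WARN log line (the zero-valued hugepages entries are omitted).
func logValues() (total, occupied, allocated map[string]int64) {
	total = map[string]int64{
		"ephemeral-storage":         1448466375124,
		"memory":                    523482255360,
		"pods":                      737,
		"vcore":                     63770,
		"vpc.amazonaws.com/pod-eni": 107,
	}
	occupied = map[string]int64{
		"ephemeral-storage": 5368709120,
		"memory":            75214356480,
		"pods":              6,
		"vcore":             2100,
	}
	allocated = map[string]int64{
		"ephemeral-storage": 56908316672,
		"memory":            508282535936,
		"pods":              7,
		"vcore":             47470,
	}
	return total, occupied, allocated
}

func main() {
	total, occupied, allocated := logValues()
	avail := available(total, occupied, allocated)
	fmt.Println("memory:", avail["memory"]) // prints "memory: -60014637056"
}
```

The negative memory value matches the `available` map in the log, which is why the warning reflects real over-allocation by two schedulers rather than a bookkeeping bug.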
[jira] [Resolved] (YUNIKORN-2917) Add additional buckets for latency histograms
[ https://issues.apache.org/jira/browse/YUNIKORN-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2917. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Add additional buckets for latency histograms > - > > Key: YUNIKORN-2917 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2917 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > The scheduling latency histograms are defined with a starting bucket of > 0.0001 (0.1ms) with a total of 6 buckets with a multiplier of 10. This gives > possible [0.0001, 0.001, 0.01, 0.1, 1, 10, +Inf] ranges. We should extend > this to 8 buckets so that +100 and +1000 seconds can be discerned.
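The bucket arithmetic in YUNIKORN-2917 can be reproduced with a few lines of Go. `exponentialBuckets` below is a hand-written stand-in so that the example has no dependencies; the Prometheus Go client offers `prometheus.ExponentialBuckets(start, factor, count)` with the same parameters, but whether the scheduler uses that helper is an assumption here.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: start, start*factor,
// start*factor^2, and so on. The implicit +Inf bucket is not included.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	bound := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, bound)
		bound *= factor
	}
	return buckets
}

func main() {
	// Six buckets top out at 10 seconds; anything slower lands in +Inf.
	fmt.Println(exponentialBuckets(0.0001, 10, 6))
	// Eight buckets add bounds at 100 and 1000 seconds.
	fmt.Println(exponentialBuckets(0.0001, 10, 8))
}
```

With six buckets, a 50-second and a 500-second scheduling cycle are indistinguishable; with eight, they fall into different buckets.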
[jira] [Resolved] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics
[ https://issues.apache.org/jira/browse/YUNIKORN-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2916. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Fix inconsistent resource bucketing in metrics > -- > > Key: YUNIKORN-2916 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2916 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > The metrics reporting for resource usage has histogram buckets for each 10% > window (0-10%, 10-20%, etc.) However, the 10-20% bucket has inconsistent > formatting, leading to potential confusion: > > {code:java} > var resourceUsageRangeBuckets = []string{ > "[0,10%]", > "(10%, 20%]", // extra space here > "(20%,30%]", > // ... > } {code}
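One way to rule out the kind of formatting drift described in YUNIKORN-2916 is to generate the labels from a single format string instead of typing them by hand. This is only a sketch of that idea, not the merged fix; `usageBucketLabels` is a hypothetical helper, and the labels beyond the three shown in the issue are assumed to follow the same pattern.

```go
package main

import "fmt"

// usageBucketLabels builds labels for equal-width percentage windows.
// The first window is closed at 0, all later ones are half-open.
// step is assumed to divide 100 evenly.
func usageBucketLabels(step int) []string {
	labels := make([]string, 0, 100/step)
	labels = append(labels, fmt.Sprintf("[0,%d%%]", step))
	for lo := step; lo < 100; lo += step {
		labels = append(labels, fmt.Sprintf("(%d%%,%d%%]", lo, lo+step))
	}
	return labels
}

func main() {
	for _, label := range usageBucketLabels(10) {
		fmt.Println(label)
	}
}
```

Because every label after the first comes from the same `Sprintf` call, a stray space cannot appear in only one of them.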
[jira] [Updated] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics
[ https://issues.apache.org/jira/browse/YUNIKORN-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2916: --- Priority: Minor (was: Major) > Fix inconsistent resource bucketing in metrics > -- > > Key: YUNIKORN-2916 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2916 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Minor > Labels: pull-request-available > Fix For: 1.7.0 > > > The metrics reporting for resource usage has histogram buckets for each 10% > window (0-10%, 10-20%, etc.) However, the 10-20% bucket has inconsistent > formatting, leading to potential confusion: > > {code:java} > var resourceUsageRangeBuckets = []string{ > "[0,10%]", > "(10%, 20%]", // extra space here > "(20%,30%]", > // ... > } {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2915) Scheduling latency metric should update even if no scheduling occurs
[ https://issues.apache.org/jira/browse/YUNIKORN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2915. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Scheduling latency metric should update even if no scheduling occurs > > > Key: YUNIKORN-2915 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2915 > Project: Apache YuniKorn > Issue Type: Improvement > Components: core - scheduler >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > The scheduler metric scheduling_latency_milliseconds is currently only > updated if an allocation actually occurs. This is not particularly useful, as > latency could be quite long but in the case where no scheduling was possible > after traversing all queues, no reporting is done, so the visible latency is > 0. This makes the metric difficult to use for analysis, as in a busy cluster, > scheduling can be very slow, but not show up on the histogram at all. > We should move the reporting of scheduling latency outside the check for an > allocation and report even scheduling runs with no result.
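The change requested in YUNIKORN-2915 amounts to recording the elapsed time unconditionally instead of inside the "allocation found" branch. A minimal sketch of that pattern follows; `latencyRecorder` and `runCycle` are hypothetical stand-ins for the real histogram metric and scheduling loop.

```go
package main

import (
	"fmt"
	"time"
)

// latencyRecorder stands in for a histogram metric and simply keeps samples.
type latencyRecorder struct {
	samples []time.Duration
}

func (r *latencyRecorder) observe(d time.Duration) {
	r.samples = append(r.samples, d)
}

// runCycle runs one scheduling attempt and always records how long it took.
// The deferred call fires whether or not attempt produced an allocation.
func runCycle(r *latencyRecorder, attempt func() bool) bool {
	start := time.Now()
	defer func() { r.observe(time.Since(start)) }()
	return attempt()
}

func main() {
	rec := &latencyRecorder{}
	runCycle(rec, func() bool { return true })  // a cycle that allocates
	runCycle(rec, func() bool { return false }) // a cycle that finds nothing
	fmt.Println("cycles recorded:", len(rec.samples)) // prints "cycles recorded: 2"
}
```

Using `defer` keeps the observation out of any result-dependent branch, so slow cycles that end without an allocation still show up in the histogram.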
[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888460#comment-17888460 ] Craig Condit commented on YUNIKORN-2895: Reopened to keep the discussion of the issue going. I'm not sure that the description or possible causes are accurate at this point. > Don't add duplicated allocation to node when the allocation ask fails > - > > Key: YUNIKORN-2895 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2895 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > > When I revisited the new update allocation logic, I found that a > duplicated allocation can be added to a node if the allocation is already > allocated: we try to add the allocation to the node again and don't > revert it.
[jira] [Reopened] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit reopened YUNIKORN-2895: > Don't add duplicated allocation to node when the allocation ask fails > - > > Key: YUNIKORN-2895 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2895 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > > When I revisited the new update allocation logic, I found that a > duplicated allocation can be added to a node if the allocation is already > allocated: we try to add the allocation to the node again and don't > revert it.
[jira] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895 ] Craig Condit deleted comment on YUNIKORN-2895: was (Author: ccondit): Closing this as it's not an issue that can occur in practice. > Don't add duplicated allocation to node when the allocation ask fails > - > > Key: YUNIKORN-2895 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2895 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > > When I revisited the new update allocation logic, I found that a > duplicated allocation can be added to a node if the allocation is already > allocated: we try to add the allocation to the node again and don't > revert it.
[jira] [Commented] (YUNIKORN-2910) Data corruption due to insufficient shim context locking
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888458#comment-17888458 ] Craig Condit commented on YUNIKORN-2910: [~shravan-achar] I think we have a second issue in play then. Do you happen to have a log dump from 1.6 with this patch? > Data corruption due to insufficient shim context locking > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.7.0, 1.6.1 > > Attachments: state-dump-1.6-context-locking-after-2.json, > state-dump-after-1.5.2.json > > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2917) Add additional buckets for latency histograms
Craig Condit created YUNIKORN-2917: -- Summary: Add additional buckets for latency histograms Key: YUNIKORN-2917 URL: https://issues.apache.org/jira/browse/YUNIKORN-2917 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Craig Condit Assignee: Craig Condit The scheduling latency histograms are defined with a starting bucket of 0.0001 (0.1ms) with a total of 6 buckets with a multiplier of 10. This gives possible [0.0001, 0.001, 0.01, 0.1, 1, 10, +Inf] ranges. We should extend this to 8 buckets so that +100 and +1000 seconds can be discerned. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2916) Fix inconsistent resource bucketing in metrics
Craig Condit created YUNIKORN-2916: -- Summary: Fix inconsistent resource bucketing in metrics Key: YUNIKORN-2916 URL: https://issues.apache.org/jira/browse/YUNIKORN-2916 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Reporter: Craig Condit Assignee: Craig Condit The metrics reporting for resource usage has histogram buckets for each 10% window (0-10%, 10-20%, etc.) However, the 10-20% bucket has inconsistent formatting, leading to potential confusion: {code:java} var resourceUsageRangeBuckets = []string{ "[0,10%]", "(10%, 20%]", // extra space here "(20%,30%]", // ... } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2915) Scheduling latency metric should update even if no scheduling occurs
Craig Condit created YUNIKORN-2915: -- Summary: Scheduling latency metric should update even if no scheduling occurs Key: YUNIKORN-2915 URL: https://issues.apache.org/jira/browse/YUNIKORN-2915 Project: Apache YuniKorn Issue Type: Improvement Components: core - scheduler Reporter: Craig Condit Assignee: Craig Condit The scheduler metric scheduling_latency_milliseconds is currently only updated if an allocation actually occurs. This is not particularly useful, as latency could be quite long but in the case where no scheduling was possible after traversing all queues, no reporting is done, so the visible latency is 0. This makes the metric difficult to use for analysis, as in a busy cluster, scheduling can be very slow, but not show up on the histogram at all. We should move the reporting of scheduling latency outside the check for an allocation and report even scheduling runs with no result. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed
[ https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888422#comment-17888422 ] Craig Condit commented on YUNIKORN-2908: YUNIKORN-2855 had an incomplete fix. As we've looked at it further, it is subtly broken: it doesn't take into account the {{state}} parameter when calculating the already-seen resources. We should probably rebuild that functionality to use the built-in {{Describe()}} method to iterate all the existing values and remove those where the state matches but we don't have a new value. This is not a simple change. > metrics not removed when queue or queue's guaranteed/max resource config is > removed > --- > > Key: YUNIKORN-2908 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2908 > Project: Apache YuniKorn > Issue Type: Bug >Reporter: Hengzhe Guo >Assignee: Hengzhe Guo >Priority: Major > > 1. After a queue is removed, its metrics continue to be reported by > Prometheus. This is fine for metrics like allocated resources because they > will just be 0, but it doesn't make sense for guaranteed and max resources, > as it gives the wrong impression that resources are still assigned to the queue. I > propose to unregister all of the queue's metrics when it is removed. > 2. If the queue is not removed but its guaranteed or max resource config is removed, > or just one resource type is removed from the config, the metrics are also not > cleaned up. These metrics are only updated when there is a new valid value, > not when the value is 'null'. I propose to always delete all existing guaranteed and > max resource metrics of the queue and then add back the new values every time > we apply the configs.
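The cleanup rule from the YUNIKORN-2908 comment (remove existing values where the state matches but no new value was supplied) can be illustrated without any metrics library. In this sketch a plain map stands in for the Prometheus gauge vector; `metricKey`, `queueMetrics`, and `setResources` are hypothetical names, and nothing here reflects how the scheduler actually stores its metrics.

```go
package main

import "fmt"

// metricKey identifies one reported value of a queue.
type metricKey struct {
	state    string // for example "guaranteed" or "max"
	resource string // for example "memory" or "vcore"
}

// queueMetrics is a stand-in for a labelled gauge vector.
type queueMetrics struct {
	values map[metricKey]float64
}

// setResources applies an update for a single state. Entries of that state
// which are absent from update are deleted; entries of other states are kept.
func (m *queueMetrics) setResources(state string, update map[string]float64) {
	for key := range m.values {
		if key.state != state {
			continue // taking the state into account is the crucial part
		}
		if _, ok := update[key.resource]; !ok {
			delete(m.values, key) // deleting while ranging over a map is safe in Go
		}
	}
	for resource, value := range update {
		m.values[metricKey{state: state, resource: resource}] = value
	}
}

func main() {
	m := &queueMetrics{values: map[metricKey]float64{}}
	m.setResources("guaranteed", map[string]float64{"memory": 100, "vcore": 10})
	m.setResources("max", map[string]float64{"memory": 200})
	// The memory guarantee is dropped from the config; only vcore remains.
	m.setResources("guaranteed", map[string]float64{"vcore": 20})
	fmt.Println("values reported:", len(m.values)) // prints "values reported: 2"
}
```

Ignoring the state in the deletion loop would also wipe the `max` entry for memory in the last call.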
[jira] [Closed] (YUNIKORN-2914) Update deployment documentation for extra description for hot reload
[ https://issues.apache.org/jira/browse/YUNIKORN-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2914. -- Target Version: (was: 1.7.0) Resolution: Not A Problem Closing as the current documentation is correct. > Update deployment documentation for extra description for hot reload > > > Key: YUNIKORN-2914 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2914 > Project: Apache YuniKorn > Issue Type: Improvement > Components: documentation >Reporter: Yao >Assignee: Yao >Priority: Minor > Labels: pull-request-available > > I just saw a user asking about ConfigMap hot reload in the > Slack channel; see > [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1728327140557209]. > I also checked the hot refresh part of YuniKorn's docs, and I think not > everyone knows that if the ConfigMap is mounted using a subPath, hot > reload will not be triggered. > Therefore, I want to add an extra description of this.
[jira] [Closed] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit closed YUNIKORN-2895. -- Target Version: (was: 1.7.0) Resolution: Not A Problem Closing this as it's not an issue that can occur in practice. > Don't add duplicated allocation to node when the allocation ask fails > - > > Key: YUNIKORN-2895 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2895 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Labels: pull-request-available > > When I revisited the new update allocation logic, I found that a > duplicated allocation can be added to a node if the allocation is already > allocated: we try to add the allocation to the node again and don't > revert it.
[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888355#comment-17888355 ] Craig Condit commented on YUNIKORN-2895: {quote}I think the issue is located in the maintenance of the {{sortedRequests}} on the application. That list used to be rebuild each cycle but now we insert/delete from the slice. During recovery I think we broke things. Recovery is using the same path as a node addition so this *could* happen on any node add or maybe even on a simple add of an new ask. {quote} This is all controlled from the shim by ensuring that nodes are added first, and then referenced pods are added afterwards. The regression fixed in YUNIKORN-2910 should address this. With that fix, it's not possible for these to be handled out-of-order. Additionally, the {{sortedRequests}} slice is only ever updated in lockstep with the requests map (I have verified this in the code). {quote}{*}First issue{*}: if the old ask _IS_ allocated we will still replace that allocation with the new one in the requests map. We skip adjusting the pending resources using the already registered ask. This is where it breaks down: the requests list should never contain already allocated objects. It means we have a reference leak, and thus a memory leak. Long after the allocation is removed a reference will be kept in requests that will not get removed until we clean up the application. The GC will thus not remove it. For long running applications with lots of requests this can become significant. {quote} This is false. We never replace the allocation; we check for an existing one and update as necessary. The new allocation passed in (from the SI) is only ever read from. It is *never* stored in requests (or allocatedRequests) unless that allocationKey had never been seen before. The request list maintains all requests, whether satisfied or not, as does sortedRequests. 
This allows us to check for an allocation in only one place, and increases memory only by the size of a pointer for each one. After 1.6.0, asks and allocations are no longer distinct objects, so we are simply storing either 2 or 3 references to the same object: 2 in the not-yet-allocated case (requests and sortedRequests), and 3 in the allocated case (requests, sortedRequests, allocatedRequests). There is no memory leak. When an allocation goes away (if it goes into a terminal state), it is removed {*}from all three places{*}. This doesn't only happen on application termination. We have to keep the allocations around for the lifetime of their associated pods anyway due to things like mutable pod resources coming in the near future: an allocation's size can change after it has been scheduled. {quote}{*}Second issue{*}: Caused by the replacement also. The new object is not marked allocated which causes a big problem as we will try and schedule it. We now could have an unallocated and an allocated object with the same key one in requests and one in allocations. After we schedule the second one the allocations list will be updated and we lose the original info. {quote} This is also false. We never *replace* an allocation object if one already exists. We examine the state and may update resource requests or transition to allocated state based on the deltas between the existing and new objects. This is what allows the shim to notify us that an allocation has been satisfied outside YuniKorn. {quote}{*}Third issue{*}: independent of the state we proceed to add the ask to the requests. The requests are stored in a map based on the allocation key. Which means we are always only tracking a single ask. Never any duplicates. The sorted requests however is a sorted slice of references to objects. There is no checks in the add into the sorted request slice to replace the existing entry. We will happily add a second one to the slice. 
Two objects same key they are both considered when scheduling which means we can easily cause issues there. {quote} Also false. We check the state very carefully and only add to requests (and sortedRequests) if we *have never seen that allocationKey.* The only place sortedRequests is added to is when requests didn't contain that object. The checks are not needed in sortedRequests because the pre-checks before updating in the requests map already ensure that duplicates don't happen. > Don't add duplicated allocation to node when the allocation ask fails > - > > Key: YUNIKORN-2895 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2895 > Project: Apache YuniKorn > Issue Type: Bug > Components: core - scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > > When i try to revisit the new u
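The bookkeeping rules stated in this YUNIKORN-2895 comment (store an object only when its allocationKey has never been seen, update the existing object otherwise, and remove it from all three places when it goes away) can be sketched as follows. The types are heavily simplified and hypothetical: `sortedRequests` is kept in insertion order rather than sorted, and only the names `requests`, `sortedRequests`, and `allocatedRequests` come from the comment.

```go
package main

import "fmt"

type allocation struct {
	key       string
	size      int64
	allocated bool
}

type application struct {
	requests          map[string]*allocation // every known allocation, by key
	sortedRequests    []*allocation          // same objects, ordered for scheduling
	allocatedRequests map[string]*allocation // subset that is already allocated
}

func newApplication() *application {
	return &application{
		requests:          map[string]*allocation{},
		allocatedRequests: map[string]*allocation{},
	}
}

// upsert only reads from the incoming value. An existing object is updated in
// place; a new object is stored only if the key has never been seen.
func (a *application) upsert(in allocation) {
	if existing, ok := a.requests[in.key]; ok {
		existing.size = in.size
		if in.allocated && !existing.allocated {
			existing.allocated = true
			a.allocatedRequests[in.key] = existing
		}
		return
	}
	stored := &allocation{key: in.key, size: in.size, allocated: in.allocated}
	a.requests[in.key] = stored
	a.sortedRequests = append(a.sortedRequests, stored)
	if stored.allocated {
		a.allocatedRequests[in.key] = stored
	}
}

// remove drops the allocation from all three places.
func (a *application) remove(key string) {
	delete(a.requests, key)
	delete(a.allocatedRequests, key)
	for i, r := range a.sortedRequests {
		if r.key == key {
			a.sortedRequests = append(a.sortedRequests[:i], a.sortedRequests[i+1:]...)
			break
		}
	}
}

func main() {
	app := newApplication()
	app.upsert(allocation{key: "pod-1", size: 1})
	app.upsert(allocation{key: "pod-1", size: 2, allocated: true}) // same key again
	fmt.Println(len(app.requests), len(app.sortedRequests), len(app.allocatedRequests)) // prints "1 1 1"
}
```

Because the slice is only appended to in the branch where the map lookup failed, the map check is what prevents duplicates in the slice, and all three containers hold pointers to one shared object.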
[jira] [Resolved] (YUNIKORN-2910) Data corruption due to insufficient shim context locking
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2910. Fix Version/s: 1.7.0 1.6.1 Resolution: Fixed Merged to master and cherry-picked to branch-1.6. > Data corruption due to insufficient shim context locking > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.7.0, 1.6.1 > > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption.
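The hazard described in YUNIKORN-2910 is the usual one for shared state touched by several event handlers. The sketch below shows the general remedy of serializing pod and node handlers behind one lock. It is not the shim's code: `shimContext`, its methods, and the single `sync.Mutex` are assumptions chosen to keep the example small.

```go
package main

import (
	"fmt"
	"sync"
)

// shimContext holds state that both node and pod event handlers modify.
type shimContext struct {
	lock  sync.Mutex
	nodes map[string]int // node ID -> number of pods recorded on that node
}

func newShimContext() *shimContext {
	return &shimContext{nodes: map[string]int{}}
}

// handleNodeAdd registers a node. It holds the same lock as the pod handler.
func (c *shimContext) handleNodeAdd(nodeID string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.nodes[nodeID]; !ok {
		c.nodes[nodeID] = 0
	}
}

// handlePodAdd records a pod on a node and reports whether the node was known.
func (c *shimContext) handlePodAdd(nodeID string) bool {
	c.lock.Lock()
	defer c.lock.Unlock()
	if _, ok := c.nodes[nodeID]; !ok {
		return false
	}
	c.nodes[nodeID]++
	return true
}

// podCount returns the number of pods recorded for a node.
func (c *shimContext) podCount(nodeID string) int {
	c.lock.Lock()
	defer c.lock.Unlock()
	return c.nodes[nodeID]
}

// addPodsConcurrently fires pod events from many goroutines at once.
func addPodsConcurrently(c *shimContext, nodeID string, pods int) {
	var wg sync.WaitGroup
	for i := 0; i < pods; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.handlePodAdd(nodeID)
		}()
	}
	wg.Wait()
}

func main() {
	c := newShimContext()
	c.handleNodeAdd("node-1")
	addPodsConcurrently(c, "node-1", 100)
	fmt.Println("pods on node-1:", c.podCount("node-1")) // prints "pods on node-1: 100"
}
```

Without the lock, the concurrent map writes in `handlePodAdd` would be a data race, and the Go runtime may abort the program with a concurrent map write error.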
[jira] [Updated] (YUNIKORN-2910) Data corruption due to insufficient shim context locking
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2910: --- Target Version: 1.7.0, 1.6.1 (was: 1.7.0) > Data corruption due to insufficient shim context locking > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > Labels: pull-request-available > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed
[ https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888096#comment-17888096 ]

Craig Condit commented on YUNIKORN-2908:
----------------------------------------

This is actually much more complex than it initially appears. We should split this Jira up into separate tasks for queue deletion and guaranteed / pending / max changing. The queue removal can simply be removal of the entire metrics object. The dynamic updates for the other metrics are more complex. I'd prefer to take that task myself.

> metrics not removed when queue or queue's guaranteed/max resource config is removed
> -----------------------------------------------------------------------------------
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Hengzhe Guo
> Assignee: Hengzhe Guo
> Priority: Major
>
> 1. After a queue is removed, its metrics continue to be reported by Prometheus. This is fine for metrics like allocated resource because they will just be 0, but it doesn't make sense for guaranteed and max resources: it gives the wrong impression that resources are still assigned to the queue. I propose to unregister all of the queue's metrics when it is removed.
> 2. If the queue is not removed but its guaranteed or max resource config is removed, or just one resource type is removed from the config, the metrics are also not cleaned up. These metrics are only updated when there is a new valid value, not when the value is 'null'. I propose to always delete all existing guaranteed and max resource metrics of the queue and then add back the new values every time we apply the configs.
[jira] [Comment Edited] (YUNIKORN-2908) metrics not removed when queue or queue's guaranteed/max resource config is removed
[ https://issues.apache.org/jira/browse/YUNIKORN-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888096#comment-17888096 ]

Craig Condit edited comment on YUNIKORN-2908 at 10/9/24 11:56 PM:
------------------------------------------------------------------

This is actually much more complex than it initially appears. We should split this Jira up into separate tasks for queue deletion and guaranteed / pending / max changing. The queue removal can simply be removal of the entire metrics object. The dynamic updates for the other metrics are more complex. I'd prefer to take that task myself. [~hguo25] can you split this out please?

was (Author: ccondit):
This is actually much more complex than it initially appears. We should split this Jira up into separate tasks for queue deletion and guaranteed / pending / max changing. The queue removal can simply be removal of the entire metrics object. The dynamic updates for the other metrics are more complex. I'd prefer to take that task myself.

> metrics not removed when queue or queue's guaranteed/max resource config is removed
> -----------------------------------------------------------------------------------
>
> Key: YUNIKORN-2908
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2908
> Project: Apache YuniKorn
> Issue Type: Bug
> Reporter: Hengzhe Guo
> Assignee: Hengzhe Guo
> Priority: Major
>
> 1. After a queue is removed, its metrics continue to be reported by Prometheus. This is fine for metrics like allocated resource because they will just be 0, but it doesn't make sense for guaranteed and max resources: it gives the wrong impression that resources are still assigned to the queue. I propose to unregister all of the queue's metrics when it is removed.
> 2. If the queue is not removed but its guaranteed or max resource config is removed, or just one resource type is removed from the config, the metrics are also not cleaned up. These metrics are only updated when there is a new valid value, not when the value is 'null'. I propose to always delete all existing guaranteed and max resource metrics of the queue and then add back the new values every time we apply the configs.
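Editor's sketch: the two proposals in the YUNIKORN-2908 description (drop everything on queue removal; on config change, delete the existing guaranteed/max values and add back only the new ones) can be illustrated with plain maps standing in for labelled gauges. The `metricsRegistry` type and its `ApplyConfig` and `RemoveQueue` methods are hypothetical, and the sketch deliberately ignores the extra complexity that the comment above says the dynamic updates involve.

```go
package main

import "fmt"

// queueMetrics is a hypothetical stand-in for one queue's gauges,
// keyed by resource name (for example "memory" or "vcore").
type queueMetrics struct {
	guaranteed  map[string]float64
	maxResource map[string]float64
}

// metricsRegistry is a hypothetical registry of per-queue metrics.
type metricsRegistry struct {
	queues map[string]*queueMetrics
}

func newMetricsRegistry() *metricsRegistry {
	return &metricsRegistry{queues: make(map[string]*queueMetrics)}
}

// cloneValues copies the configured values; a nil input yields an empty map.
func cloneValues(in map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(in))
	for name, value := range in {
		out[name] = value
	}
	return out
}

// ApplyConfig discards all existing guaranteed/max values for the queue
// and stores only what the new config contains, so removed resource
// types do not linger with stale values.
func (r *metricsRegistry) ApplyConfig(queue string, guaranteed, maxRes map[string]float64) {
	r.queues[queue] = &queueMetrics{
		guaranteed:  cloneValues(guaranteed),
		maxResource: cloneValues(maxRes),
	}
}

// RemoveQueue drops the whole metrics object for a deleted queue.
func (r *metricsRegistry) RemoveQueue(queue string) {
	delete(r.queues, queue)
}

func main() {
	reg := newMetricsRegistry()
	reg.ApplyConfig("root.dev",
		map[string]float64{"memory": 10, "vcore": 2},
		map[string]float64{"memory": 20})
	// New config no longer sets guaranteed memory or any max resource.
	reg.ApplyConfig("root.dev", map[string]float64{"vcore": 2}, nil)
	fmt.Println(reg.queues["root.dev"].guaranteed, reg.queues["root.dev"].maxResource)
	reg.RemoveQueue("root.dev")
	fmt.Println(len(reg.queues))
}
```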
[jira] [Updated] (YUNIKORN-2911) Add kind-e2e makefile target
[ https://issues.apache.org/jira/browse/YUNIKORN-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2911: --- Priority: Minor (was: Major) > Add kind-e2e makefile target > > > Key: YUNIKORN-2911 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2911 > Project: Apache YuniKorn > Issue Type: Improvement > Components: shim - kubernetes >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Minor > Labels: pull-request-available > > Add a simple kind-e2e Makefile target to yunikorn-k8shim. This would spin up > a kind cluster (on the latest version), run the e2e tests, then tear down the > cluster. This would be easier (especially for new users) than running the > scripts/run-e2e-tests.sh script directly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2754) Update grafana UI in yunikorn-metric docs
[ https://issues.apache.org/jira/browse/YUNIKORN-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2754. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Update grafana UI in yunikorn-metric docs > -- > > Key: YUNIKORN-2754 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2754 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Chen Yu Teng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2754) Update grafana UI in yunikorn-metric docs
[ https://issues.apache.org/jira/browse/YUNIKORN-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2754: --- Summary: Update grafana UI in yunikorn-metric docs (was: Update doc grafana UI of yunikorn-metric) > Update grafana UI in yunikorn-metric docs > -- > > Key: YUNIKORN-2754 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2754 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Chen Yu Teng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2723) Wordwrap queuename in QueuesV2 page
[ https://issues.apache.org/jira/browse/YUNIKORN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2723: --- Summary: Wordwrap queuename in QueuesV2 page (was: Wordwrap queuename in QueuesV2 (Beta) page) > Wordwrap queuename in QueuesV2 page > --- > > Key: YUNIKORN-2723 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2723 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: Manikandan R >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Please see attached image (captured from Mac M1 chrome) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2723) Wordwrap queuename in QueuesV2 (Beta) page
[ https://issues.apache.org/jira/browse/YUNIKORN-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2723. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Wordwrap queuename in QueuesV2 (Beta) page > -- > > Key: YUNIKORN-2723 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2723 > Project: Apache YuniKorn > Issue Type: Bug > Components: webapp >Reporter: Manikandan R >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > Please see attached image (captured from Mac M1 chrome) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2910) Data corruption due to insufficient shim context locking
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2910: --- Summary: Data corruption due to insufficient shim context locking (was: Multiple events may be processed by shim context) > Data corruption due to insufficient shim context locking > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2910) Multiple events may be processed by shim context
[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2910: --- Priority: Blocker (was: Major) > Multiple events may be processed by shim context > > > Key: YUNIKORN-2910 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 > Project: Apache YuniKorn > Issue Type: Bug > Components: shim - kubernetes >Affects Versions: 1.6.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Blocker > > We need to restore the context locking that was removed in YUNIKORN-2629. > Without it, multiple K8s events of different types may be processed in > parallel. Specifically, pod and node events being processed simultaneously is > not safe, and results in data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2910) Multiple events may be processed by shim context
Craig Condit created YUNIKORN-2910: -- Summary: Multiple events may be processed by shim context Key: YUNIKORN-2910 URL: https://issues.apache.org/jira/browse/YUNIKORN-2910 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Affects Versions: 1.6.0 Reporter: Craig Condit Assignee: Craig Condit We need to restore the context locking that was removed in YUNIKORN-2629. Without it, multiple K8s events of different types may be processed in parallel. Specifically, pod and node events being processed simultaneously is not safe, and results in data corruption. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2753) Update yunikorn-metrics grafana dashboard
[ https://issues.apache.org/jira/browse/YUNIKORN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2753. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Update yunikorn-metrics grafana dashboard > - > > Key: YUNIKORN-2753 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2753 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Chen Yu Teng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2753) Update yunikorn-metrics grafana dashboard
[ https://issues.apache.org/jira/browse/YUNIKORN-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2753: --- Summary: Update yunikorn-metrics grafana dashboard (was: Update grafana context and json) > Update yunikorn-metrics grafana dashboard > - > > Key: YUNIKORN-2753 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2753 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Chen Yu Teng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > > https://github.com/apache/yunikorn-k8shim/tree/master/deployments/grafana-dashboard -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2844) Inject event recorder externally
[ https://issues.apache.org/jira/browse/YUNIKORN-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2844.
------------------------------------
    Fix Version/s: 1.7.0
       Resolution: Fixed

Merged to master.

> Inject event recorder externally
> --------------------------------
>
> Key: YUNIKORN-2844
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2844
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.7.0
>
> The current implementation creates an event recorder like this:
> {noformat}
> func GetRecorder() events.EventRecorder {
> 	lock.Lock()
> 	defer lock.Unlock()
> 	once.Do(func() {
> 		// note, the initiation of the event recorder requires on a workable Kubernetes client,
> 		// in test mode we should skip this and just use a fake recorder instead.
> 		configs := conf.GetSchedulerConf()
> 		if !configs.IsTestMode() {
> 			k8sClient := client.NewKubeClient(configs.KubeConfig)
> 			eventBroadcaster := events.NewBroadcaster(&events.EventSinkImpl{
> 				Interface: k8sClient.GetClientSet().EventsV1()})
> 			eventBroadcaster.StartRecordingToSink(make(<-chan struct{}))
> 			eventRecorder = eventBroadcaster.NewRecorder(scheme.Scheme, constants.SchedulerName)
> 		}
> 	})
> 	return eventRecorder
> }
> {noformat}
> The problem with this approach is that we need to indicate "test mode" in the config, which just complicates things.
> We can simplify this code if the recorder is set during YuniKorn initialization, e.g. in {{NewShimScheduler()}}. The plugin code already does this in {{NewSchedulerPlugin()}} and calls {{events.SetRecorder(handle.EventRecorder())}}.
> We should also get rid of the default fake recorder. It uses a buffered channel with a size of 1024. This isn't a problem now, but if a new test somehow ends up generating a lot of events, message sending will block. It might not be obvious to someone why a new or existing unit test suddenly starts to block.
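Editor's sketch: the injection approach proposed in YUNIKORN-2844 (the caller sets the recorder once during initialization instead of `GetRecorder()` building one lazily behind a "test mode" check) might look roughly like this. The `EventRecorder` interface here is a hypothetical one-method stand-in, not the Kubernetes `events.EventRecorder`, and `sliceRecorder` is an assumed test double that stores events in a slice so it cannot block the way a full buffered channel can.

```go
package main

import (
	"fmt"
	"sync"
)

// EventRecorder is a hypothetical minimal interface used only for this sketch.
type EventRecorder interface {
	Record(reason, message string)
}

var (
	recorderLock sync.RWMutex
	recorder     EventRecorder
)

// SetRecorder injects the recorder; production code and tests each
// pass their own implementation during initialization.
func SetRecorder(r EventRecorder) {
	recorderLock.Lock()
	defer recorderLock.Unlock()
	recorder = r
}

// GetRecorder returns whatever was injected; it performs no lazy
// initialization and has no knowledge of a test mode.
func GetRecorder() EventRecorder {
	recorderLock.RLock()
	defer recorderLock.RUnlock()
	return recorder
}

// sliceRecorder is a test double backed by a growable slice.
type sliceRecorder struct {
	mu     sync.Mutex
	events []string
}

func (s *sliceRecorder) Record(reason, message string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, reason+": "+message)
}

func main() {
	rec := &sliceRecorder{}
	SetRecorder(rec) // done once, e.g. at scheduler start-up or in test setup
	GetRecorder().Record("Scheduled", "pod assigned to node")
	fmt.Println(rec.events)
}
```

A design consequence of this shape is that a caller who forgets to call `SetRecorder` gets a nil recorder, so real code would need to decide how to handle that case.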
[jira] [Commented] (YUNIKORN-2895) Don't add duplicated allocation to node when the allocation ask fails
[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887898#comment-17887898 ]

Craig Condit commented on YUNIKORN-2895:
----------------------------------------

I suspect all of these issues trace back to this PR: [https://github.com/apache/yunikorn-k8shim/pull/859]. It would be very helpful if anyone who is currently seeing this could rebuild 1.6.0 with the PR reverted and report back.

> Don't add duplicated allocation to node when the allocation ask fails
> ---------------------------------------------------------------------
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Critical
>
> When I revisited the new update allocation logic, I found that a duplicate allocation can be added to a node if the allocation is already allocated: we try to add the allocation to the node again and don't revert it.
[jira] [Updated] (YUNIKORN-2905) Update deployment documentation for make image
[ https://issues.apache.org/jira/browse/YUNIKORN-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit updated YUNIKORN-2905:
-----------------------------------
    Summary: Update deployment documentation for make image  (was: Update deployment documentation for outdated make image command description)

> Update deployment documentation for make image
> ----------------------------------------------
>
> Key: YUNIKORN-2905
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2905
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: documentation
> Reporter: Yao
> Assignee: Yao
> Priority: Minor
> Labels: pull-request-available
>
> When I tried to do the same thing in the deployment section, I found that the description for the make image command was outdated.
> For example, the IMAGE_TAG variable no longer exists in the Makefile.
> Therefore, I've referred to yunikorn-k8shim's readme.md and its Makefile, and rewritten the documentation to make it clearer.
[jira] [Resolved] (YUNIKORN-2905) Update deployment documentation for make image
[ https://issues.apache.org/jira/browse/YUNIKORN-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2905.
------------------------------------
    Fix Version/s: 1.7.0
       Resolution: Fixed

Merged to master.

> Update deployment documentation for make image
> ----------------------------------------------
>
> Key: YUNIKORN-2905
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2905
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: documentation
> Reporter: Yao
> Assignee: Yao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.7.0
>
> When I tried to do the same thing in the deployment section, I found that the description for the make image command was outdated.
> For example, the IMAGE_TAG variable no longer exists in the Makefile.
> Therefore, I've referred to yunikorn-k8shim's readme.md and its Makefile, and rewritten the documentation to make it clearer.
[jira] [Resolved] (YUNIKORN-2904) Add Helm download and usage links
[ https://issues.apache.org/jira/browse/YUNIKORN-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2904.
------------------------------------
    Fix Version/s: 1.7.0
       Resolution: Fixed

Merged to master.

> Add Helm download and usage links
> ---------------------------------
>
> Key: YUNIKORN-2904
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2904
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: documentation
> Reporter: Yao
> Assignee: Yao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.7.0
>
> I am a newbie who just started using YuniKorn. Although I have some experience with K8s, as far as I know, not everyone is using Helm charts to manage their applications.
> Therefore, I added a link to Helm's download page and a short description of Helm for those who don't know what it is.
[jira] [Updated] (YUNIKORN-2904) Add Helm download and usage links
[ https://issues.apache.org/jira/browse/YUNIKORN-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit updated YUNIKORN-2904:
-----------------------------------
    Summary: Add Helm download and usage links  (was: Update get_started documentation for Helm and Helm's download link)

> Add Helm download and usage links
> ---------------------------------
>
> Key: YUNIKORN-2904
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2904
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: documentation
> Reporter: Yao
> Assignee: Yao
> Priority: Minor
> Labels: pull-request-available
>
> I am a newbie who just started using YuniKorn. Although I have some experience with K8s, as far as I know, not everyone is using Helm charts to manage their applications.
> Therefore, I added a link to Helm's download page and a short description of Helm for those who don't know what it is.
[jira] [Resolved] (YUNIKORN-2899) Update node version to 20.17 and update packages
[ https://issues.apache.org/jira/browse/YUNIKORN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2899.
------------------------------------
    Fix Version/s: 1.7.0
   Target Version: 1.7.0
       Resolution: Fixed

Merged to master.

> Update node version to 20.17 and update packages
> ------------------------------------------------
>
> Key: YUNIKORN-2899
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2899
> Project: Apache YuniKorn
> Issue Type: Improvement
> Reporter: Hsien-Cheng(Ryan) Huang
> Assignee: Hsien-Cheng(Ryan) Huang
> Priority: Minor
> Fix For: 1.7.0
>
> 1. Browserslist: caniuse-lite is outdated. Please run:
> npx update-browserslist-db@latest
> Why you should do it regularly: https://github.com/browserslist/update-db#readme
> 2. Update Docusaurus to v3.5.2.
> 3. Resolve security warnings.
> 4. Update the node version to the latest LTS version.
> 5. Add a `pnpm run serve` command for serving the local build.
[jira] [Updated] (YUNIKORN-2899) Update node version to 20.17 and update packages
[ https://issues.apache.org/jira/browse/YUNIKORN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit updated YUNIKORN-2899:
-----------------------------------
    Summary: Update node version to 20.17 and update packages  (was: chore: update node version to LTS, update package and solve warnings)

> Update node version to 20.17 and update packages
> ------------------------------------------------
>
> Key: YUNIKORN-2899
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2899
> Project: Apache YuniKorn
> Issue Type: Improvement
> Reporter: Hsien-Cheng(Ryan) Huang
> Assignee: Hsien-Cheng(Ryan) Huang
> Priority: Minor
>
> 1. Browserslist: caniuse-lite is outdated. Please run:
> npx update-browserslist-db@latest
> Why you should do it regularly: https://github.com/browserslist/update-db#readme
> 2. Update Docusaurus to v3.5.2.
> 3. Resolve security warnings.
> 4. Update the node version to the latest LTS version.
> 5. Add a `pnpm run serve` command for serving the local build.
[jira] [Resolved] (YUNIKORN-2885) Fix security vulnerabilities in dependencies
[ https://issues.apache.org/jira/browse/YUNIKORN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2885. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Fix security vulnerabilities in dependencies > > > Key: YUNIKORN-2885 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2885 > Project: Apache YuniKorn > Issue Type: Improvement > Components: webapp >Reporter: JunHong Peng >Assignee: JunHong Peng >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > {{pnpm audit}} report: > [audit-report.md|https://github.com/user-attachments/files/17089735/audit-report.md] > 26 vulnerabilities found > Severity: 12 moderate | 14 high > After Upgrade Angular v18 (#YUNIKORN-2861) Audit Report: > [audit-report.md|https://github.com/user-attachments/files/17164041/audit-report.md] > 8 vulnerabilities found > Severity: 3 moderate | 5 high -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Resolved] (YUNIKORN-2792) Create design doc
[ https://issues.apache.org/jira/browse/YUNIKORN-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2792. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Create design doc > - > > Key: YUNIKORN-2792 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2792 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Attachments: YUNIKORN-2791.pdf > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Updated] (YUNIKORN-2892) Log correct termination type when releasing task in shim
[ https://issues.apache.org/jira/browse/YUNIKORN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit updated YUNIKORN-2892:
-----------------------------------
    Summary: Log correct termination type when releasing task in shim  (was: Log correct log for terminate type when releasing task from shim side)

> Log correct termination type when releasing task in shim
> --------------------------------------------------------
>
> Key: YUNIKORN-2892
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2892
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Minor
>
> Currently we log an empty termination type when releasing a task from the shim side. We should fix this so that the log is consistent with the real termination type.
[jira] [Resolved] (YUNIKORN-2892) Log correct termination type when releasing task in shim
[ https://issues.apache.org/jira/browse/YUNIKORN-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2892.
------------------------------------
    Fix Version/s: 1.7.0
       Resolution: Fixed

Merged to master.

> Log correct termination type when releasing task in shim
> --------------------------------------------------------
>
> Key: YUNIKORN-2892
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2892
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Minor
> Fix For: 1.7.0
>
> Currently we log an empty termination type when releasing a task from the shim side. We should fix this so that the log is consistent with the real termination type.
[jira] [Resolved] (YUNIKORN-2893) Apply "cancel-in-progress" feature for each PR in Github Action
[ https://issues.apache.org/jira/browse/YUNIKORN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2893. Fix Version/s: 1.7.0 Resolution: Fixed Merged all PRs to master branches. > Apply "cancel-in-progress" feature for each PR in Github Action > --- > > Key: YUNIKORN-2893 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2893 > Project: Apache YuniKorn > Issue Type: Improvement > Components: build >Reporter: Yu-Lin Chen >Assignee: Tzu-Hua Lan >Priority: Major > Fix For: 1.7.0 > > > Currently, when a newer commit is pushed to the same PR, the previous build > is not canceled. To save cost, we should apply the "cancel-in-progress" to > automatically cancel previous build. > REF: > [https://docs.github.com/en/enterprise-cloud@latest/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs#using-concurrency-in-different-scenarios] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-2884) Task fail with post allocated but the pod will keep pending
[ https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885474#comment-17885474 ] Craig Condit commented on YUNIKORN-2884: I'm not sure terminating the task is how we should handle this, as that makes debugging (via events) difficult. We should be looking into how to recover from this and re-schedule the pod. > Task fail with post allocated but the pod will keep pending > --- > > Key: YUNIKORN-2884 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2884 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: shim - kubernetes >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: pull-request-available > > We will fail task post allocated, but we don't update the pod to terminal > state. > For example we bind pod volume failed post allocated, the pod will not go to > terminal state, it will fail: > Pod event: > {code:java} > Events: > Type Reason Age From Message > -- --- > Normal Scheduling 30s yunikorn dev-nnjxy/pod-btv0y is > queued and waiting for allocation > Normal Scheduled 30s yunikorn Successfully assigned > dev-nnjxy/pod-btv0y to node yktest-worker > Warning PodVolumesBindFailure 20s yunikorn bind volumes to pod failed, > name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded > Normal TaskFailed 20s yunikorn Task dev-nnjxy/pod-btv0y is > failed{code} > Pod pending not going to terminal state > {code:java} > 2024-09-20T11:22:27.601Z INFO shim.fsm cache/task_state.go:381 > Task state transition {"app": "yunikorn-dev-03c96-autogen", "task": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"} > 2024-09-20T11:22:37.606Z DEBUG shim.cache.task cache/task.go:499 > prepare to send release request {"applicationID": > "yunikorn-dev-03c96-autogen", "taskID": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > 
"allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated", > "terminationType": ""} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler > scheduler/scheduler.go:117 enqueued event {"eventType": > "*rmevent.RMUpdateAllocationEvent", "event": > {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task > > completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}}, > "currentQueueSize": 0} > 2024-09-20T11:22:37.606Z ERROR shim.cache.task cache/task.go:475 > task failed {"appID": "yunikorn-dev-03c96-autogen", "taskID": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod > failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline > exceeded"} > 2024-09-20T11:22:37.606Z INFO shim.fsm cache/task_state.go:381 > Task state transition {"app": "yunikorn-dev-03c96-autogen", "task": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", > "source": "Allocated", "destination": "Failed", "event": "TaskFail"} > 2024-09-20T11:22:37.606Z INFO core.scheduler.partition > scheduler/partition.go:1359 removing allocation from application > {"appID": "yunikorn-dev-03c96-autogen", "allocationKey": > "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:132 > Decreasing resource usage {"user": "kubernetes-admin", "queue path": > "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": > "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:152 > Decreasing resource usage for user {"user": "kubernetes-admin", "queue > path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", > "group": "", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:132 Decreasing 
resource usage {"queue path": > "root", "hierarchy": ["root", "dev-03c96"], "application": > "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm > ugm/queue_tracker.go:132 Decreasing resource usage {"queue path": > "root.dev-03c96", "hierarchy": ["dev-03c96"], "application": > "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true} > 2024-09-20T11
[jira] [Updated] (YUNIKORN-2832) [core] Add non-YuniKorn allocation tracking logic
[ https://issues.apache.org/jira/browse/YUNIKORN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2832: --- Summary: [core] Add non-YuniKorn allocation tracking logic (was: [core] Add non-Yunikorn allocation tracking logic) > [core] Add non-YuniKorn allocation tracking logic > - > > Key: YUNIKORN-2832 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2832 > Project: Apache YuniKorn > Issue Type: Sub-task > Components: core - scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Labels: pull-request-available
[jira] [Updated] (YUNIKORN-2354) Add detailed queue information to new queue view
[ https://issues.apache.org/jira/browse/YUNIKORN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2354: --- Summary: Add detailed queue information to new queue view (was: Visualize the current queue that YuniKorn is using) > Add detailed queue information to new queue view > - > > Key: YUNIKORN-2354 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2354 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Dong-Lin Hsieh >Assignee: Dong-Lin Hsieh >Priority: Major > Labels: pull-request-available > > # another tab page > # additional queue info (running applications)
[jira] [Resolved] (YUNIKORN-2354) Add detailed queue information to new queue view
[ https://issues.apache.org/jira/browse/YUNIKORN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2354. Fix Version/s: 1.7.0 Resolution: Fixed Merged to master. > Add detailed queue information to new queue view > - > > Key: YUNIKORN-2354 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2354 > Project: Apache YuniKorn > Issue Type: Sub-task >Reporter: Dong-Lin Hsieh >Assignee: Dong-Lin Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > > # another tab page > # additional queue info (running applications)
[jira] [Resolved] (YUNIKORN-2254) Display MaxRunningApps and RunningApps on Queue View
[ https://issues.apache.org/jira/browse/YUNIKORN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit resolved YUNIKORN-2254. Fix Version/s: 1.7.0 Target Version: 1.7.0 Resolution: Fixed Merged to master. > Display MaxRunningApps and RunningApps on Queue View > > > Key: YUNIKORN-2254 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2254 > Project: Apache YuniKorn > Issue Type: Wish > Components: webapp >Reporter: Chia-Ping Tsai >Assignee: Yun Sun >Priority: Minor > Labels: pull-request-available > Fix For: 1.7.0 > > > The queue view already shows resource-related information, but it lacks > application-related information.
[jira] [Updated] (YUNIKORN-2254) Display MaxRunningApps and RunningApps on Queue View
[ https://issues.apache.org/jira/browse/YUNIKORN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit updated YUNIKORN-2254: --- Summary: Display MaxRunningApps and RunningApps on Queue View (was: queue view should display MaxRunningApps and RunningApps) > Display MaxRunningApps and RunningApps on Queue View > > > Key: YUNIKORN-2254 > URL: https://issues.apache.org/jira/browse/YUNIKORN-2254 > Project: Apache YuniKorn > Issue Type: Wish > Components: webapp >Reporter: Chia-Ping Tsai >Assignee: Yun Sun >Priority: Minor > > The queue view already shows resource-related information, but it lacks > application-related information.