Re: [DISCUSSION] Yunikorn release 1.5.1
I'd like to start the release process.

The following items will be delivered as part of 1.5.1:
https://issues.apache.org/jira/issues/?filter=12353383

No features in this release, only bugfixes. No further items are considered
(unless something critical is found).

Planned schedule:
RC1 out: 10th May
Voting: from 10th May to early next week, 13th-14th May
Release: 15-16th May

Thanks,
Peter

On Thu, May 2, 2024 at 2:11 AM Shravan Achar wrote:

> Have been helping Peter with YUNIKORN-2526, and it has been a tricky
> problem to reproduce and resolve. It makes sense to continue to make
> progress on it without blocking the 1.5.1 patch release as it has
> considerable fixes already (re: deadlock)
>
> Shravan
>
> On 2024/04/29 15:20:27 Peter Bacsko wrote:
> > Hey Wilfred,
> >
> > Yes, I'm taking the role of release manager.
> > I cherry-picked YUNIKORN-2520 to branch-1.5.
> >
> > Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
> > YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
> > was not urgent, but depending on how quickly he makes progress, we might
> > re-consider our position later.
> >
> > Peter
> >
> > On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:
> >
> > > Peter,
> > >
> > > Thank you for starting this discussion. See inline for further comments.
> > >
> > > > Hi all,
> > > >
> > > > Due to the number of problems that we have discovered since the release of
> > > > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > > > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > > > (at least since leaving the ASF incubator), so this would be the first
> > > > minor Yunikorn release.
> > >
> > > +1
> > > I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> > > Looking at all the work you have done for this release: would you be
> > > willing to also step up as a release manager for the 1.5.1 release?
[DISCUSSION] Yunikorn release 1.5.1
Have been helping Peter with YUNIKORN-2526, and it has been a tricky
problem to reproduce and resolve. It makes sense to continue to make
progress on it without blocking the 1.5.1 patch release as it has
considerable fixes already (re: deadlock)

Shravan

On 2024/04/29 15:20:27 Peter Bacsko wrote:
> Hey Wilfred,
>
> Yes, I'm taking the role of release manager.
> I cherry-picked YUNIKORN-2520 to branch-1.5.
>
> Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
> YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
> was not urgent, but depending on how quickly he makes progress, we might
> re-consider our position later.
>
> Peter
>
> On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:
>
> > Peter,
> >
> > Thank you for starting this discussion. See inline for further comments.
> >
> > > Hi all,
> > >
> > > Due to the number of problems that we have discovered since the release of
> > > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > > (at least since leaving the ASF incubator), so this would be the first
> > > minor Yunikorn release.
> >
> > +1
> > I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> > Looking at all the work you have done for this release: would you be
> > willing to also step up as a release manager for the 1.5.1 release?
Re: [DISCUSSION] Yunikorn release 1.5.1
Hey Wilfred,

Yes, I'm taking the role of release manager.
I cherry-picked YUNIKORN-2520 to branch-1.5.

Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
was not urgent, but depending on how quickly he makes progress, we might
re-consider our position later.

Peter

On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:

> Peter,
>
> Thank you for starting this discussion. See inline for further comments.
>
> > Hi all,
> >
> > Due to the number of problems that we have discovered since the release of
> > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > (at least since leaving the ASF incubator), so this would be the first
> > minor Yunikorn release.
>
> +1
> I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> Looking at all the work you have done for this release: would you be
> willing to also step up as a release manager for the 1.5.1 release?
Re: [DISCUSSION] Yunikorn release 1.5.1
Peter,

Thank you for starting this discussion. See inline for further comments.

> Hi all,
>
> Due to the number of problems that we have discovered since the release of
> 1.5.0, I believe it makes sense to create a new Yunikorn release which
> consists of bug fixes only. If I'm not mistaken we haven't done this before
> (at least since leaving the ASF incubator), so this would be the first
> minor Yunikorn release.

+1
I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
Looking at all the work you have done for this release: would you be
willing to also step up as a release manager for the 1.5.1 release?

> There are a bunch of fixes that are already on branch-1.5:
>
> - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
> - YUNIKORN-2539 Add optional deadlock detection
> - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
>   - YUNIKORN-2543 Fix locking in RMProxy
>   - YUNIKORN-2545 Eliminate multiple lock calls from Queue
>   - YUNIKORN-2548 Potential deadlock during concurrent
>     bottom-up/top-down queue traversal
>   - YUNIKORN-2550 Fix locking in PartitionContext
>   - YUNIKORN-2552 Recursive locking when sending remove queue event
>   - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
>   - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
>   - YUNIKORN-2574 totalPartitionResource should not be mutated with
>     AddTo/SubFrom
>   - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()

Yes for all the above.

> The following is In Progress for 1.5.1:
>
> - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
>   after scheduler restart

This would be a good one to get in if we have some progress on this.
Do we understand what is going on yet? I looked at the jira and am not
sure if we understand the root cause.

> Candidates:
>
> - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
>   Resolved, only cherry-picking is needed

Yes, this could be added.

I also think we need to check if we have any CVE fixes that need to be added.
Quick check shows these two:
* golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
* google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
* build with golang 1.21.9

To satisfy the scanners, although we are not affected:
* K8s 1.29.4 (CVE-2024-3177)

> - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
>   progress" since Oct 2023
> - YUNIKORN-1089 Application handling with invalid task group annotations
>   - Critical priority, no progress
> - YUNIKORN-1988 Preemption happens when a queue lower than its
>   guaranteed capacity - Critical priority, "In progress" since Sep 2023

No for the last 3 mentioned. We did not block the 1.5.0 release on
these and they have not made enough progress since then.
I would not consider them as a possible candidate for 1.5.1

Wilfred

> Thoughts, opinions? What should be the scope of 1.5.1?
>
> Thanks,
> Peter
[DISCUSSION] Yunikorn release 1.5.1
Hi all,

Due to the number of problems that we have discovered since the release of
1.5.0, I believe it makes sense to create a new Yunikorn release which
consists of bug fixes only. If I'm not mistaken we haven't done this before
(at least since leaving the ASF incubator), so this would be the first
minor Yunikorn release.

There are a bunch of fixes that are already on branch-1.5:

- YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
- YUNIKORN-2539 Add optional deadlock detection
- YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
  - YUNIKORN-2543 Fix locking in RMProxy
  - YUNIKORN-2545 Eliminate multiple lock calls from Queue
  - YUNIKORN-2548 Potential deadlock during concurrent
    bottom-up/top-down queue traversal
  - YUNIKORN-2550 Fix locking in PartitionContext
  - YUNIKORN-2552 Recursive locking when sending remove queue event
  - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
  - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
  - YUNIKORN-2574 totalPartitionResource should not be mutated with
    AddTo/SubFrom
  - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()

The following is In Progress for 1.5.1:

- YUNIKORN-2526 Discrepancy between shim cache and core app/task list
  after scheduler restart

Candidates:

- YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
  Resolved, only cherry-picking is needed
- YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
  progress" since Oct 2023
- YUNIKORN-1089 Application handling with invalid task group annotations
  - Critical priority, no progress
- YUNIKORN-1988 Preemption happens when a queue lower than its
  guaranteed capacity - Critical priority, "In progress" since Sep 2023

Thoughts, opinions? What should be the scope of 1.5.1?

Thanks,
Peter