Re: [DISCUSSION] Yunikorn release 1.5.1

2024-05-08 Thread Peter Bacsko
I'd like to start the release process.

The following items will be delivered as part of 1.5.1:
https://issues.apache.org/jira/issues/?filter=12353383

No features in this release, only bugfixes. No further items are considered
(unless something critical is found).

Planned schedule:
RC1 out: 10th May
Voting: from 10th May to early next week, 13th-14th May
Release: 15-16th May

Thanks,
Peter

On Thu, May 2, 2024 at 2:11 AM Shravan Achar wrote:

> Have been helping Peter with YUNIKORN-2526, and it has been a tricky
> problem to reproduce and resolve. It makes sense to continue to make
> progress on it without blocking the 1.5.1 patch release as it has
> considerable fixes already (re: deadlock)
>
> Shravan
>
> On 2024/04/29 15:20:27 Peter Bacsko wrote:
> > Hey Wilfred,
> >
> > Yes, I'm taking the role of release manager.
> > I cherry-picked YUNIKORN-2520 to branch-1.5.
> >
> > Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
> > YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
> > was not urgent, but depending on how quickly he makes progress, we might
> > re-consider our position later.
> >
> > Peter
> >
> > On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:
> >
> > > Peter,
> > >
> > > Thank you for starting this discussion. See inline for further comments.
> > >
> > > > Hi all,
> > > >
> > > > Due to the number of problems that we have discovered since the release of
> > > > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > > > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > > > (at least since leaving the ASF incubator), so this would be the first
> > > > minor Yunikorn release.
> > >
> > > +1
> > > I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> > > Looking at all the work you have done for this release: would you be
> > > willing to also step up as a release manager for the 1.5.1 release?
> > >
> > > > There are a bunch of fixes that are already on branch-1.5:
> > > >
> > > >    - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
> > > >    - YUNIKORN-2539 Add optional deadlock detection
> > > >    - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
> > > >       - YUNIKORN-2543 Fix locking in RMProxy
> > > >       - YUNIKORN-2545 Eliminate multiple lock calls from Queue
> > > >       - YUNIKORN-2548 Potential deadlock during concurrent
> > > >         bottom-up/top-down queue traversal
> > > >       - YUNIKORN-2550 Fix locking in PartitionContext
> > > >       - YUNIKORN-2552 Recursive locking when sending remove queue event
> > > >       - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
> > > >       - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
> > > >       - YUNIKORN-2574 totalPartitionResource should not be mutated with
> > > >         AddTo/SubFrom
> > > >       - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()
> > > >
> > >
> > > Yes for all the above.
> > >
> > > > The following is In Progress for 1.5.1:
> > > >
> > > >    - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
> > > >      after scheduler restart
> > >
> > > This would be a good one to get in if we have some progress on this.
> > > Do we understand what is going on yet? I looked at the jira and am not
> > > sure if we understand the root cause.
> > >
> > > > Candidates:
> > > >
> > > >    - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
> > > >      Resolved, only cherry-picking is needed
> > >
> > > Yes, this could be added.
> > >
> > > I also think we need to check if we have any CVE fixes that need to be added.
> > > Quick check shows these two:
> > > * golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
> > > * google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
> > > * build with golang 1.21.9
> > >
> > > To satisfy the scanners, although we are not affected:
> > > * K8s 1.29.4 (CVE-2024-3177)
> > >
> > >
> > > >    - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
> > > >      progress" since Oct 2023
> > > >    - YUNIKORN-1089 Application handling with invalid task group annotations -
> > > >      Critical priority, no progress
> > > >    - YUNIKORN-1988 Preemption happens when a queue lower than its
> > > >      guaranteed capacity - Critical priority, "In progress" since Sep 2023
> > >
> > > No for the last 3 mentioned. We did not block the 1.5.0 release on
> > > these and they have not made enough progress since then.
> > > I would not consider them as a possible candidate for 1.5.1
> > >
> > > Wilfred
> > >
> > > >
> > > > Thoughts, opinions? What should be the scope of 1.5.1?
> > > >
> > > > Thanks,
> > > > Peter
> > >

[DISCUSSION] Yunikorn release 1.5.1

2024-05-01 Thread Shravan Achar
Have been helping Peter with YUNIKORN-2526, and it has been a tricky problem to 
reproduce and resolve. It makes sense to continue to make progress on it 
without blocking the 1.5.1 patch release as it has considerable fixes already 
(re: deadlock)

Shravan

On 2024/04/29 15:20:27 Peter Bacsko wrote:
> Hey Wilfred,
> 
> Yes, I'm taking the role of release manager.
> I cherry-picked YUNIKORN-2520 to branch-1.5.
> 
> Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
> YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
> was not urgent, but depending on how quickly he makes progress, we might
> re-consider our position later.
> 
> Peter
> 
> On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:
> 
> > Peter,
> >
> > Thank you for starting this discussion. See inline for further comments.
> >
> > > Hi all,
> > >
> > > Due to the number of problems that we have discovered since the release of
> > > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > > (at least since leaving the ASF incubator), so this would be the first
> > > minor Yunikorn release.
> >
> > +1
> > I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> > Looking at all the work you have done for this release: would you be
> > willing to also step up as a release manager for the 1.5.1 release?
> >
> > > There are a bunch of fixes that are already on branch-1.5:
> > >
> > >    - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
> > >    - YUNIKORN-2539 Add optional deadlock detection
> > >    - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
> > >       - YUNIKORN-2543 Fix locking in RMProxy
> > >       - YUNIKORN-2545 Eliminate multiple lock calls from Queue
> > >       - YUNIKORN-2548 Potential deadlock during concurrent
> > >         bottom-up/top-down queue traversal
> > >       - YUNIKORN-2550 Fix locking in PartitionContext
> > >       - YUNIKORN-2552 Recursive locking when sending remove queue event
> > >       - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
> > >       - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
> > >       - YUNIKORN-2574 totalPartitionResource should not be mutated with
> > >         AddTo/SubFrom
> > >       - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()
> > >
> >
> > Yes for all the above.
> >
> > > The following is In Progress for 1.5.1:
> > >
> > >    - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
> > >      after scheduler restart
> >
> > This would be a good one to get in if we have some progress on this.
> > Do we understand what is going on yet? I looked at the jira and am not
> > sure if we understand the root cause.
> >
> > > Candidates:
> > >
> > >    - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
> > >      Resolved, only cherry-picking is needed
> >
> > Yes, this could be added.
> >
> > I also think we need to check if we have any CVE fixes that need to be
> > added.
> > Quick check shows these two:
> > * golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
> > * google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
> > * build with golang 1.21.9
> >
> > To satisfy the scanners, although we are not affected:
> > * K8s 1.29.4 (CVE-2024-3177)
> >
> >
> > >    - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
> > >      progress" since Oct 2023
> > >    - YUNIKORN-1089 Application handling with invalid task group annotations -
> > >      Critical priority, no progress
> > >    - YUNIKORN-1988 Preemption happens when a queue lower than its
> > >      guaranteed capacity - Critical priority, "In progress" since Sep 2023
> >
> > No for the last 3 mentioned. We did not block the 1.5.0 release on
> > these and they have not made enough progress since then.
> > I would not consider them as a possible candidate for 1.5.1
> >
> > Wilfred
> >
> > >
> > > Thoughts, opinions? What should be the scope of 1.5.1?
> > >
> > > Thanks,
> > > Peter
> >
> >
> >
> 

Re: [DISCUSSION] Yunikorn release 1.5.1

2024-04-29 Thread Peter Bacsko
Hey Wilfred,

Yes, I'm taking the role of release manager.
I cherry-picked YUNIKORN-2520 to branch-1.5.

Regarding the remaining JIRAs, I asked PoAn Yang on Slack to take a look at
YUNIKORN-2057 as he originally volunteered to solve it. I told him that it
was not urgent, but depending on how quickly he makes progress, we might
re-consider our position later.

Peter

On Mon, Apr 29, 2024 at 5:00 AM Wilfred Spiegelenburg wrote:

> Peter,
>
> Thank you for starting this discussion. See inline for further comments.
>
> > Hi all,
> >
> > Due to the number of problems that we have discovered since the release of
> > 1.5.0, I believe it makes sense to create a new Yunikorn release which
> > consists of bug fixes only. If I'm not mistaken we haven't done this before
> > (at least since leaving the ASF incubator), so this would be the first
> > minor Yunikorn release.
>
> +1
> I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
> Looking at all the work you have done for this release: would you be
> willing to also step up as a release manager for the 1.5.1 release?
>
> > There are a bunch of fixes that are already on branch-1.5:
> >
> >    - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
> >    - YUNIKORN-2539 Add optional deadlock detection
> >    - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
> >       - YUNIKORN-2543 Fix locking in RMProxy
> >       - YUNIKORN-2545 Eliminate multiple lock calls from Queue
> >       - YUNIKORN-2548 Potential deadlock during concurrent
> >         bottom-up/top-down queue traversal
> >       - YUNIKORN-2550 Fix locking in PartitionContext
> >       - YUNIKORN-2552 Recursive locking when sending remove queue event
> >       - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
> >       - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
> >       - YUNIKORN-2574 totalPartitionResource should not be mutated with
> >         AddTo/SubFrom
> >       - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()
> >
>
> Yes for all the above.
>
> > The following is In Progress for 1.5.1:
> >
> >    - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
> >      after scheduler restart
>
> This would be a good one to get in if we have some progress on this.
> Do we understand what is going on yet? I looked at the jira and am not
> sure if we understand the root cause.
>
> > Candidates:
> >
> >    - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
> >      Resolved, only cherry-picking is needed
>
> Yes, this could be added.
>
> I also think we need to check if we have any CVE fixes that need to be
> added.
> Quick check shows these two:
> * golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
> * google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
> * build with golang 1.21.9
>
> To satisfy the scanners, although we are not affected:
> * K8s 1.29.4 (CVE-2024-3177)
>
>
> >    - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
> >      progress" since Oct 2023
> >    - YUNIKORN-1089 Application handling with invalid task group annotations -
> >      Critical priority, no progress
> >    - YUNIKORN-1988 Preemption happens when a queue lower than its
> >      guaranteed capacity - Critical priority, "In progress" since Sep 2023
>
> No for the last 3 mentioned. We did not block the 1.5.0 release on
> these and they have not made enough progress since then.
> I would not consider them as a possible candidate for 1.5.1
>
> Wilfred
>
> >
> > Thoughts, opinions? What should be the scope of 1.5.1?
> >
> > Thanks,
> > Peter
>
>
>


Re: [DISCUSSION] Yunikorn release 1.5.1

2024-04-28 Thread Wilfred Spiegelenburg
Peter,

Thank you for starting this discussion. See inline for further comments.

> Hi all,
>
> Due to the number of problems that we have discovered since the release of
> 1.5.0, I believe it makes sense to create a new Yunikorn release which
> consists of bug fixes only. If I'm not mistaken we haven't done this before
> (at least since leaving the ASF incubator), so this would be the first
> minor Yunikorn release.

+1
I am totally for releasing YuniKorn 1.5.1 with the lock fixes.
Looking at all the work you have done for this release: would you be
willing to also step up as a release manager for the 1.5.1 release?

> There are a bunch of fixes that are already on branch-1.5:
>
>    - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
>    - YUNIKORN-2539 Add optional deadlock detection
>    - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
>       - YUNIKORN-2543 Fix locking in RMProxy
>       - YUNIKORN-2545 Eliminate multiple lock calls from Queue
>       - YUNIKORN-2548 Potential deadlock during concurrent
>         bottom-up/top-down queue traversal
>       - YUNIKORN-2550 Fix locking in PartitionContext
>       - YUNIKORN-2552 Recursive locking when sending remove queue event
>       - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
>       - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
>       - YUNIKORN-2574 totalPartitionResource should not be mutated with
>         AddTo/SubFrom
>       - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()
>

Yes for all the above.

> The following is In Progress for 1.5.1:
>
>    - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
>      after scheduler restart

This would be a good one to get in if we have some progress on this.
Do we understand what is going on yet? I looked at the jira and am not
sure if we understand the root cause.

> Candidates:
>
>    - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
>      Resolved, only cherry-picking is needed

Yes, this could be added.

I also think we need to check if we have any CVE fixes that need to be added.
Quick check shows these two:
* golang.org/x/net 0.23 (CVE-2023-45288 or GO-2024-2687 via YUNIKORN-2541)
* google.golang.org/protobuf to v1.33.0 (CVE-2024-24786 via YUNIKORN-2469)
* build with golang 1.21.9

To satisfy the scanners, although we are not affected:
* K8s 1.29.4 (CVE-2024-3177)


>    - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
>      progress" since Oct 2023
>    - YUNIKORN-1089 Application handling with invalid task group annotations -
>      Critical priority, no progress
>    - YUNIKORN-1988 Preemption happens when a queue lower than its
>      guaranteed capacity - Critical priority, "In progress" since Sep 2023

No for the last 3 mentioned. We did not block the 1.5.0 release on
these and they have not made enough progress since then.
I would not consider them as a possible candidate for 1.5.1

Wilfred

>
> Thoughts, opinions? What should be the scope of 1.5.1?
>
> Thanks,
> Peter

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[DISCUSSION] Yunikorn release 1.5.1

2024-04-28 Thread Peter Bacsko
Hi all,

Due to the number of problems that we have discovered since the release of
1.5.0, I believe it makes sense to create a new Yunikorn release which
consists of bug fixes only. If I'm not mistaken we haven't done this before
(at least since leaving the ASF incubator), so this would be the first
minor Yunikorn release.

There are a bunch of fixes that are already on branch-1.5:

   - YUNIKORN-2521 Scheduler deadlock (resolved indirectly by YUNIKORN-2544)
   - YUNIKORN-2539 Add optional deadlock detection
   - YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
      - YUNIKORN-2543 Fix locking in RMProxy
      - YUNIKORN-2545 Eliminate multiple lock calls from Queue
      - YUNIKORN-2548 Potential deadlock during concurrent
        bottom-up/top-down queue traversal
      - YUNIKORN-2550 Fix locking in PartitionContext
      - YUNIKORN-2552 Recursive locking when sending remove queue event
      - YUNIKORN-2553 [core] Enable deadlock detection during unit tests
      - YUNIKORN-2563 [shim] Enable deadlock detection during unit tests
      - YUNIKORN-2574 totalPartitionResource should not be mutated with
        AddTo/SubFrom
      - YUNIKORN-2562 Nil pointer panic in Application.ReplaceAllocation()


The following is In Progress for 1.5.1:

   - YUNIKORN-2526 Discrepancy between shim cache and core app/task list
   after scheduler restart


Candidates:

   - YUNIKORN-2520 PVC errors in AssumePod() are not handled properly -
   Resolved, only cherry-picking is needed
   - YUNIKORN-2057 FindQueueByAppID is slow - Critical priority, "In
   progress" since Oct 2023
   - YUNIKORN-1089 Application handling with invalid task group annotations -
   Critical priority, no progress
   - YUNIKORN-1988 Preemption happens when a queue lower than its
   guaranteed capacity - Critical priority, "In progress" since Sep 2023


Thoughts, opinions? What should be the scope of 1.5.1?

Thanks,
Peter