Re: [DISCUSSION] (Potential) Locking issues in Yunikorn

2024-04-11 Thread Wilfred Spiegelenburg
Scheduling is lock free when we get to any of the application.Try...() calls. The scheduling thread does not hold any locks until we get there. That was how it was designed and implemented. When we get there nothing but the scheduling thread is allowed to make changes to the application for the

Re: [DISCUSSION] (Potential) Locking issues in Yunikorn

2024-04-11 Thread Peter Bacsko
Thanks for the replies. I made good progress on this issue. There's one thing I'd like to discuss. It's not critical, but it needs to be addressed IMO. The scope of the mutex-protected critical section is too large in tryAllocate, tryReservedAllocate and
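
To make the point about critical-section scope concrete, here is a minimal Go sketch of one common way to narrow a lock: take a snapshot under the mutex, evaluate candidates without holding it, then re-acquire it only to commit. The `App` type, its fields, `removeAskLocked`, and `tryAllocateNarrow` are hypothetical names for illustration and are not taken from the YuniKorn code base. Whether releasing the lock mid-way is safe depends on what other goroutines may change in the meantime, which is why the commit step re-checks that the ask still exists.

```go
package main

import (
	"fmt"
	"sync"
)

// App is a hypothetical stand-in for an application object; it is not YuniKorn's type.
type App struct {
	mu        sync.Mutex
	asks      []int // pending ask sizes (illustrative only)
	allocated int
}

// removeAskLocked removes one matching ask. The caller must hold a.mu.
func (a *App) removeAskLocked(ask int) bool {
	for i, v := range a.asks {
		if v == ask {
			a.asks = append(a.asks[:i], a.asks[i+1:]...)
			return true
		}
	}
	return false
}

// tryAllocateNarrow holds the mutex only while reading or mutating state.
// The potentially slow fits() check runs without the lock held.
func (a *App) tryAllocateNarrow(fits func(int) bool) (int, bool) {
	a.mu.Lock()
	snapshot := append([]int(nil), a.asks...) // copy so the slice can be read unlocked
	a.mu.Unlock()

	for _, ask := range snapshot {
		if !fits(ask) {
			continue
		}
		a.mu.Lock()
		ok := a.removeAskLocked(ask) // re-validate: the ask may be gone by now
		if ok {
			a.allocated += ask
		}
		a.mu.Unlock()
		if ok {
			return ask, true
		}
	}
	return 0, false
}

func main() {
	app := &App{asks: []int{8, 4, 2}}
	ask, ok := app.tryAllocateNarrow(func(size int) bool { return size <= 4 })
	fmt.Println(ask, ok, app.allocated, app.asks)
}
```

The trade-off in this sketch is a second lock acquisition and a re-validation step in exchange for not holding the mutex during the expensive check.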

Re: [DISCUSSION] (Potential) Locking issues in Yunikorn

2024-04-07 Thread Wilfred Spiegelenburg
Case 1: I am all for simplifying and removing locks. Changing the SI like you propose will trigger a YuniKorn 2.0 as it is incompatible with the current setup. There is a much simpler change that does not require a 2.0 version. See comments in the jira. Case 2: This is a bug I think, which has

Re: [DISCUSSION] (Potential) Locking issues in Yunikorn

2024-04-06 Thread Craig Condit
I’m all for fixing these… and in general where lockless algorithms can be implemented cleanly, I’m in favor of those implementations instead of requiring locks, so for RMProxy I’m +1 on that. The extra memory for an RMProxy instance is irrelevant. The recursive locking case is a real problem,
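
For readers unfamiliar with why recursive locking is hazardous in Go: `sync.Mutex` is not reentrant, and the `sync.RWMutex` documentation notes that a blocked `Lock` call prevents new readers from acquiring the lock, which rules out recursive read-locking. A goroutine that calls `RLock` twice can therefore deadlock if a writer arrives in between. The sketch below shows one widely used way to avoid this: exported methods take the lock once and delegate to unexported helpers that assume it is already held. The `Queue` type and its methods are hypothetical and do not come from the YuniKorn code base.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a hypothetical example type; it is not YuniKorn's queue.
type Queue struct {
	mu      sync.RWMutex
	name    string
	pending int
}

// Pending is the exported accessor: it takes the read lock exactly once.
func (q *Queue) Pending() int {
	q.mu.RLock()
	defer q.mu.RUnlock()
	return q.pendingLocked()
}

// pendingLocked assumes the caller already holds q.mu (read or write).
func (q *Queue) pendingLocked() int {
	return q.pending
}

// Summary already holds the read lock, so it calls pendingLocked rather than
// Pending. Calling Pending here would re-acquire RLock on the same goroutine
// and could deadlock if a writer is waiting between the two acquisitions.
func (q *Queue) Summary() string {
	q.mu.RLock()
	defer q.mu.RUnlock()
	return fmt.Sprintf("%s: pending=%d", q.name, q.pendingLocked())
}

// AddPending takes the write lock to mutate state.
func (q *Queue) AddPending(n int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending += n
}

func main() {
	q := &Queue{name: "root.default"}
	q.AddPending(3)
	fmt.Println(q.Summary())
}
```

The naming convention (a `Locked` suffix for helpers that expect the lock to be held) is only one option; the key point is that no method reachable from a locked section tries to take the same lock again.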

[DISCUSSION] (Potential) Locking issues in Yunikorn

2024-04-06 Thread Peter Bacsko
Hi all, after YUNIKORN-2539 got merged, we identified some potential deadlocks. These are false positives now, but a small change can cause Yunikorn to fall apart, so the term "potential deadlock" describes them properly. Thoughts and opinions are welcome. IMO we should handle these with priority to