Case 1:
I am all for simplifying and removing locks. However, changing the SI
like you propose would trigger a YuniKorn 2.0, as it is incompatible
with the current setup. There is a much simpler change that does not
require a 2.0 version. See the comments in the JIRA.

Case 2:
I think this is a bug, and it has nothing to do with locking. The call
to get the priority of the child is made before we have updated the
properties for the child or even applied the template. We should not be
calling it until after the properties are converted into settings in
UpdateQueueProperties. That would also solve the double-lock issue.
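
Roughly that ordering, as a simplified sketch with placeholder types and
names (not the real Queue code): finish converting the properties before
anything asks for the child's priority, and keep that read out of the
parent's lock.

    package sketch

    import "sync"

    // queue is a stand-in for the real Queue type.
    type queue struct {
        sync.RWMutex
        children map[string]*queue
        priority int32
    }

    // newConfiguredQueue sketches the intended ordering: convert the
    // properties (and apply the template) on the child first, then
    // attach it to the parent. Nothing reads the child's priority while
    // the parent's lock is held.
    func newConfiguredQueue(name string, parent *queue) *queue {
        child := &queue{children: map[string]*queue{}}
        child.updateQueueProperties() // stand-in for UpdateQueueProperties
        parent.addChild(name, child)
        return child
    }

    func (q *queue) updateQueueProperties() {
        q.Lock()
        defer q.Unlock()
        q.priority = 0 // placeholder for the real property conversion
    }

    func (q *queue) addChild(name string, child *queue) {
        q.Lock()
        defer q.Unlock()
        q.children[name] = child
    }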

Case 3:
The call(s) made inside the lock in GetPartitionQueueDAOInfo could be
dangerous. The getHeadRoom call walks up the queue hierarchy and takes
locks along the way. It should be moved outside of the existing queue
lock. The comment on the lock even says so, but this one slipped
through.
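
Something along these lines, as a rough self-contained sketch with
made-up types (not the real Queue or DAO code): compute the headroom
before taking this queue's own read lock.

    package sketch

    import "sync"

    // queue is a stand-in for the real Queue type.
    type queue struct {
        sync.RWMutex
        parent *queue
        max    int64 // placeholder for the real Resource type
    }

    // headRoom walks up the hierarchy, taking read locks along the way.
    func (q *queue) headRoom() int64 {
        q.RLock()
        defer q.RUnlock()
        if q.parent == nil {
            return q.max
        }
        if room := q.parent.headRoom(); room < q.max {
            return room
        }
        return q.max
    }

    // daoInfo shows the shape of the fix: the headroom is gathered
    // before we take our own read lock, so the lock is never taken
    // recursively while building the DAO object.
    func (q *queue) daoInfo() map[string]int64 {
        room := q.headRoom() // outside the lock below
        q.RLock()
        defer q.RUnlock()
        return map[string]int64{"headroom": room, "max": q.max}
    }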

Wilfred

On Sun, 7 Apr 2024 at 05:33, Craig Condit <ccon...@apache.org> wrote:
>
> I’m all for fixing these… and in general where lockless algorithms can be 
> implemented cleanly, I’m in favor of those implementations instead of 
> requiring locks, so for RMProxy I’m +1 on that. The extra memory for an 
> RMProxy instance is irrelevant.
>
> The recursive locking case is a real problem, and I’m surprised that hasn’t 
> bitten us harder. It can cause all sorts of issues.
>
> Craig
>
> > On Apr 6, 2024, at 2:19 PM, Peter Bacsko <bacs...@gmail.com> wrote:
> >
> > Hi all,
> >
> > After YUNIKORN-2539 got merged, we identified some potential deadlocks.
> > These are false positives for now, but a small change could cause YuniKorn
> > to fall apart, so the term "potential deadlock" describes them properly.
> >
> > Thoughts and opinions are welcome. IMO we should handle these with priority
> > to ensure stability.
> >
> > *1. Locking order: Cache→RMProxy vs RMProxy→Cache*
> >
> > We grab the locks of these two types in a different order (read the stacks
> > bottom to top):
> >
> > pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback <<<<<
> > pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback
> > pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode
> > cache.(*Context).updateNodeResources
> > cache.(*Context).updateNodeOccupiedResources
> > cache.(*Context).updateForeignPod
> > cache.(*Context).UpdatePod
> >
> > cache.(*Context).ForgetPod <<<<<
> > cache.(*Context).ForgetPod
> > cache.(*AsyncRMCallback).UpdateAllocation
> > rmproxy.(*RMProxy).triggerUpdateAllocation
> > rmproxy.(*RMProxy).processRMReleaseAllocationEvent
> > rmproxy.(*RMProxy).handleRMEvents
> >
> > I already created a JIRA for this one:
> > https://issues.apache.org/jira/browse/YUNIKORN-2543.
> >
> > Luckily we only call RLock() inside RMProxy while processing allocations,
> > but if this changes, a deadlock is guaranteed. I made a POC that creates a
> > lockless RMProxy instance via a factory method; see my comment in the JIRA.
> > The scheduler core compiles and all tests pass.
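> >
> > To sketch the idea (hypothetical names and a placeholder callback type, not
> > the actual POC code): the factory resolves everything up front, so reads
> > never need a lock.
> >
> > package sketch
> >
> > // Callback stands in for the SI ResourceManagerCallback interface.
> > type Callback interface{}
> >
> > // rmProxy is immutable after construction, so no mutex is needed.
> > type rmProxy struct {
> >     rmID     string
> >     callback Callback
> > }
> >
> > // newRMProxy is the factory: any registration bookkeeping that needs a lock
> > // happens before the instance is handed out.
> > func newRMProxy(rmID string, cb Callback) *rmProxy {
> >     return &rmProxy{rmID: rmID, callback: cb}
> > }
> >
> > // GetResourceManagerCallback can be lock-free because the fields never
> > // change after the factory returns.
> > func (p *rmProxy) GetResourceManagerCallback() Callback {
> >     return p.callback
> > }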
> >
> > *2. Lock order: multiple lock calls on the same goroutine that interfere
> > with another goroutine*
> >
> > We shouldn't call RLock() multiple times or after Lock() on the same mutex
> > instance. This is dangerous as described here:
> > https://github.com/sasha-s/go-deadlock/?tab=readme-ov-file#grabbing-an-rlock-twice-from-the-same-goroutine
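> >
> > A minimal standalone illustration of that failure mode (plain Go, not
> > YuniKorn code; the program deadlocks by design):
> >
> > package main
> >
> > import (
> >     "sync"
> >     "time"
> > )
> >
> > var mu sync.RWMutex
> >
> > func main() {
> >     go func() {
> >         mu.RLock()
> >         time.Sleep(100 * time.Millisecond) // let the writer queue up
> >         // a pending Lock() blocks new readers, so this second RLock()
> >         // on the same goroutine never returns
> >         mu.RLock()
> >         mu.RUnlock()
> >         mu.RUnlock()
> >     }()
> >     time.Sleep(50 * time.Millisecond)
> >     mu.Lock() // arrives between the two RLock() calls and waits forever
> >     mu.Unlock()
> > }
> >
> > The actual YuniKorn traces: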
> >
> > objects.(*Queue).GetCurrentPriority <<<<<   ← Queue RLock
> > objects.(*Queue).GetCurrentPriority
> > objects.(*Queue).addChildQueue   ← Queue Lock
> > objects.NewConfiguredQueue
> > scheduler.(*PartitionContext).addQueue
> > scheduler.(*PartitionContext).addQueue
> > scheduler.(*PartitionContext).initialPartitionFromConfig
> > scheduler.newPartitionContext
> > scheduler.(*ClusterContext).updateSchedulerConfig
> > scheduler.(*ClusterContext).processRMRegistrationEvent
> > scheduler.(*Scheduler).handleRMEvent
> >
> > objects.(*Queue).internalHeadRoom <<<<<   ← Queue RLock #2
> > objects.(*Queue).internalHeadRoom
> > objects.(*Queue).getHeadRoom
> > objects.(*Queue).getHeadRoom
> > objects.(*Queue).GetPartitionQueueDAOInfo   ← Queue RLock #1
> > objects.(*Queue).GetPartitionQueueDAOInfo
> > objects.(*Queue).GetPartitionQueueDAOInfo
> > scheduler.(*PartitionContext).GetPartitionQueues
> > webservice.getPartitionQueues
> > http.HandlerFunc.ServeHTTP
> >
> > No JIRA yet. Fix looks straightforward.
> >
> > *3. Recursive locking*
> >
> > It's basically the same as #2, but it's not about an interaction between
> > two goroutines: it's a single goroutine that grabs RLock() multiple times.
> > This can be dangerous in itself.
> >
> > objects.(*Queue).internalHeadRoom <<<<<   ← Queue RLock #2
> > objects.(*Queue).internalHeadRoom
> > objects.(*Queue).getHeadRoom
> > objects.(*Queue).getHeadRoom
> > objects.(*Queue).GetPartitionQueueDAOInfo   ← Queue RLock #1
> > objects.(*Queue).GetPartitionQueueDAOInfo
> > objects.(*Queue).GetPartitionQueueDAOInfo
> > scheduler.(*PartitionContext).GetPartitionQueues
> > webservice.getPartitionQueues
> > http.HandlerFunc.ServeHTTP
> >
> > No JIRA yet. Fix looks straightforward.
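> >
> > The usual shape of that kind of fix, as a generic sketch (hypothetical
> > names, not the actual change): take the lock once at the entry point and
> > have the internal helpers assume it is already held.
> >
> > package sketch
> >
> > import "sync"
> >
> > type queue struct {
> >     sync.RWMutex
> >     max int64 // placeholder for the real headroom data
> > }
> >
> > // HeadRoom is the only place that takes the lock.
> > func (q *queue) HeadRoom() int64 {
> >     q.RLock()
> >     defer q.RUnlock()
> >     return q.headRoomLocked()
> > }
> >
> > // headRoomLocked must only be called with q's lock already held, so
> > // internal callers never re-lock the same mutex.
> > func (q *queue) headRoomLocked() int64 {
> >     return q.max
> > }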
> >
> > I can see more reports from the tool, but these are the most obvious ones.
> > Not every "potential deadlock" is entirely clear to me at the moment. See
> > the JIRA attachment
> > https://issues.apache.org/jira/secure/attachment/13067895/running-logs.txt.
> >
> > These add a lot of noise to the log files that were uploaded to
> > YUNIKORN-2521, which can slow down our efforts to track down the real
> > problem.
> >
> > I suggest fixing all of these.
> >
> > Peter
>
>
