On Tue, 18 Feb 2025 19:42:07 GMT, Chris Plummer <[email protected]> wrote:
>> That's a good question. I hadn't noticed the extra check for step->pending,
>> and it looks like that was also in place for the previous version that
>> relied on calling ClearFramePop on each outstanding NotifyFramePop. I think
>> it is just bit rot as the code initially evolved. If it was possibly for the
>> flag to be cleared by some other thread (I actually think it might be, but
>> need to look closer), this extra check is not of much help since it could be
>> cleared after the check. Let me look into this a bit more.
>
> There are multiple threads that can end up calling clearStep(). However, all
> callers of clearStep() grab the eventHandler_lock() first, with one somewhat
> obscure exception:
>
> debugInit_reset
> threadControl_reset
> removeVThreads
> clearThread
> stepControl_clearRequest
> clearStep
>
> This is done for a debugger reset after a connection has been dropped.
> Probably not at much risk of a race in clearStep, but I think is best to add
> eventHandler locking here also. After doing this there will be no race
> conditions to worry about.
>
> I'm pretty sure this extra check for "pending" is just bit rot. I know I
> wanted an if block around the new code to make it easy to disable so I could
> easily benchmark and test with and without "the fix". I think at one point
> the code looked a lot different (I think there was something outside of the
> if), but that was long ago.
I looked a bit more...
There are multiple threads that can end up calling clearStep(), so we need to
be careful about races. There are two main paths into clearStep.
-The first is through stepControl_endStep. It always grabs the handler_lock and
stepControl_lock.
-The 2nd is through through stepControl_clearRequest, which grabs no locks
itself, but there are some already grabbed when it is called. It should really
grab the stepControl_lock, but it can't because that lock is suppose to be
grabbed before threadControl_lock, which has already been grabbed. The proper
order is handler_lock -> stepControl_lock -> threadControl_lock. So we need to
fix locking for stepControl_clearRequest somehow.
There are 3 paths into stepControl_clearRequest().
1. The first goes through clearThread via removeThread via removeResumed. This
path always holds the handler_lock and threadControl_lock.
2. The 2nd goes through clearThread via removeThread via
threadControl_onEventHandlerExit. This also holds the handler_lock and
threadControl_onEventHandlerExit.
3. The 3rd goes through clearThread via removeVThreads via threadControl_reset.
This path always holds the handler_lock, but not threadControl_lock
Holding the threadControl_lock here does not help us since stepControl_endStep
does not grab that lock. Holding the eventHandler_lock does help since
stepControl_endStep does grab it, but the 3rd path above does not grab the
eventHandler_lock. I think the simplest fix here is to just make it do so. But
that's using the eventHandler_lock to make stepControl code thread safe. Not
exactly clean. So that means instead requiring the caller of
stepControl_endStep to do all the needed locking. Basically add locking of
stepControl_lock somewhere. Adding locking of stepControl_lock in clearThread
is too late since we already hold the threadLock. It really needs to be pushed
up the call chain of each of the above 3 paths to the point just before grabing
the threadLock. For example for the 2nd call path above we have:
if (ei == EI_THREAD_END) {
eventHandler_lock(); /* for proper lock order - see removeThread() call
below */
debugMonitorEnter(threadLock);
So we need to add stepControl_lock() in between those too lines. We need to do
similar for the 1st and 3rd paths also.
I think this is best held off for another PR. I can file a CR for it now with
these notes. It's not a new problem, although might be slightly more at risk of
going awry since we are doing a lot more in clearStep() now, but so far non of
my testing has turned up any issues, so I think a proper fix can wait. It also
relates to the ranked locking support project, which I hope to start back up
soon.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23182#discussion_r1960569052