On Mon, 4 May 2026 19:20:45 GMT, Patricio Chilano Mateo 
<[email protected]> wrote:

>> David Holmes has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fix typo
>
> src/hotspot/share/runtime/handshake.cpp line 718:
> 
>> 716: 
>> 717:   HandshakeOperation* op = get_op();
>> 718:   assert(op != nullptr, "Must have an op");
> 
> There is a subtle race now where we can hit this assert. The sequence of 
> events go as follows:
> 1 - `Thread1` adds handshake operation to `TargetThread's` queue. Sees 
> `TargetThread's` as handshake safe and proceeds to `claim_handshake()`.
> 2 - Before `Thread1` calls `claim_handshake`, `TargetThread’s` processes the 
> operation and removes it from the queue.
> 3 - `TargetThread’s` calls `GetPrimitiveArrayCritical`. Before incrementing 
> `_jni_deferred_suspension_count`, `Thread2` adds suspend handshake to 
> `TargetThread’s` queue.
> 4 - `Thread1` calls `claim_handshake` and sees that there is a pending 
> operation that is enabled.
> 5 - `TargetThread’s` increments `_jni_deferred_suspension_count` and 
> continues execution in critical native code.
> 6 - `Thread1` calls `can_process_handshake()` and sees `TargetThread` in 
> native. Calls `get_op` but the only operation in the queue is not enabled.
> 
> I created a test with a couple of artificial delays in the VM which triggers 
> this assert: 
> https://github.com/openjdk/jdk/compare/pr/30936...pchilano:jdk:8373839-jni-crit-suspend-stresstest
> 
> One way of fixing this could be to not allow others threads besides the 
> requester/target to process the suspend handshake. That means 
> `HandshakeClosure::is_enabled` would need to also take `_requester`.
> I’m also thinking if we could move `can_process_handshake` to 
> `claim_handshake` after grabbing the monitor and before checking the queue, 
> i.e. we first check if the thread is in a handshake safe state.

@pchilano It seems to me that a simple fix is to convert the assert into a 
`return _no_operation` if `getOp()` returns null. In your scenario `thread1` 
will now return from `try_process` and see that the op it submitted has 
completed. Meanwhile `thread` continues to wait until the suspend op is 
processed, once the target leaves the critical region.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30936#discussion_r3199413902

Reply via email to