Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-30 Thread Roman Kennke
On Thu, 30 Mar 2023 11:45:54 GMT, Roman Kennke  wrote:

>> I have another question about the asymmetric unlocking code in 
>> `InterpreterMacroAssembler::unlock_object`.
>> 
>> We go through here for both fast-locked and fat OM locks, right? If so, 
>> shouldn't we do the asymmetric lock check only for the fast locked case? 
>> Otherwise Lockstack may be empty, so we compare the word preceding the first 
>> slot, which would cause us to always break into the slow case?
>> 
>> Sorry if I miss something here.
> 
> Uh, yes, indeed. It works by accident, I suppose, because we don't segfault 
> on the word preceding the lock-stack, and monitor-locking takes the slow case 
> in the interpreter anyway. But yeah, it's better to check for it.

> @rkennke Question about ZGC and LockStack::contains(): how does this work 
> with colored pointers? Don't we have to mask the color bits out somehow when 
> comparing? E.g. using `ZAddress::offset()` ?

We must ensure that the oops are in a valid state. I recently added code to 
::contains() to call start_processing() when it is called from a foreign 
thread. When a thread inspects its own lock-stack, we are always good, because 
the stack watermark is processed right after leaving a safepoint, before normal 
operations resume.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1490302961


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-30 Thread Roman Kennke
On Sat, 25 Mar 2023 08:55:25 GMT, Thomas Stuefe  wrote:

> I have another question about the asymmetric unlocking code in 
> `InterpreterMacroAssembler::unlock_object`.
> 
> We go through here for both fast-locked and fat OM locks, right? If so, 
> shouldn't we do the asymmetric lock check only for the fast locked case? 
> Otherwise Lockstack may be empty, so we compare the word preceding the first 
> slot, which would cause us to always break into the slow case?
> 
> Sorry if I miss something here.

Uh, yes, indeed. It works by accident, I suppose, because we don't segfault on 
the word preceding the lock-stack, and monitor-locking takes the slow case in 
the interpreter anyway. But yeah, it's better to check for it.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1490163377


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-29 Thread Roman Kennke
On Tue, 28 Mar 2023 16:00:34 GMT, Thomas Stuefe  wrote:

>> Please also verify against over- and underflow, and better than just null 
>> checks check that every oop really is an oop. I added this to my code:
>> 
>>   assert((_offset <= end_offset()), "lockstack overflow: _offset %d end_offset %d", _offset, end_offset());
>>   assert((_offset >= start_offset()), "lockstack underflow: _offset %d start_offset %d", _offset, start_offset());
>>   int end = to_index(_offset);
>>   for (int i = 0; i < end; i++) {
>>     assert(oopDesc::is_oop(_base[i]), "index %i: not an oop (" PTR_FORMAT ")", i, p2i(_base[i]));
>>   ...
>
> Just realized that my proposal of oop-checking does not work, since during 
> GC an oop can be moved and will temporarily be invalid.

Checking for is_oop() may not work, because oops may be temporarily invalid, 
until the GC gets to fix them. This is especially (and probably only) true 
around oops_do().

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1152089622


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Dean Long
On Tue, 28 Mar 2023 10:53:10 GMT, Roman Kennke  wrote:

>> src/hotspot/share/runtime/threads.cpp line 1422:
>> 
>>> 1420: }
>>> 1421: 
>>> 1422: JavaThread* Threads::owning_thread_from_object(ThreadsList * t_list, 
>>> oop obj) {
>> 
>> Is this thread-safe?
>
> My last commit changed that code to only run during safepoints. It should be 
> safe now, and I added an assert that verifies that it is only run at 
> safepoint.

I see the assert in `owning_thread_from_monitor` but not 
`owning_thread_from_object`.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150900864


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Thomas Stuefe
On Fri, 24 Mar 2023 16:54:59 GMT, Thomas Stuefe  wrote:

>> src/hotspot/share/runtime/lockStack.cpp line 42:
>> 
>>> 40: 
>>> 41: #ifndef PRODUCT
>>> 42: void LockStack::validate(const char* msg) const {
>> 
>> Would you also like to check there are no `nullptr` elements on stack here?
>
> Please also verify against over- and underflow, and better than just null 
> checks check that every oop really is an oop. I added this to my code:
> 
>   assert((_offset <= end_offset()), "lockstack overflow: _offset %d end_offset %d", _offset, end_offset());
>   assert((_offset >= start_offset()), "lockstack underflow: _offset %d start_offset %d", _offset, start_offset());
>   int end = to_index(_offset);
>   for (int i = 0; i < end; i++) {
>     assert(oopDesc::is_oop(_base[i]), "index %i: not an oop (" PTR_FORMAT ")", i, p2i(_base[i]));
>   ...

Just realized that my proposal of oop-checking does not work, since during GC 
an oop can be moved and will temporarily be invalid.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150847182


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Roman Kennke
On Fri, 24 Mar 2023 11:17:05 GMT, Aleksey Shipilev  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 611:
> 
>> 609:   bind(slow);
>> 610:   testptr(objReg, objReg); // force ZF=0 to indicate failure
>> 611:   jmp(NO_COUNT);
> 
> We set a flag on failure (`NO_COUNT`) path. Should we set the flag on success 
> (`COUNT`) path as well?

The path at COUNT already sets the ZF; there is no need to do it here. NO_COUNT 
doesn't clear ZF, and fast_lock_impl() may branch to slow with ZF set (on the 
overflow check), so we need to explicitly clear ZF. However, I just came up with 
a better way to check for overflow that readily clears the ZF: subtract 1 from 
the end-offset and do a greater-than comparison instead of the greater-or-equal 
that we currently do on the end-offset. That should simplify the code and avoid 
a branch.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 926:
> 
>> 924:   mov(boxReg, tmpReg);
>> 925:   fast_unlock_impl(objReg, boxReg, tmpReg, NO_COUNT);
>> 926:   jmp(COUNT);
> 
> Do we need to care about returning proper flags here?

Yes we do, but fast_unlock_impl() already does the right thing on the failure 
path, and COUNT does the right thing on the success path. Yes, it is all very 
ugly. *shrugs*

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150609171
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150610895


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Roman Kennke
On Fri, 24 Mar 2023 10:50:39 GMT, Aleksey Shipilev  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/cpu/aarch64/aarch64.ad line 3848:
> 
>> 3846: __ bind(slow);
>> 3847: __ tst(oop, oop); // Force ZF=0 to indicate failure and take 
>> slow-path. We know that oop != null.
>> 3848: __ b(no_count);
> 
> Is this a micro-optimization? I think we can simplify the code by just 
> setting the flags here and then jumping into the usual `__ b(cont)`. This 
> would make the move of `__ b(cont)` unnecessary below.

Well it avoids one conditional branch, so yeah I guess it's an optimization. If 
you don't like the b(cont) I can still move it back to where it was. It would 
be a dead instruction with fast-locking, though.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150585671


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Roman Kennke
On Fri, 24 Mar 2023 10:47:11 GMT, Aleksey Shipilev  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/cpu/aarch64/aarch64.ad line 3954:
> 
>> 3952: // Indicate success on completion.
>> 3953: __ cmp(oop, oop);
>> 3954: __ b(count);
> 
> `aarch64_enc_fast_lock` explicitly sets NE on failure path. But this code 
> just jumps to `no_count` without setting the flag. Does the code outside this 
> encoding block rely on flags?

The code outside this encoding block relies on flags, yes. This is very ugly. 
fast_unlock() jumps to no_count when the CAS fails, where the NE flag is set, 
therefore we don't need to set it again.

> src/hotspot/share/runtime/synchronizer.cpp line 923:
> 
>> 921: static bool is_lock_owned(Thread* thread, oop obj) {
>> 922:   assert(UseFastLocking, "only call this with fast-locking enabled");
>> 923:   return thread->is_Java_thread() ? 
>> reinterpret_cast(thread)->lock_stack().contains(obj) : false;
> 
> Here and later, should use `JavaThread::cast(thread)` instead of 
> `reinterpret_cast`? It also sometimes subsumes the asserts, as 
> `JT::cast` checks the type.

The problem is that the places where that helper function is used receive a 
Thread* and not a JavaThread* (FastHashCode() and inflate()), and changing 
those to accept JavaThread* percolates into various areas that I don't want to 
touch right now (e.g. finalizerService.cpp). That is the reason why that 
function exists to begin with. I'll do the changes that @shipilev suggested for 
the time being. We may want to investigate restricting the incoming type in the 
future.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150580134
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150576745


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Roman Kennke
On Fri, 24 Mar 2023 02:49:11 GMT, David Holmes  wrote:

> > Is anybody familiar with the academic literature on this topic? I am sure I 
> > am not the first person who has come up with this form of locking. Maybe 
> > we could use a name that refers to some academic paper?
> 
> Well not to diminish this in any way but all you are doing is moving the 
> lock-record from the stack frame (indexed from the markword) to a heap 
> allocated side-table (indexed via the thread itself). The "fast-locking" is 
> still the bit that use the markword to indicate the locked state, and that 
> hasn't changed. Encoding lock state in an object header has a number of names 
> in the literature, depending on whose scheme it was: IBM had ThinLocks; the 
> Sun Research VM (EVM) had meta-locks; etc. Hotspot doesn't really have a name 
> for its variation. And as I said you aren't changing that aspect but 
> modifying what data structure is used to access the lock-records.
> 
> So the property Jesper was looking for, IMO, may be something like 
> `UseHeapLockRecords` - though that can unfortunately be parsed as using 
> records for the HeapLock. :(
> 
> I think it was mentioned somewhere above that in the Java Object Monitor 
> prototyping work we avoided using these kinds of boolean flags by defining a 
> single "policy" flag that could take on different values for different 
> implementation schemes. These are simply numbered, so for example:
> 
> * policy 0: use existing/legacy locking with stack-based lock records
> * policy 1: use heavyweight locks (ie UseHeavyMonitors)
> * policy 2 use the new approach with heap-allocated lock-records

Well, I would argue that the current implementation puts the lock record in the 
object header (in the form of a pointer into the stack, which is elsewhere used 
to identify which thread an object is locked by) and only displaces the object 
header onto the stack-frame, whereas the new implementation puts the lock 
record onto the lock-stack, which is still part of the JavaThread structure 
(and not heap-allocated, although it used to be in an earlier incarnation), and 
leaves the object header alone. So the correct name would be 
+UseStackLockRecord to turn on the new impl (where the old one would be called 
header-lock-record, or something like that). I like that name, but I suspect 
that it might be confusing because the old impl has traditionally been called 
'stack-locking'.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1486716156


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-28 Thread Roman Kennke
On Fri, 24 Mar 2023 06:55:55 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/runtime/threads.cpp line 1422:
> 
>> 1420: }
>> 1421: 
>> 1422: JavaThread* Threads::owning_thread_from_object(ThreadsList * t_list, 
>> oop obj) {
> 
> Is this thread-safe?

My last commit changed that code to only run during safepoints. It should be 
safe now, and I added an assert that verifies that it is only run at safepoint.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150401225


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Dean Long
On Fri, 24 Mar 2023 06:15:31 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/oops/oop.cpp line 126:
> 
>> 124:   // Outside of a safepoint, the header could be changing (for example,
>> 125:   // another thread could be inflating a lock on this object).
>> 126:   if (ignore_mark_word || UseFastLocking) {
> 
> Not clear why UseFastLocking appears here instead of in the return expression 
> - especially given the comment above.

I have the same question. Based on the comment, I would expect 
`return !SafepointSynchronize::is_at_safepoint() || UseFastLocking;`

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149966288


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Erik Österlund
On Mon, 27 Mar 2023 18:47:31 GMT, Roman Kennke  wrote:

> > > > > > @rkennke Question about ZGC and LockStack::contains(): how does 
> > > > > > this work with colored pointers? Don't we have to mask the color 
> > > > > > bits out somehow when comparing? E.g. using `ZAddress::offset()` ?
> > > > >
> > > > > That would be a question for @fisk and/or @stefank. AFAIK, the color 
> > > > > bits should be masked by ZGC barriers _before_ the oops enter the 
> > > > > synchronization subsystem. But I kinda suspect that we are somehow 
> > > > > triggering a ZGC bug here. Maybe we require barriers when reading 
> > > > > oops from the lock-stack too?
> > > >
> > > > Oops that are processed in Thread::oops_do should not have load 
> > > > barriers. Other oops should have load barriers.
> > >
> > > Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() 
> > > which is called from Thread::oops_do(). But help me here: I believe ZGC 
> > > processes this stuff concurrently, right? So there might be a window 
> > > where the lock-stack oops would be unprocessed. The lock-stack would not 
> > > go under the stack-watermark machinery. And if some code (like JVMTI 
> > > deadlock detection pause) inspects the lockstack, it might see invalid 
> > > oops? Is that a plausible scenario, or am I missing something?
> >
> > The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints 
> > call start_processing on all threads in safepoint cleanup for non-GC 
> > safepoints. That means the lock stack oops should have been processed when 
> > the deadlock detection logic runs in a safepoint.
>
> There appears to be a single code-path that inspects the lock-stack (and also 
> the usual stack under non-fast-locking) outside of safepoints:
>
> V  [libjvm.so+0x180abd4]  Threads::owning_thread_from_monitor(ThreadsList*, ObjectMonitor*)+0x54  (threads.cpp:1433)
> V  [libjvm.so+0x17a4bfc]  ObjectSynchronizer::get_lock_owner(ThreadsList*, Handle)+0x9c  (synchronizer.cpp:1109)
> V  [libjvm.so+0x1802db0]  ThreadSnapshot::initialize(ThreadsList*, JavaThread*)+0x270  (threadService.cpp:942)
> V  [libjvm.so+0x1803244]  ThreadDumpResult::add_thread_snapshot(JavaThread*)+0x5c  (threadService.cpp:567)
> V  [libjvm.so+0x12a0f64]  jmm_GetThreadInfo+0x480  (management.cpp:1136)
> j  sun.management.ThreadImpl.getThreadInfo1([JI[Ljava/lang/management/ThreadInfo;)V+0  java.management@21-internal
>
> Curiously, this seems to be in JMX code, which is also roughly where the 
> failure happens. I came across this code a couple of times and couldn't 
> really tell if it is safe to do that outside of a safepoint. In doubt I have 
> to assume it is not, and maybe this is the source of the failure? WDYT?

Could be. When not running a handshake or safepoint, you need to call 
start_processing manually on the target thread, which will ensure the oops are 
fixed until the next safepoint poll.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485752396


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Roman Kennke
On Mon, 27 Mar 2023 17:30:03 GMT, Roman Kennke  wrote:

>>> @rkennke Question about ZGC and LockStack::contains(): how does this work 
>>> with colored pointers? Don't we have to mask the color bits out somehow 
>>> when comparing? E.g. using `ZAddress::offset()` ?
>> 
>> That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
>> should be masked by ZGC barriers *before* the oops enter the synchronization 
>> subsystem. But I kinda suspect that we are somehow triggering a ZGC bug 
>> here. Maybe we require barriers when reading oops from the lock-stack too?
>
>> > > @rkennke Question about ZGC and LockStack::contains(): how does this 
>> > > work with colored pointers? Don't we have to mask the color bits out 
>> > > somehow when comparing? E.g. using `ZAddress::offset()` ?
>> > 
>> > 
>> > That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
>> > should be masked by ZGC barriers _before_ the oops enter the 
>> > synchronization subsystem. But I kinda suspect that we are somehow 
>> > triggering a ZGC bug here. Maybe we require barriers when reading oops 
>> > from the lock-stack too?
>> 
>> Oops that are processed in Thread::oops_do should not have load barriers. 
>> Other oops should have load barriers.
> 
> Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which 
> is called from Thread::oops_do(). But help me here: I believe ZGC processes 
> this stuff concurrently, right? So there might be a window where the 
> lock-stack oops would be unprocessed. The lock-stack would not go under the 
> stack-watermark machinery. And if some code (like JVMTI deadlock detection 
> pause) inspects the lockstack, it might see invalid oops? Is that a plausible 
> scenario, or am I missing something?

> > > > > @rkennke Question about ZGC and LockStack::contains(): how does this 
> > > > > work with colored pointers? Don't we have to mask the color bits out 
> > > > > somehow when comparing? E.g. using `ZAddress::offset()` ?
> > > >
> > > > That would be a question for @fisk and/or @stefank. AFAIK, the color 
> > > > bits should be masked by ZGC barriers _before_ the oops enter the 
> > > > synchronization subsystem. But I kinda suspect that we are somehow 
> > > > triggering a ZGC bug here. Maybe we require barriers when reading oops 
> > > > from the lock-stack too?
> > >
> > > Oops that are processed in Thread::oops_do should not have load barriers. 
> > > Other oops should have load barriers.
> >
> > Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() 
> > which is called from Thread::oops_do(). But help me here: I believe ZGC 
> > processes this stuff concurrently, right? So there might be a window where 
> > the lock-stack oops would be unprocessed. The lock-stack would not go under 
> > the stack-watermark machinery. And if some code (like JVMTI deadlock 
> > detection pause) inspects the lockstack, it might see invalid oops? Is that 
> > a plausible scenario, or am I missing something?
>
> The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints call 
> start_processing on all threads in safepoint cleanup for non-GC safepoints. 
> That means the lock stack oops should have been processed when the deadlock 
> detection logic runs in a safepoint.

There appears to be a single code-path that inspects the lock-stack (and also 
the usual stack under non-fast-locking) outside of safepoints:

V  [libjvm.so+0x180abd4]  Threads::owning_thread_from_monitor(ThreadsList*, 
ObjectMonitor*)+0x54  (threads.cpp:1433)
V  [libjvm.so+0x17a4bfc]  ObjectSynchronizer::get_lock_owner(ThreadsList*, 
Handle)+0x9c  (synchronizer.cpp:1109)
V  [libjvm.so+0x1802db0]  ThreadSnapshot::initialize(ThreadsList*, 
JavaThread*)+0x270  (threadService.cpp:942)
V  [libjvm.so+0x1803244]  
ThreadDumpResult::add_thread_snapshot(JavaThread*)+0x5c  (threadService.cpp:567)
V  [libjvm.so+0x12a0f64]  jmm_GetThreadInfo+0x480  (management.cpp:1136)
j  
sun.management.ThreadImpl.getThreadInfo1([JI[Ljava/lang/management/ThreadInfo;)V+0
 java.management@21-internal

Curiously, this seems to be in JMX code, which is also roughly where the 
failure happens. I came across this code a couple of times and couldn't really 
tell if it is safe to do that outside of a safepoint. In doubt I have to assume 
it is not, and maybe this is the source of the failure? WDYT?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485692007


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Roman Kennke
On Fri, 24 Mar 2023 06:39:18 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/runtime/synchronizer.cpp line 516:
> 
>> 514:   // No room on the lock_stack so fall-through to inflate-enter.
>> 515: } else {
>> 516:   markWord mark = obj->mark();
> 
> why is it `mark` here but `header` above?

Oh, I don't know. We are very inconsistent in our nomenclature here and use 
mark in some places and header in others (e.g. OM::set_header() or the 
displaced_header() methods). The name mark is not really fitting and only still 
exists for historical reasons, I believe (from when one of the primary 
functions of the object header was to indicate GC marking). However, the 
central data type is still called markWord, so I changed my new code paths to 
use mark as well. This probably warrants a round of codebase cleanup and 
consolidation later.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149630607


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Erik Österlund
On Mon, 27 Mar 2023 17:30:03 GMT, Roman Kennke  wrote:

> > > > @rkennke Question about ZGC and LockStack::contains(): how does this 
> > > > work with colored pointers? Don't we have to mask the color bits out 
> > > > somehow when comparing? E.g. using `ZAddress::offset()` ?
> > >
> > > That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
> > > should be masked by ZGC barriers _before_ the oops enter the 
> > > synchronization subsystem. But I kinda suspect that we are somehow 
> > > triggering a ZGC bug here. Maybe we require barriers when reading oops 
> > > from the lock-stack too?
> >
> > Oops that are processed in Thread::oops_do should not have load barriers. 
> > Other oops should have load barriers.
>
> Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which 
> is called from Thread::oops_do(). But help me here: I believe ZGC processes 
> this stuff concurrently, right? So there might be a window where the 
> lock-stack oops would be unprocessed. The lock-stack would not go under the 
> stack-watermark machinery. And if some code (like JVMTI deadlock detection 
> pause) inspects the lockstack, it might see invalid oops? Is that a plausible 
> scenario, or am I missing something?

The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints call 
start_processing on all threads in safepoint cleanup for non-GC safepoints. 
That means the lock stack oops should have been processed when the deadlock 
detection logic runs in a safepoint.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485592271


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Roman Kennke
On Fri, 24 Mar 2023 06:16:14 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/runtime/arguments.cpp line 1997:
> 
>> 1995:   }
>> 1996:   if (UseHeavyMonitors) {
>> 1997: FLAG_SET_DEFAULT(UseFastLocking, false);
> 
> Probably should be an error if both are set on the command-line.

I am not sure. UseFastLocking (or whatever we will call it in the near future) 
chooses between traditional and new stack-locking. We disable both with 
+UseHeavyMonitors. Also, +UseFastLocking is not really meant to be used by 
users; at least, my intention is mainly to turn it on programmatically when 
compact headers (Lilliput) are turned on. When also running with 
+UseHeavyMonitors, it doesn't really matter which stack-locking impl is 
selected; neither would actually be used. I'd rather remove the explicit 
turning-off of UseFastLocking under +UseHeavyMonitors.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149623912


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Erik Österlund
On Mon, 27 Mar 2023 15:53:47 GMT, Roman Kennke  wrote:

> > @rkennke Question about ZGC and LockStack::contains(): how does this work 
> > with colored pointers? Don't we have to mask the color bits out somehow 
> > when comparing? E.g. using `ZAddress::offset()` ?
> 
> 
> 
> That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
> should be masked by ZGC barriers *before* the oops enter the synchronization 
> subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. 
> Maybe we require barriers when reading oops from the lock-stack too?

Oops that are processed in Thread::oops_do should not have load barriers. Other 
oops should have load barriers.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485426029


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Roman Kennke
On Mon, 27 Mar 2023 15:53:47 GMT, Roman Kennke  wrote:

>> Is anybody familiar with the academic literature on this topic? I am sure I 
>> am not the first person who has come up with this form of locking. Maybe 
>> we could use a name that refers to some academic paper?
>
>> @rkennke Question about ZGC and LockStack::contains(): how does this work 
>> with colored pointers? Don't we have to mask the color bits out somehow when 
>> comparing? E.g. using `ZAddress::offset()` ?
> 
> That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
> should be masked by ZGC barriers *before* the oops enter the synchronization 
> subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. 
> Maybe we require barriers when reading oops from the lock-stack too?

> > > @rkennke Question about ZGC and LockStack::contains(): how does this work 
> > > with colored pointers? Don't we have to mask the color bits out somehow 
> > > when comparing? E.g. using `ZAddress::offset()` ?
> > 
> > 
> > That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
> > should be masked by ZGC barriers _before_ the oops enter the 
> > synchronization subsystem. But I kinda suspect that we are somehow 
> > triggering a ZGC bug here. Maybe we require barriers when reading oops from 
> > the lock-stack too?
> 
> Oops that are processed in Thread::oops_do should not have load barriers. 
> Other oops should have load barriers.

Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which 
is called from Thread::oops_do(). But help me here: I believe ZGC processes 
this stuff concurrently, right? So there might be a window where the lock-stack 
oops would be unprocessed. The lock-stack would not go under the 
stack-watermark machinery. And if some code (like JVMTI deadlock detection 
pause) inspects the lockstack, it might see invalid oops? Is that a plausible 
scenario, or am I missing something?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485550661


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Roman Kennke
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> Is anybody familiar with the academic literature on this topic? I am sure I 
> am not the first person who has come up with this form of locking. Maybe we 
> could use a name that refers to some academic paper?

> @rkennke Question about ZGC and LockStack::contains(): how does this work 
> with colored pointers? Don't we have to mask the color bits out somehow when 
> comparing? E.g. using `ZAddress::offset()` ?

That would be a question for @fisk and/or @stefank. AFAIK, the color bits 
should be masked by ZGC barriers *before* the oops enter the synchronization 
subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. 
Maybe we require barriers when reading oops from the lock-stack too?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485390285


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Thomas Stuefe
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> Is anybody familiar with the academic literature on this topic? I am sure I 
> am not the first person who has come up with this form of locking. Maybe we 
> could use a name that refers to some academic paper?

@rkennke Question about ZGC and LockStack::contains(): how does this work with 
colored pointers? Don't we have to mask the color bits out somehow when 
comparing? E.g. using `ZAddress::offset()` ?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485293012


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-27 Thread Thomas Stuefe
On Mon, 27 Mar 2023 02:46:31 GMT, David Holmes  wrote:

> > AFAIU, these places could now get outdated information from the contains 
> > function: either a false positive or a false negative when asking about a 
> > given object. But if I understand correctly, this had been the case before, 
> > too, since the information about ownership may be outdated as soon as one 
> > reads the BasicLock* address from the markword. So these APIs only report a 
> > state that once was, never the current state.
> 
> Sure but my concern is not stale data, it is whether the lock-stack code may 
> crash (or otherwise misbehave) if there are concurrent changes to it whilst 
> it is being queried.

The LockStack is a fixed-size array of oops plus a ptr-sized member indicating 
the current offset, embedded into Thread. Modifying the LockStack (push, pop, 
remove) is non-atomic: write the element, then update the current offset. The 
contains() function - the only function also called from threads other than the 
current one (and we should assert that) - reads the current offset, then 
iterates the array up to that offset, comparing each oop with the given one. 
The only way I see this crashing is if we read the current offset in a 
half-written state; I am not sure whether any of our platforms reads/writes a 
pointer-sized field non-atomically. All other misreads would "just" result in a 
time window in which contains() gives the wrong answer.
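The structure described here can be sketched as a toy model (not HotSpot code; 
names and capacity are illustrative). It uses an atomic offset so that a 
foreign thread's contains() may observe a stale snapshot, but never a torn one:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an oop; the real type is HotSpot's oop.
using oop = uintptr_t;

static const int CAPACITY = 8;

struct LockStack {
  // Atomic offset: a foreign reader never sees a half-written value.
  std::atomic<int> _top{0};
  oop _base[CAPACITY];

  // push/pop run only on the owning thread.
  void push(oop o) {
    int t = _top.load(std::memory_order_relaxed);
    _base[t] = o;                                  // write the element first...
    _top.store(t + 1, std::memory_order_release);  // ...then publish the offset
  }
  oop pop() {
    int t = _top.load(std::memory_order_relaxed) - 1;
    oop o = _base[t];
    _top.store(t, std::memory_order_release);
    return o;
  }
  // May be called from a foreign thread; tolerates stale (but not torn) reads.
  bool contains(oop o) const {
    int t = _top.load(std::memory_order_acquire);
    for (int i = 0; i < t; i++) {
      if (_base[i] == o) return true;
    }
    return false;
  }
};
```

A concurrent contains() can still race with a push/pop and return an answer 
that is immediately outdated, matching the "stale but safe" behavior discussed 
in this thread.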

So we should read the offset atomically. Alternatively, have a second 
contains(), e.g. `contains_full()`, just for the sake of concurrent iteration, 
and let that one iterate the whole stack including unused slots. The iteration 
limits would be hardcoded offsets from Thread*, with no need to read the 
offset. The obvious disadvantage would be that we'd need to mark/zero out 
popped slots in release builds.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1484583227


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-26 Thread David Holmes
On Fri, 24 Mar 2023 17:11:46 GMT, Thomas Stuefe  wrote:

> AFAIU, these places now could get outdated information from the contains 
> function: either getting a false positive or a false negative when asking for 
> a given object. But if I understand correctly, this had been the case before, 
> too. Since the information about ownership may be outdated as soon as one 
> read the BasicLock* address from the markwork. So these APIs do only report a 
> state that once had been, but never the current state.

Sure but my concern is not stale data, it is whether the lock-stack code may 
crash (or otherwise misbehave) if there are concurrent changes to it whilst it 
is being queried.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1484401692


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-25 Thread Thomas Stuefe
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> Is anybody familiar with the academic literature on this topic? I am sure I 
> am not the first person who has come up with this form of locking. Maybe we 
> could use a name that refers to some academic paper?

@rkennke Would you mind integrating these changes to LockStack: 
https://github.com/tstuefe/jdk/tree/10907%2Blockstack-veris

Mainly lockstack wiping unused slots on pop, more verifications and better 
output.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1483864128


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-25 Thread Thomas Stuefe
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation 
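The fast-path described above (CAS the low two mark-word bits from 01, 
unlocked, to 00, fast-locked) can be modeled in a few lines. This is a sketch 
under assumed constants; the names and bit values are illustrative, not 
HotSpot's actual mark-word encoding:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical mark-word model: the low two bits encode the lock state.
using markWord = uintptr_t;
static const markWord kLockMask = 0b11;
static const markWord kUnlocked = 0b01;  // neutral, unlocked
static const markWord kFastLock = 0b00;  // fast-locked

// Attempt the fast-lock fast path. On success the caller would push the
// object onto its thread-local lock-stack; on failure it falls back to the
// runtime slow path (inflation to a full monitor).
bool try_fast_lock(std::atomic<markWord>& mark) {
  markWord old_mark = mark.load(std::memory_order_relaxed);
  if ((old_mark & kLockMask) != kUnlocked) {
    return false;  // already fast-locked, or a monitor is installed
  }
  markWord new_mark = (old_mark & ~kLockMask) | kFastLock;
  return mark.compare_exchange_strong(old_mark, new_mark,
                                      std::memory_order_acquire);
}
```

Note how, unlike stack-locking, the upper bits of the mark word are left 
untouched: ownership is recorded on the lock-stack, not in the header.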

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread Thomas Stuefe
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> [Full PR description quoted here; identical to the quote above, elided.]

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread Thomas Stuefe
On Fri, 24 Mar 2023 11:36:24 GMT, Aleksey Shipilev  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/runtime/lockStack.cpp line 42:
> 
>> 40: 
>> 41: #ifndef PRODUCT
>> 42: void LockStack::validate(const char* msg) const {
> 
> Would you also like to check there are no `nullptr` elements on stack here?

Please also verify against over- and underflow, and - better than just null 
checks - verify that every oop really is an oop. I added this to my code:

  assert(_offset <= end_offset(), "lockstack overflow: _offset %d end_offset %d", _offset, end_offset());
  assert(_offset >= start_offset(), "lockstack underflow: _offset %d start_offset %d", _offset, start_offset());
  int end = to_index(_offset);
  for (int i = 0; i < end; i++) {
    assert(oopDesc::is_oop(_base[i]), "index %d: not an oop (" PTR_FORMAT ")", i, p2i(_base[i]));
  }
  ...

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147841666


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread David Holmes
On Fri, 24 Mar 2023 07:00:35 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
>> The lock-stack is grown when needed.
> 
> Could you update the description of the PR with the latest approach please - 
> others are unlikely to read all the comments to realize this has changed.

> @dholmes-ora
> 
> > Is this thread-safe?
> 
> I don't think so, but would the stacklock variant 
> owning_thread_from_monitor_owner not suffer from the same problem?

@tstuefe Yes but that code has already had its thread-safety properties 
determined (presumably) long ago. Checking whether an address is within a 
thread's stack is pretty thread-safe. The new code needs to ensure that 
iteration with `contains` is safe in the face of a concurrent push/pop/remove.

Or it may be that these functions are only called at a safepoint? If so, there 
should be an assert for that; since there isn't, I presume that is not the case.

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482707476


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread David Holmes
On Fri, 24 Mar 2023 10:10:58 GMT, Aleksey Shipilev  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/share/runtime/synchronizer.cpp line 923:
> 
>> 921: static bool is_lock_owned(Thread* thread, oop obj) {
>> 922:   assert(UseFastLocking, "only call this with fast-locking enabled");
>> 923:   return thread->is_Java_thread() ? 
>> reinterpret_cast<JavaThread*>(thread)->lock_stack().contains(obj) : false;
> 
> Here and later, should use `JavaThread::cast(thread)` instead of 
> `reinterpret_cast<JavaThread*>`? It also sometimes subsumes the asserts, as 
> `JT::cast` checks the type.

Only JavaThreads can own monitors, so this function should take a JavaThread in 
the first place.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147499577


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread Aleksey Shipilev
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> [Full PR description quoted here; identical to the quote above, elided.]

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread Thomas Stuefe
On Fri, 24 Mar 2023 07:00:35 GMT, David Holmes  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
>> The lock-stack is grown when needed.
> 
> Could you update the description of the PR with the latest approach please - 
> others are unlikely to read all the comments to realize this has changed.

@dholmes-ora 
> Is this thread-safe?

I don't think so, but would the stacklock variant 
owning_thread_from_monitor_owner not suffer from the same problem?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482386684


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread David Holmes
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> [Full PR description quoted here; identical to the quote above, elided.]

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread David Holmes
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid possibly many such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing the lock over to the 
>> contending thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc. as such is OK, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode can now always be done in the fast path, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation 
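The scheme described in the quoted text can be sketched in a few lines. The model below is a minimal, hypothetical illustration, not HotSpot's actual code: the tag values, `Obj`, `LockStack`, `try_fast_lock` and `try_fast_unlock` are all invented names, and real mark words carry more state (hash, age, monitor pointers).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative mark-word tags: the low two bits 01 mean "unlocked",
// 00 mean "fast-locked", as in the description above.
static constexpr uintptr_t UNLOCKED    = 0b01;
static constexpr uintptr_t FAST_LOCKED = 0b00;

struct Obj {
  std::atomic<uintptr_t> mark{UNLOCKED};
};

// Thread-local lock-stack: a small array of owned objects. The common
// question "does the current thread own me?" is a quick linear scan.
struct LockStack {
  std::vector<Obj*> oops;  // typically stays very small (3-5 elements)

  bool contains(Obj* o) const {
    for (Obj* e : oops) {
      if (e == o) return true;
    }
    return false;
  }
};

thread_local LockStack lock_stack;

// Uncontended fast path: CAS the tag bits 01 -> 00 and push the object
// onto the current thread's lock-stack. On failure the runtime would
// inflate to a full monitor (not modeled here).
bool try_fast_lock(Obj* o) {
  uintptr_t expected = UNLOCKED;
  if (o->mark.compare_exchange_strong(expected, FAST_LOCKED)) {
    lock_stack.oops.push_back(o);
    return true;
  }
  return false;  // already fast-locked or inflated: slow path
}

// Fast unlock: CAS the tag bits 00 -> 01 and pop the lock-stack.
// Assumes balanced (LIFO) locking, which javac-generated code guarantees.
bool try_fast_unlock(Obj* o) {
  if (!lock_stack.contains(o)) return false;  // not owned by this thread
  uintptr_t expected = FAST_LOCKED;
  if (o->mark.compare_exchange_strong(expected, UNLOCKED)) {
    lock_stack.oops.pop_back();
    return true;
  }
  return false;  // header changed (e.g. inflated meanwhile): slow path
}
```

Note how a second `try_fast_lock` on the same object fails in this sketch: the mark is already 00, so the CAS from 01 cannot succeed, which mirrors the lack of recursive fast-locking discussed above.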

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-24 Thread David Holmes
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread David Holmes
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke  wrote:

> Is anybody familiar with the academic literature on this topic? I am sure I 
> am not the first person which has come up with this form of locking. Maybe we 
> could use a name that refers to some academic paper?

Well not to diminish this in any way but all you are doing is moving the 
lock-record from the stack frame (indexed from the markword) to a heap 
allocated side-table (indexed via the thread itself). The "fast-locking" is 
still the bit that uses the markword to indicate the locked state, and that 
hasn't changed. Encoding lock state in an object header has a number of names 
in the literature, depending on whose scheme it was: IBM had ThinLocks; the Sun 
Research VM (EVM) had meta-locks; etc. Hotspot doesn't really have a name for 
its variation. And as I said you aren't changing that aspect but modifying what 
data structure is used to access the lock-records.

So the property Jesper was looking for, IMO, may be something like 
`UseHeapLockRecords` - though that can unfortunately be parsed as using records 
for the HeapLock. :(

I think it was mentioned somewhere above that in the Java Object Monitor 
prototyping work we avoided using these kinds of boolean flags by defining a 
single "policy" flag that could take on different values for different 
implementation schemes. These are simply numbered, so for example:
- policy 0: use existing/legacy locking with stack-based lock records
- policy 1: use heavyweight locks (i.e. UseHeavyMonitors)
- policy 2: use the new approach with heap-allocated lock-records
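The numbered-policy idea above could be modeled as a single enum instead of a growing set of boolean flags. This is only an illustrative sketch; the name `LockingMode` and the `describe` helper are invented for the example, not an actual HotSpot flag or API.

```cpp
#include <cassert>
#include <cstring>

// One numbered "policy" flag covering the three schemes listed above,
// instead of separate booleans like UseHeavyMonitors.
enum class LockingMode : int {
  LegacyStackLocking = 0,  // policy 0: legacy stack-based lock records
  HeavyMonitors      = 1,  // policy 1: always use full ObjectMonitors
  HeapLockRecords    = 2   // policy 2: heap-allocated lock-records (lock-stack)
};

// Code that cares about the scheme dispatches once on the policy value.
const char* describe(LockingMode mode) {
  switch (mode) {
    case LockingMode::LegacyStackLocking: return "stack-locking";
    case LockingMode::HeavyMonitors:      return "heavy monitors";
    case LockingMode::HeapLockRecords:    return "fast-locking with lock-stack";
  }
  return "unknown";
}
```

A new scheme then only needs a new enumerator rather than another boolean that can conflict with the existing ones.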

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482176363


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Roman Kennke
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thereby handing it over to the contending 
>> thread.
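[Editor's note: the ANONYMOUS_OWNER handover can be sketched roughly as below. This is an illustrative sketch with assumed names, not the HotSpot ObjectMonitor:]

```cpp
#include <atomic>
#include <cassert>

// Illustrative sketch of the ANONYMOUS_OWNER handover -- not the HotSpot
// ObjectMonitor. The contending thread inflates a fast-locked object but
// cannot tell which thread holds the fast-lock, so it records a marker;
// the real owner fixes the field when it reaches monitorexit.
static void* const ANONYMOUS_OWNER = reinterpret_cast<void*>(1);

struct Monitor {
  std::atomic<void*> owner{nullptr};

  // Contending thread: claim the monitor on behalf of the unknown owner.
  void inflate_from_fast_lock() { owner.store(ANONYMOUS_OWNER); }

  // Owning thread at monitorexit: observing ANONYMOUS_OWNER, it knows the
  // owner must be itself, fixes the field, then exits properly.
  void exit(void* self) {
    void* expected = ANONYMOUS_OWNER;
    owner.compare_exchange_strong(expected, self);  // fix up ownership
    assert(owner.load() == self);                   // now a normal exit
    owner.store(nullptr);                           // hand over the lock
  }
};
```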
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation 

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Roman Kennke
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Daniel D. Daugherty
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Roman Kennke
On Thu, 23 Mar 2023 13:25:52 GMT, Jesper Wilhelmsson  
wrote:

> UseNewLocks... Surely there must be a better name? For how long will these be 
> the new locks? Do we rename the flag to UseOldLocks when the next locking 
> scheme comes along? There must be some property that differentiates these 
> locks from the older locks other than being new. Why not name the flag after 
> that property?

The main difference is that this new approach doesn't overload the object 
header with a stack-pointer in a racy way. UseNonRacyLocking?

-

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1481450601


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-23 Thread Jesper Wilhelmsson
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-22 Thread Vladimir Kozlov
On Wed, 22 Mar 2023 09:46:04 GMT, Roman Kennke  wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 670:
>> 
>>> 668:   get_thread (scrReg);// beware: clobbers ICCs
>>> 669:   movptr(Address(boxReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), 
>>> scrReg);
>>> 670:   xorptr(boxReg, boxReg); // set icc.ZFlag = 1 to 
>>> indicate success
>> 
>> Should this be under `if (UseFastLocking)`?
>
> I don't think so, unless we also want to change all the stuff in x86_32.ad to 
> not fetch the thread before calling into fast_unlock(). However, I think it 
> is a nice and useful change. I could break it out of this PR and get it 
> reviewed separately, it is a side-effect of the new locking impl insofar as 
> we always require a thread register, and allocate it before going into 
> fast_lock().

I missed that it is under `#ifndef LP64`. Yes, it makes sense since you are now 
passing `thread` in a register.
And why do we need `get_thread()` at line 708 if we already have it?

It is still hard to follow this 32-bit code: what each register holds, what the 
value `3` means, and why we don't have checks similar to the LP64 code after the CAS.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1145223128


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-22 Thread Vladimir Kozlov
On Wed, 22 Mar 2023 09:47:47 GMT, Roman Kennke  wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 791:
>> 
>>> 789:   Compile::current()->output()->add_stub(stub);
>>> 790:   jcc(Assembler::notEqual, stub->entry());
>>> 791:   bind(stub->continuation());
>> 
>> Why use stub here and not inline the code? Because the branch mostly not 
>> taken?
>
> Yes, the branch is mostly not taken. If we inline the code, then we would 
> have to take a forward branch on the very common path to skip over the (rare) 
> part that handles ANON monitor owner. This would throw off static branch 
> prediction and is discouraged by the Intel optimization guide.

okay

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1145202521


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-22 Thread Roman Kennke
On Wed, 22 Mar 2023 00:25:43 GMT, Vladimir Kozlov  wrote:

>> Roman Kennke has updated the pull request incrementally with two additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 670:
> 
>> 668:   get_thread (scrReg);// beware: clobbers ICCs
>> 669:   movptr(Address(boxReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), 
>> scrReg);
>> 670:   xorptr(boxReg, boxReg); // set icc.ZFlag = 1 to 
>> indicate success
> 
> Should this be under `if (UseFastLocking)`?

I don't think so, unless we also want to change all the stuff in x86_32.ad to 
not fetch the thread before calling into fast_unlock(). However, I think it is 
a nice and useful change. I could break it out of this PR and get it reviewed 
separately, it is a side-effect of the new locking impl insofar as we always 
require a thread register, and allocate it before going into fast_lock().

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 791:
> 
>> 789:   Compile::current()->output()->add_stub(stub);
>> 790:   jcc(Assembler::notEqual, stub->entry());
>> 791:   bind(stub->continuation());
> 
> Why use stub here and not inline the code? Because the branch mostly not 
> taken?

Yes, the branch is mostly not taken. If we inline the code, then we would have 
to take a forward branch on the very common path to skip over the (rare) part 
that handles ANON monitor owner. This would throw off static branch prediction 
and is discouraged by the Intel optimization guide.
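[Editor's note: the same idea expressed in C++ is to keep the rare path in a cold, out-of-line function so the common path is a fall-through, not-taken branch. A hedged sketch using GCC/Clang builtins; the names and the return value `42` are hypothetical:]

```cpp
#include <cassert>

// Illustrative only: the rare ANON-owner fix-up lives in a cold,
// out-of-line function, so the common unlock path falls through with a
// not-taken forward branch. Names and values here are hypothetical.
__attribute__((noinline, cold))
static int handle_anonymous_owner(int owner) {
  (void)owner;
  return 42;  // stand-in for the rare fix-up work
}

static int unlock(int owner, int self) {
  if (__builtin_expect(owner != self, 0)) {  // rare: branch to cold stub
    return handle_anonymous_owner(owner);
  }
  return self;  // common path: no taken branch
}
```

[This mirrors what the C2 stub achieves: static branch predictors assume forward branches are not taken, so moving the rare case out of line avoids a mispredicted forward jump on the hot path.]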

-

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1144501909
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1144504528


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-21 Thread Vladimir Kozlov
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:


Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-16 Thread Daniel D. Daugherty
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change makes it possible to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now always be done in the fast path, once the 
>> hashcode has been installed. Fast-locked headers can be used directly; for 
>> monitor-locked objects we can easily reach through to the displaced header. 
>> This is safe because Java threads participate in the monitor deflation 
>> protocol. This would be implemented in a separate PR.

Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]

2023-03-16 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots) of such checks at locking sites.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, thus handing the lock over to the 
> contending thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change makes it possible to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now always be done in the fast path, once the 
> hashcode has been installed. Fast-locked headers can be used directly; for 
> monitor-locked objects we can easily reach through to the displaced header. 
> This is safe because Java threads participate in the monitor deflation 
> protocol. This would be implemented in a separate PR.
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier2