Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-17 Thread Thomas Stuefe
On Fri, 17 Mar 2023 06:33:43 GMT, Roman Kennke  wrote:

>>> In my last changes I made a stupid mistake and don't set the condition 
>>> flags correctly to force the slow-path, on aarch64. This is only relevant 
>>> when we exceed the lock-stack capacity, that is why it's failing so rarely. 
>>> I don't see a similar problem on x86_64 - have we observed any failures on 
>>> x86_64? I pushed a fix for aarch64.
>> 
>> I noticed this too for arm; I used cmp to clear EQ but using tst seems 
>> better. I also do it inside fast_lock, to give it a defined exit state wrt 
>> EQ|NE, since it saves me from having to think about this on every call site. 
>> But at least the fail case may be fiddly without conditional execution.
>
>> > In my last changes I made a stupid mistake and don't set the condition 
>> > flags correctly to force the slow-path, on aarch64. This is only relevant 
>> > when we exceed the lock-stack capacity, that is why it's failing so 
>> > rarely. I don't see a similar problem on x86_64 - have we observed any 
>> > failures on x86_64? I pushed a fix for aarch64.
>> 
>> 
>> 
>> I noticed this too for arm; I used cmp to clear EQ but using tst seems 
>> better. I also do it inside fast_lock, to give it a defined exit state wrt 
>> EQ|NE, since it saves me from having to think about this on every call site. 
>> But at least the fail case may be fiddly without conditional execution.
> 
> Cmp(r,r) would not clear EQ, but set it. Unless you do cmp(r,0) on a non-null 
> register. Tst is better at least on x86 because it encodes smaller. *shrugs*
> 
> You can do it in the shared fast_lock() but it's really only needed in C2, 
> that's why I'm doing it there. Maybe I'm too perfectionist when it comes to 
> assembly code?

@rkennke I was not able to directly use 
'JavaThread::lock_stack_offset_offset()' in cmp since it was not encodable as 
immediate. You did not hit the same problem on aarch64, right? IIUC that was 
more out of accident, since you should have similar or the same (not sure) 
restrictions for encoding immediates. But your Thread layout is probably 
different and the offset may just happened to be encodable. If so, that would 
make you vulnerable against changes in Thread that change the offset of the 
LockStack.

Anyway, for now I solved this by using the second scratch register as 
intermediate. One more instruction though. 

I am now experimenting with my original idea of placing the Lockstack slots at 
a known aligned offset and then testing the alignment of the current 
index/pointer. This should be possible with a simple TST. Lets see how this 
goes.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Thomas Stuefe
On Fri, 17 Mar 2023 06:33:43 GMT, Roman Kennke  wrote:

> 



> > > In my last changes I made a stupid mistake and don't set the condition 
> > > flags correctly to force the slow-path, on aarch64. This is only relevant 
> > > when we exceed the lock-stack capacity, that is why it's failing so 
> > > rarely. I don't see a similar problem on x86_64 - have we observed any 
> > > failures on x86_64? I pushed a fix for aarch64.
> > 
> > 
> > I noticed this too for arm; I used cmp to clear EQ but using tst seems 
> > better. I also do it inside fast_lock, to give it a defined exit state wrt 
> > EQ|NE, since it saves me from having to think about this on every call 
> > site. But at least the fail case may be fiddly without conditional 
> > execution.
> 
> Cmp(r,r) would not clear EQ, but set it. Unless you do cmp(r,0) on a non-null 
> register.

Sure. I used cmp with an immediate that I knew was not the value. Clunky, I 
know. As I wrote, tst seems better.

> Tst is better at least on x86 because it encodes smaller. _shrugs_
> 
> You can do it in the shared fast_lock() but it's really only needed in C2, 
> that's why I'm doing it there. Maybe I'm too perfectionist when it comes to 
> assembly code?

I felt just better having it there, at least for the start. I may still move it 
outside to C2. Lets see.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Roman Kennke
On Fri, 17 Mar 2023 06:15:28 GMT, Thomas Stuefe  wrote:

> > In my last changes I made a stupid mistake and don't set the condition 
> > flags correctly to force the slow-path, on aarch64. This is only relevant 
> > when we exceed the lock-stack capacity, that is why it's failing so rarely. 
> > I don't see a similar problem on x86_64 - have we observed any failures on 
> > x86_64? I pushed a fix for aarch64.
> 
> 
> 
> I noticed this too for arm; I used cmp to clear EQ but using tst seems 
> better. I also do it inside fast_lock, to give it a defined exit state wrt 
> EQ|NE, since it saves me from having to think about this on every call site. 
> But at least the fail case may be fiddly without conditional execution.

Cmp(r,r) would not clear EQ, but set it. Unless you do cmp(r,0) on a non-null 
register. Tst is better at least on x86 because it encodes smaller. *shrugs*

You can do it in the shared fast_lock() but it's really only needed in C2, 
that's why I'm doing it there. Maybe I'm too perfectionist when it comes to 
assembly code?

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Thomas Stuefe
On Thu, 16 Mar 2023 20:47:59 GMT, Roman Kennke  wrote:

> In my last changes I made a stupid mistake and don't set the condition flags 
> correctly to force the slow-path, on aarch64. This is only relevant when we 
> exceed the lock-stack capacity, that is why it's failing so rarely. I don't 
> see a similar problem on x86_64 - have we observed any failures on x86_64? I 
> pushed a fix for aarch64.

I noticed this too for arm; I used cmp to clear EQ but using tst seems better. 
I also do it inside fast_lock, to give it a defined exit state wrt EQ|NE, since 
it saves me from having to think about this on every call site. But at least 
the fail case may be fiddly without conditional execution.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Roman Kennke
On Thu, 16 Mar 2023 21:05:54 GMT, Daniel D. Daugherty  
wrote:

> > I pushed a fix for aarch64.
> 
> 
> 
> Do you think this is the cause for the -Xcheck:jni failures that I ran into
> 
> in my Tier4 testing?

Yes, and with high probability also for some/all of the other failures. It 
leads to the situation that when the lock-stack is full, it should take the 
slow-path, but doesn't (because the flags are not set correctly) and thus leave 
the object unlocked.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 20:57:31 GMT, Roman Kennke  wrote:

>> src/hotspot/cpu/x86/x86_32.ad line 617:
>> 
>>> 615:   int bangsize = C->output()->bang_size_in_bytes();
>>> 616: 
>>> 617:   __ verified_entry(framesize, 
>>> C->output()->need_stack_bang(bangsize)?bangsize:0, C->in_24_bit_fp_mode(), 
>>> C->stub_function() != NULL);
>> 
>> Why did this change from `nullptr` -> `NULL`?
>
> I reverted that part back to upstream state (at least what is in JDK-21+13).

Okay.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 20:47:59 GMT, Roman Kennke  wrote:

> I pushed a fix for aarch64.

Do you think this is the cause for the -Xcheck:jni failures that I ran into
in my Tier4 testing?

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Roman Kennke
On Thu, 16 Mar 2023 20:50:12 GMT, Daniel D. Daugherty  
wrote:

>> Roman Kennke has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Several changes (mostly cosmetic) in response to reviews
>
> src/hotspot/cpu/x86/x86_32.ad line 617:
> 
>> 615:   int bangsize = C->output()->bang_size_in_bytes();
>> 616: 
>> 617:   __ verified_entry(framesize, 
>> C->output()->need_stack_bang(bangsize)?bangsize:0, C->in_24_bit_fp_mode(), 
>> C->stub_function() != NULL);
> 
> Why did this change from `nullptr` -> `NULL`?

I reverted that part back to upstream state (at least what is in JDK-21+13).

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 12:51:10 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.

Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Roman Kennke
On Thu, 16 Mar 2023 12:51:10 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.

Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 12:51:10 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.

Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Daniel D . Daugherty
On Thu, 16 Mar 2023 12:51:10 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typcially remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are not performed in very 
>> performance-critical paths (usually in code like JVMTI or deadlock 
>> detection) where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear if it is worth to add support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, and thus handing over to the contending 
>> thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only 
>> use heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly, for 
>> monitor-locked objects we can easily reach-through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol.

Re: RFR: 8291555: Implement alternative fast-locking scheme [v28]

2023-03-16 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typcially remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal p
 rotocols.
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots) of such checks at locking sites.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier2 x86_6