Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-16 Thread Thomas Stuefe
On Thu, 16 Mar 2023 06:31:42 GMT, Roman Kennke  wrote:

> > > Agreed. But I don't think "Thin locks" is an option as that was 
> > > specifically an IBM locking implementation. Historically Hotspot's 
> > > locking mechanism has internally been referred to as stack-locks, so I 
> > > would suggest UseNewStackLocks
> > 
> > 
> > They don't use the stack anymore; would this not be us using a wrong name 
> > just for history's sake?
> 
> Well, it's still got the lock-stacks. :-D

Yes, but we have variables like "is_stack_locked" etc., without an "old" 
qualifier. Idk, up to you. Better than UseFastLocking, I guess.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-16 Thread Roman Kennke
On Thu, 16 Mar 2023 06:05:38 GMT, Thomas Stuefe  wrote:

> 
> 
> > Agreed. But I don't think "Thin locks" is an option as that was 
> > specifically an IBM locking implementation. Historically Hotspot's locking 
> > mechanism has internally been referred to as stack-locks, so I would 
> > suggest UseNewStackLocks
> 
> 
> 
> They don't use the stack anymore; would this not be us using a wrong name 
> just for history's sake?

Well, it's still got the lock-stacks. :-D

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-16 Thread Roman Kennke
On Wed, 15 Mar 2023 19:40:33 GMT, Roman Kennke  wrote:

>>> Would it be possible to open/send me the failing test that triggers 
>>> vframeArray assert
>>> or extract a reproducer that you could publish?
>> 
>> I have started an internal discussion at Oracle to see what it would take
>> to move that test from closed to open. Will keep you posted.
>
>> > Would it be possible to open/send me the failing test that triggers 
>> > vframeArray assert
>> > or extract a reproducer that you could publish?
>> 
>> I have started an internal discussion at Oracle to see what it would take to 
>> move that test from closed to open. Will keep you posted.
> 
> Thank you!
> 
> Regarding moving this PR back to draft, I am not sure. I can do that, yes. 
> But really, the fundamental algorithm and implementation have been basically 
> fixed for half a year already. I have re-worked it into a fresh PR based on the 
> request to put it behind a flag. The recent change to a fixed-size lock-stack 
> has probably invalidated part of your previous reviews, and I am sorry for 
> that. On the upside, it removed a lot of complexity in the JIT compilers and 
> assembly code generators.
> 
> What else do I expect to happen?
> 
> Thomas is working on an ARM(32) port, but this is quite separate and could 
> even land after this PR is done.
> 
> I still don't quite like the naming. Fast-locking doesn't really say anything 
> and it's not (meant to be) faster than the previous stack-locking. It is an 
> alternative (and less racy, on the object header) way to implement a 
> thin-locking layer before inflating monitors, that is all. So maybe 
> -XX:+UseNewThinLocking? It is somewhat temporary anyway. At least my hope is 
> that when we eventually turn on Lilliput by default, we would 
> entirely remove stack-locking.
> 
> I would also add some code in arguments.cpp to keep this new thin locking 
> turned off on platforms that don't yet support it.
> 
> Besides that, from my POV, it is pretty much done.
> 
> What do you think?

> @rkennke The change to a fixed-size lock stack was a significant change as you 
> note and that suggested to me that the design was still in flux. So I have to 
> wonder whether everything is in fact now stable? (or as stable as one should 
> expect for an experimental new feature)

I think it is. Except for the few points that I mentioned earlier, and anything 
that comes up in reviews, I don't expect any major design changes. In fact, I 
would actively hold back any major changes that do come up, in order to move 
this PR across the line at this point. I can't think of any bad spots where I 
think 'oh, this is ugly - this needs a better approach', though.


> > Fast-locking doesn't really say anything and it's not (meant to be) faster 
> > than the previous stack-locking.
> 
> 
> 
> Agreed. But I don't think "Thin locks" is an option as that was specifically 
> an IBM locking implementation. Historically Hotspot's locking mechanism has 
> internally been referred to as stack-locks, so I would suggest 
> UseNewStackLocks

That's fine by me.

Thank you,
Roman

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-16 Thread Thomas Stuefe
On Thu, 16 Mar 2023 05:45:29 GMT, David Holmes  wrote:

> Agreed. But I don't think "Thin locks" is an option as that was specifically 
> an IBM locking implementation. Historically Hotspot's locking mechanism has 
> internally been referred to as stack-locks, so I would suggest 
> UseNewStackLocks

They don't use the stack anymore; would this not be us using a wrong name just 
for history's sake?

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-15 Thread David Holmes
On Wed, 15 Mar 2023 19:40:33 GMT, Roman Kennke  wrote:

>>> Would it be possible to open/send me the failing test that triggers 
>>> vframeArray assert
>>> or extract a reproducer that you could publish?
>> 
>> I have started an internal discussion at Oracle to see what it would take
>> to move that test from closed to open. Will keep you posted.
>
>> > Would it be possible to open/send me the failing test that triggers 
>> > vframeArray assert
>> > or extract a reproducer that you could publish?
>> 
>> I have started an internal discussion at Oracle to see what it would take to 
>> move that test from closed to open. Will keep you posted.
> 
> Thank you!
> 
> Regarding moving this PR back to draft, I am not sure. I can do that, yes. 
> But really, the fundamental algorithm and implementation have been basically 
> fixed for half a year already. I have re-worked it into a fresh PR based on the 
> request to put it behind a flag. The recent change to a fixed-size lock-stack 
> has probably invalidated part of your previous reviews, and I am sorry for 
> that. On the upside, it removed a lot of complexity in the JIT compilers and 
> assembly code generators.
> 
> What else do I expect to happen?
> 
> Thomas is working on an ARM(32) port, but this is quite separate and could 
> even land after this PR is done.
> 
> I still don't quite like the naming. Fast-locking doesn't really say anything 
> and it's not (meant to be) faster than the previous stack-locking. It is an 
> alternative (and less racy, on the object header) way to implement a 
> thin-locking layer before inflating monitors, that is all. So maybe 
> -XX:+UseNewThinLocking? It is somewhat temporary anyway. At least my hope is 
> that when we eventually turn on Lilliput by default, we would 
> entirely remove stack-locking.
> 
> I would also add some code in arguments.cpp to keep this new thin locking 
> turned off on platforms that don't yet support it.
> 
> Besides that, from my POV, it is pretty much done.
> 
> What do you think?

@rkennke The change to a fixed-size lock stack was a significant change as you 
note and that suggested to me that the design was still in flux. So I have to 
wonder whether everything is in fact now stable? (or as stable as one should 
expect for an experimental new feature)

> Fast-locking doesn't really say anything and it's not (meant to be) faster 
> than the previous stack-locking.

Agreed. But I don't think "Thin locks" is an option as that was specifically an 
IBM locking implementation. Historically Hotspot's locking mechanism has 
internally been referred to as stack-locks, so I would suggest UseNewStackLocks

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-15 Thread Roman Kennke
On Wed, 15 Mar 2023 19:14:09 GMT, Daniel D. Daugherty  
wrote:

> > Would it be possible to open/send me the failing test that triggers 
> > vframeArray assert
> > or extract a reproducer that you could publish?
> 
> I have started an internal discussion at Oracle to see what it would take to 
> move that test from closed to open. Will keep you posted.

Thank you!

Regarding moving this PR back to draft, I am not sure. I can do that, yes. But 
really, the fundamental algorithm and implementation have been basically fixed 
for half a year already. I have re-worked it into a fresh PR based on the request 
to put it behind a flag. The recent change to a fixed-size lock-stack has 
probably invalidated part of your previous reviews, and I am sorry for that. On 
the upside, it removed a lot of complexity in the JIT compilers and assembly 
code generators.

What else do I expect to happen?

Thomas is working on an ARM(32) port, but this is quite separate and could even 
land after this PR is done.

I still don't quite like the naming. Fast-locking doesn't really say anything 
and it's not (meant to be) faster than the previous stack-locking. It is an 
alternative (and less racy, on the object header) way to implement a 
thin-locking layer before inflating monitors, that is all. So maybe 
-XX:+UseNewThinLocking? It is somewhat temporary anyway. At least my hope is 
that when we eventually turn on Lilliput by default, we would 
entirely remove stack-locking.

I would also add some code in arguments.cpp to keep this new thin locking 
turned off on platforms that don't yet support it.
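
(Purely as an illustration of what such a guard in arguments.cpp might look
like - the flag name, the platform conditions, and the message are
placeholders/assumptions on my part, not code from this PR:)

```c++
// Hypothetical guard: force the new locking off on platforms that do not
// implement it yet, warning only if the user explicitly asked for it.
#if !defined(X86) && !defined(AARCH64)
  if (UseFastLocking) {
    if (!FLAG_IS_DEFAULT(UseFastLocking)) {
      warning("UseFastLocking is not supported on this platform; disabling it.");
    }
    FLAG_SET_DEFAULT(UseFastLocking, false);
  }
#endif
```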

Besides that, from my POV, it is pretty much done.

What do you think?

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-15 Thread Daniel D. Daugherty
On Wed, 15 Mar 2023 09:36:25 GMT, Roman Kennke  wrote:

> Would it be possible to open/send me the failing test that triggers 
> vframeArray assert
> or extract a reproducer that you could publish?

I have started an internal discussion at Oracle to see what it would take
to move that test from closed to open. Will keep you posted.

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-15 Thread Roman Kennke
On Wed, 15 Mar 2023 01:25:33 GMT, Daniel D. Daugherty  
wrote:

> I did Mach5 Tier{1,2,3} on v25. Please see the bug report for the gory 
> details:
> 
> Tier1 - 1 known, unrelated failure
> Tier2 - 4 closed, unknown, related test failures
> Tier3 - 8 closed, unknown, related test failures; 2 open, known, unrelated 
> test failures; 16 open, unknown, related test failures
> 
> I'm pausing my Mach5 testing at this point.

Hi Daniel,

I could not reproduce any of the test failures, either on x86_64 or on 
aarch64. I have blindly fixed the code stub size issue; it seems rather 
trivial. Would it be possible to open/send me the failing test that triggers 
vframeArray assert or extract a reproducer that you could publish? I looked at 
it but could not figure out what could be going on there.

Thanks,
Roman

-

PR: https://git.openjdk.org/jdk/pull/10907


Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-14 Thread Daniel D. Daugherty
On Tue, 14 Mar 2023 18:52:39 GMT, Roman Kennke  wrote:

>> This change adds a fast-locking scheme as an alternative to the current 
>> stack-locking implementation. It retains the advantages of stack-locking 
>> (namely fast locking in uncontended code-paths), while avoiding the overload 
>> of the mark word. That overloading causes massive problems with Lilliput, 
>> because it means we have to check and deal with this situation when trying 
>> to access the mark-word. And because of the very racy nature, this turns out 
>> to be very complex and would involve a variant of the inflation protocol to 
>> ensure that the object header is stable. (The current implementation of 
>> setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto 
>> the stack which consists only of the displaced header, and CAS a pointer to 
>> this stack location into the object header (the lowest two header bits being 
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
>> identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two 
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the 
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to 
>> a thread-local lock-stack. This is a new structure which is basically a 
>> small array of oops that is associated with each thread. Experience shows 
>> that this array typically remains very small (3-5 elements). Using this lock 
>> stack, it is possible to query which threads own which locks. Most 
>> importantly, the most common question 'does the current thread own me?' is 
>> very quickly answered by doing a quick scan of the array. More complex 
>> queries like 'which thread owns X?' are only performed in paths that are not 
>> performance-critical (usually in code like JVMTI or deadlock detection), 
>> where it is ok to do more complex operations (and we already do). 
>> The lock-stack is also a new set of GC roots, and would be scanned during 
>> thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for 
>> potential overflow before attempting locking. When that is the case, locking 
>> fast-paths would call into the runtime to grow the stack and handle the 
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
>> on method entry to avoid (possibly lots of) such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive 
>> locking (yet). When that happens, the fast-lock gets inflated to a full 
>> monitor. It is not clear whether it is worth adding support for recursive 
>> fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked 
>> object, it must inflate the fast-lock to a full monitor. Normally, we need 
>> to know the current owning thread, and record that in the monitor, so that 
>> the contending thread can wait for the current owner to properly exit the 
>> monitor. However, fast-locking doesn't have this information. What we do 
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
>> currently holds the lock arrives at monitorexit, and observes 
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
>> and then properly exits the monitor, thus handing the lock over to the 
>> contending thread.
>> 
>> As an alternative, I considered removing stack-locking altogether and only 
>> using heavy monitors. In most workloads this did not show measurable 
>> regressions. However, in a few workloads, I have observed severe 
>> regressions. All of them have been using old synchronized Java collections 
>> (Vector, Stack), StringBuffer or similar code. The combination of two 
>> conditions leads to regressions without stack- or fast-locking: 1. The 
>> workload synchronizes on uncontended locks (e.g. single-threaded use of 
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW, 
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating 
>> lots of such single-use, single-threaded-locked objects leads to massive 
>> ObjectMonitor churn, which can lead to a significant performance impact. But 
>> alas, such code exists, and we probably don't want to punish it if we can 
>> avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the 
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the 
>> hashcode has been installed. Fast-locked headers can be used directly; for 
>> monitor-locked objects, we can easily reach through to the displaced header. 
>> This is safe because Java threads participate in monitor deflation protocol. 

Re: RFR: 8291555: Implement alternative fast-locking scheme [v26]

2023-03-14 Thread Roman Kennke
> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
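
(For illustration only: a minimal, standalone C++ sketch of the legacy
stack-locking shape described above. This is not HotSpot code; the mark-word
bit patterns and the names are simplified assumptions.)

```c++
#include <atomic>
#include <cstdint>

// Illustrative mark-word encoding: lowest two bits 01 = unlocked, 00 = stack-locked.
constexpr uintptr_t UNLOCKED_VALUE = 0b01;
constexpr uintptr_t LOCK_MASK      = 0b11;

struct Object {
  std::atomic<uintptr_t> mark;          // the object header (mark word)
};

// Lives in a stack frame of the locking thread and holds the displaced header.
struct BasicLock {
  uintptr_t displaced_header;
};

// Legacy stack-locking fast path: save the unlocked header into the on-stack
// BasicLock, then CAS a pointer to that stack slot into the header. The slot
// is aligned, so the low two bits of the pointer are 00 ("stack-locked"), and
// the stack address identifies the owning thread.
bool stack_lock(Object* obj, BasicLock* lock) {
  uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) != UNLOCKED_VALUE) {
    return false;                       // already locked: take the slow path
  }
  lock->displaced_header = mark;
  uintptr_t locked = reinterpret_cast<uintptr_t>(lock);
  return obj->mark.compare_exchange_strong(mark, locked);
}

// Unlock fast path: CAS the displaced header back into the object.
bool stack_unlock(Object* obj, BasicLock* lock) {
  uintptr_t expected = reinterpret_cast<uintptr_t>(lock);
  return obj->mark.compare_exchange_strong(expected, lock->displaced_header);
}
```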
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typically remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are only performed in paths that are not performance-critical 
> (usually in code like JVMTI or deadlock detection), where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal protocols.
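
(Again for illustration only: a sketch, under the same simplified assumptions,
of a thread-local lock-stack and the fast-lock CAS. The capacity and all names
are made up for the example and are not the PR's actual values.)

```c++
#include <atomic>
#include <cstdint>

constexpr uintptr_t UNLOCKED_VALUE    = 0b01;  // illustrative bit patterns
constexpr uintptr_t FAST_LOCKED_VALUE = 0b00;
constexpr uintptr_t LOCK_MASK         = 0b11;

struct Object {
  std::atomic<uintptr_t> mark;
};

// Hypothetical per-thread lock-stack: a small array of object references
// owned by exactly one thread, so no synchronization is needed to access it.
struct LockStack {
  static const int CAPACITY = 8;       // assumed size, not the PR's value
  Object* elems[CAPACITY];
  int top = 0;

  bool can_push() const { return top < CAPACITY; }
  void push(Object* o)  { elems[top++] = o; }
  void pop()            { top--; }

  // "Does the current thread own this lock?" -> linear scan of a tiny array.
  bool contains(Object* o) const {
    for (int i = 0; i < top; i++) {
      if (elems[i] == o) return true;
    }
    return false;
  }
};

// Fast-locking fast path: flip the low two header bits from 'unlocked' to
// 'fast-locked' with a CAS (leaving the upper bits untouched) and record the
// object on the current thread's lock-stack.
bool fast_lock(Object* obj, LockStack& ls) {
  if (!ls.can_push()) return false;    // no room: call into the runtime instead
  uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
  if ((mark & LOCK_MASK) != UNLOCKED_VALUE) return false;  // contended/inflated
  uintptr_t locked = (mark & ~LOCK_MASK) | FAST_LOCKED_VALUE;
  if (!obj->mark.compare_exchange_strong(mark, locked)) return false;
  ls.push(obj);
  return true;
}
```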
> 
> The lock-stack is grown when needed. This means that we need to check for 
> potential overflow before attempting locking. When that is the case, locking 
> fast-paths would call into the runtime to grow the stack and handle the 
> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check 
> on method entry to avoid (possibly lots of) such checks at locking sites.
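
(Continuing the LockStack sketch above, this is roughly the shape a hoisted
method-entry check could take; 'monitors_in_method' and the slow-path stub are
hypothetical names, only meant to illustrate the idea.)

```c++
// Hypothetical slow path: in the VM this would call into the runtime, which
// makes room on the lock-stack and/or handles the locking; here it is a stub.
void lock_stack_overflow_slow_path(LockStack& ls) {
  (void)ls;  // runtime work elided in this sketch
}

// One check per method entry instead of one per locking site: if there might
// not be room for the locks this method can take, go to the runtime up front.
void check_lock_stack_on_method_entry(LockStack& ls, int monitors_in_method) {
  if (ls.top + monitors_in_method > LockStack::CAPACITY) {
    lock_stack_overflow_slow_path(ls);
  }
}
```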
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear whether it is worth adding support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, thus handing the lock over to the 
> contending thread.
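
(A self-contained C++ sketch of the ANONYMOUS_OWNER hand-over described above.
This is not HotSpot's ObjectMonitor; the sentinel value and the names are
assumptions made for the example.)

```c++
#include <atomic>

struct Thread;   // stands in for the VM's thread type

// Sketch of an inflated monitor with an owner field. ANONYMOUS_OWNER is a
// sentinel meaning "fast-locked by a thread we could not identify yet".
struct ObjectMonitor {
  static Thread* const ANONYMOUS_OWNER;
  std::atomic<Thread*> owner{nullptr};
};
Thread* const ObjectMonitor::ANONYMOUS_OWNER = reinterpret_cast<Thread*>(1);

// A contending thread that finds the object fast-locked inflates the lock,
// but it cannot tell who holds it, so it records the sentinel and then waits.
void inflate_for_contention(ObjectMonitor* m) {
  m->owner.store(ObjectMonitor::ANONYMOUS_OWNER, std::memory_order_release);
}

// When the real owner reaches monitorexit and observes ANONYMOUS_OWNER, it
// knows the anonymous holder must be itself: it fixes up the owner field and
// then exits normally, handing the monitor over to the contending thread.
void monitor_exit(ObjectMonitor* m, Thread* self) {
  if (m->owner.load(std::memory_order_acquire) == ObjectMonitor::ANONYMOUS_OWNER) {
    m->owner.store(self, std::memory_order_relaxed);  // fix up the owner
  }
  // ... normal exit path: release the monitor, wake a successor ...
  m->owner.store(nullptr, std::memory_order_release);
}
```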
> 
> As an alternative, I considered removing stack-locking altogether and only 
> using heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables us to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly; for 
> monitor-locked objects, we can easily reach through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
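
(To illustrate the first point in the list above: without a header that points
into a foreign thread's stack, installing a monitor can become a single
tagged-pointer CAS. A simplified sketch with assumed bit patterns and names;
recording any previous fast-locked state in the monitor is elided here.)

```c++
#include <atomic>
#include <cstdint>

constexpr uintptr_t MONITOR_VALUE = 0b10;  // illustrative tag for "has monitor"
constexpr uintptr_t LOCK_MASK     = 0b11;

struct ObjectMonitor { /* wait set, owner, etc. elided */ };

struct Object {
  std::atomic<uintptr_t> mark;
};

// Install a monitor with one tagged-pointer CAS instead of the multi-step
// displaced-header inflation protocol.
ObjectMonitor* install_monitor(Object* obj, ObjectMonitor* mon) {
  uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
  for (;;) {
    if ((mark & LOCK_MASK) == MONITOR_VALUE) {
      // Somebody else won the race; use their monitor.
      return reinterpret_cast<ObjectMonitor*>(mark & ~LOCK_MASK);
    }
    uintptr_t tagged = reinterpret_cast<uintptr_t>(mon) | MONITOR_VALUE;
    if (obj->mark.compare_exchange_weak(mark, tagged)) {
      return mon;   // our monitor is now installed in the header
    }
    // CAS failed: 'mark' was reloaded with the current header, retry.
  }
}
```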
> 
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier2