Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 30 Mar 2023 11:45:54 GMT, Roman Kennke wrote: >> I have another question about the asymmetric unlocking code in >> `InterpreterMacroAssembler::unlock_object`. >> >> We go through here for both fast-locked and fat OM locks, right? If so, >> shouldn't we do the asymmetric lock check only for the fast locked case? >> Otherwise Lockstack may be empty, so we compare the word preceding the first >> slot, which would cause us to always break into the slow case? >> >> Sorry if I miss something here. > > Uh, yes, indeed. It works by accident, I suppose, because we don't segfault > on the word preceding the lock-stack, and monitor-locking takes the slow-case > in interpreter, anyway. But yeah, it's better to check for it. > @rkennke Question about ZGC and LockStack::contains(): how does this work > with colored pointers? Don't we have to mask the color bits out somehow when > comparing? E.g. using `ZAddress::offset()` ? We must ensure that the oops are in a valid state. I recently added code to ::contains() to call start_processing() when called from a foreign thread. When inspecting its own thread, we are always good, because stack-watermark is processed right after leaving a safepoint, before resuming normal operations. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1490302961
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Sat, 25 Mar 2023 08:55:25 GMT, Thomas Stuefe wrote: > I have another question about the asymmetric unlocking code in > `InterpreterMacroAssembler::unlock_object`. > > We go through here for both fast-locked and fat OM locks, right? If so, > shouldn't we do the asymmetric lock check only for the fast locked case? > Otherwise Lockstack may be empty, so we compare the word preceding the first > slot, which would cause us to always break into the slow case? > > Sorry if I miss something here. Uh, yes, indeed. It works by accident, I suppose, because we don't segfault on the word preceding the lock-stack, and monitor-locking takes the slow-case in interpreter, anyway. But yeah, it's better to check for it. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1490163377
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Tue, 28 Mar 2023 16:00:34 GMT, Thomas Stuefe wrote: >> Please also verify against over- and underflow, and better than just null >> checks check that every oop really is an oop. I added this to my code: >> >> assert((_offset <= end_offset()), "lockstack overflow: _offset %d >> end_offset %d", _offset, end_offset()); >> assert((_offset >= start_offset()), "lockstack underflow: _offset %d >> end_offset %d", _offset, start_offset()); >> int end = to_index(_offset); >> for (int i = 0; i < end; i++) { >> assert(oopDesc::is_oop(_base[i]), "index %i: not an oop (" PTR_FORMAT >> ")", i, p2i(_base[i])); >> ... > > Just realized that my proposal of oop-checking does not work since during GC > oop can be moved and will temporarily be invalid. Checking for is_oop() may not work, because oops may be temporarily invalid, until the GC gets to fix them. This is especially (and probably only) true around oops_do(). - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1152089622
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Tue, 28 Mar 2023 10:53:10 GMT, Roman Kennke wrote: >> src/hotspot/share/runtime/threads.cpp line 1422: >> >>> 1420: } >>> 1421: >>> 1422: JavaThread* Threads::owning_thread_from_object(ThreadsList * t_list, >>> oop obj) { >> >> Is this thread-safe? > > My last commit changed that code to only run during safepoints. It should be > safe now, and I added an assert that verifies that it is only run at > safepoint. I see the assert in `owning_thread_from_monitor` but not `owning_thread_from_object`. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150900864
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 16:54:59 GMT, Thomas Stuefe wrote: >> src/hotspot/share/runtime/lockStack.cpp line 42: >> >>> 40: >>> 41: #ifndef PRODUCT >>> 42: void LockStack::validate(const char* msg) const { >> >> Would you also like to check there are no `nullptr` elements on stack here? > Please also verify against over- and underflow, and better than just null > checks check that every oop really is an oop. I added this to my code: > > assert((_offset <= end_offset()), "lockstack overflow: _offset %d > end_offset %d", _offset, end_offset()); > assert((_offset >= start_offset()), "lockstack underflow: _offset %d > end_offset %d", _offset, start_offset()); > int end = to_index(_offset); > for (int i = 0; i < end; i++) { > assert(oopDesc::is_oop(_base[i]), "index %i: not an oop (" PTR_FORMAT > ")", i, p2i(_base[i])); > ... Just realized that my proposal of oop-checking does not work, since during GC oops can be moved and will temporarily be invalid. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150847182
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 11:17:05 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 611: > >> 609: bind(slow); >> 610: testptr(objReg, objReg); // force ZF=0 to indicate failure >> 611: jmp(NO_COUNT); > > We set a flag on failure (`NO_COUNT`) path. Should we set the flag on success > (`COUNT`) path as well? The path at COUNT already sets the ZF, there is no need to do it here. NO_COUNT doesn't clear ZF, and fast_lock_impl() may branch to slow with ZF set (on the overflow check) so we need to explicitly clear ZF. However, I just came up with a better way to check for overflow that readily clears the ZF: subtract 1 from the end-offset and make a greater-comparison instead of the greaterEquals that we currently do on the end-offset. That should simplify the code and avoid a branch. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 926: > >> 924: mov(boxReg, tmpReg); >> 925: fast_unlock_impl(objReg, boxReg, tmpReg, NO_COUNT); >> 926: jmp(COUNT); > > Do we need to care about returning proper flags here? Yes we do, but fast_unlock_impl() already does the right thing on the failure path, and COUNT does the right thing on the success path. Yes, it is all very ugly. *shrugs* - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150609171 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150610895
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 10:50:39 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/cpu/aarch64/aarch64.ad line 3848: > >> 3846: __ bind(slow); >> 3847: __ tst(oop, oop); // Force ZF=0 to indicate failure and take >> slow-path. We know that oop != null. >> 3848: __ b(no_count); > > Is this a micro-optimization? I think we can simplify the code by just > setting the flags here and then jumping into the usual `__ b(cont)`. This > would make the move of `__ b(cont)` unnecessary below. Well it avoids one conditional branch, so yeah I guess it's an optimization. If you don't like the b(cont) I can still move it back to where it was. It would be a dead instruction with fast-locking, though. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150585671
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 10:47:11 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/cpu/aarch64/aarch64.ad line 3954: > >> 3952: // Indicate success on completion. >> 3953: __ cmp(oop, oop); >> 3954: __ b(count); > > `aarch64_enc_fast_lock` explicitly sets NE on failure path. But this code > just jumps to `no_count` without setting the flag. Does the code outside this > encoding block rely on flags? The code outside this encoding block relies on flags, yes. This is very ugly. fast_unlock() jumps to no_count when the CAS fails, where the NE flag is set, therefore we don't need to set it again. > src/hotspot/share/runtime/synchronizer.cpp line 923: > >> 921: static bool is_lock_owned(Thread* thread, oop obj) { >> 922: assert(UseFastLocking, "only call this with fast-locking enabled"); >> 923: return thread->is_Java_thread() ? >> reinterpret_cast(thread)->lock_stack().contains(obj) : false; > > Here and later, should use `JavaThread::cast(thread)` instead of > `reinterpret_cast`? It also sometimes subsumes the asserts, as > `JT::cast` checks the type. The problem is that the places where that helper function is used receive a Thread* and not a JavaThread* (FastHashCode() and inflate()), and changing those to accept JavaThread* percolates into various areas that I don't want to touch right now (e.g. finalizerService.cpp). That is the reason why that function exists to begin with. I'll do the changes that @shipilev suggested for the time being. We may want to investigate restricting the incoming type in the future. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150580134 PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150576745
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 02:49:11 GMT, David Holmes wrote: > > Is anybody familiar with the academic literature on this topic? I am sure I > > am not the first person which has come up with this form of locking. Maybe > > we could use a name that refers to some academic paper? > > Well not to diminish this in any way but all you are doing is moving the > lock-record from the stack frame (indexed from the markword) to a heap > allocated side-table (indexed via the thread itself). The "fast-locking" is > still the bit that use the markword to indicate the locked state, and that > hasn't changed. Encoding lock state in an object header has a number of names > in the literature, depending on whose scheme it was: IBM had ThinLocks; the > Sun Research VM (EVM) had meta-locks; etc. Hotspot doesn't really have a name > for its variation. And as I said you aren't changing that aspect but > modifying what data structure is used to access the lock-records. > > So the property Jesper was looking for, IMO, may be something like > `UseHeapLockRecords` - though that can unfortunately be parsed as using > records for the HeapLock. :( > > I think it was mentioned somewhere above that in the Java Object Monitor > prototyping work we avoided using these kinds of boolean flags by defining a > single "policy" flag that could take on different values for different > implementation schemes. 
> These are simply numbered, so for example: > > * policy 0: use existing/legacy locking with stack-based lock records > * policy 1: use heavyweight locks (ie UseHeavyMonitors) > * policy 2 use the new approach with heap-allocated lock-records Well I would argue that the current implementation puts the lock record in the object header (in the form of a pointer into the stack, which is elsewhere used to identify which thread an object is locked by), and only displaces the object header onto the stack-frame, whereas the new implementation puts the lock record onto the lock-stack, which is still part of the JavaThread structure (and not heap-allocated ... although it used to be in an earlier incarnation). And it leaves the object header alone. So the correct name would be +UseStackLockRecord to turn on the new impl (where the old one would be called header lock record, or something like that). I like that name, but I suspect that it might be confusing because the old impl has traditionally been called 'stack-locking'. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1486716156
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 06:55:55 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/runtime/threads.cpp line 1422: > >> 1420: } >> 1421: >> 1422: JavaThread* Threads::owning_thread_from_object(ThreadsList * t_list, >> oop obj) { > > Is this thread-safe? My last commit changed that code to only run during safepoints. It should be safe now, and I added an assert that verifies that it is only run at safepoint. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1150401225
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 06:15:31 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/oops/oop.cpp line 126: > >> 124: // Outside of a safepoint, the header could be changing (for example, >> 125: // another thread could be inflating a lock on this object). >> 126: if (ignore_mark_word || UseFastLocking) { > > Not clear why UseFastLocking appears here instead of in the return expression > - especially given the comment above. I have the same question. Based on the comment, I would expect `return !SafepointSynchronize::is_at_safepoint() || UseFastLocking`; - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149966288
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 18:47:31 GMT, Roman Kennke wrote: > > > > > > @rkennke Question about ZGC and LockStack::contains(): how does > > > > > > this work with colored pointers? Don't we have to mask the color > > > > > > bits out somehow when comparing? E.g. using `ZAddress::offset()` ? > > > > > That would be a question for @fisk and/or @stefank. AFAIK, the color > > > > > bits should be masked by ZGC barriers _before_ the oops enter the > > > > > synchronization subsystem. But I kinda suspect that we are somehow > > > > > triggering a ZGC bug here. Maybe we require barriers when reading > > > > > oops from the lock-stack too? > > > > Oops that are processed in Thread::oops_do should not have load > > > > barriers. Other oops should have load barriers. > > > Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() > > > which is called from Thread::oops_do(). But help me here: I believe ZGC > > > processes this stuff concurrently, right? So there might be a window > > > where the lock-stack oops would be unprocessed. The lock-stack would not > > > go under the stack-watermark machinery. And if some code (like JVMTI > > > deadlock detection pause) inspects the lockstack, it might see invalid > > > oops? Is that a plausible scenario, or am I missing something? > > The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints > > call start_processing on all threads in safepoint cleanup for non-GC > > safepoints. That means the lock stack oops should have been processed when > > the deadlock detection logic runs in a safepoint. > There appears to be a single code-path that inspects the lock-stack (and also > the usual stack under non-fast-locking) outside of safepoints: > > V [libjvm.so+0x180abd4] Threads::owning_thread_from_monitor(ThreadsList*, > ObjectMonitor*)+0x54 (threads.cpp:1433) > V [libjvm.so+0x17a4bfc] ObjectSynchronizer::get_lock_owner(ThreadsList*, > Handle)+0x9c (synchronizer.cpp:1109) > V [libjvm.so+0x1802db0] ThreadSnapshot::initialize(ThreadsList*, > JavaThread*)+0x270 (threadService.cpp:942) > V [libjvm.so+0x1803244] > ThreadDumpResult::add_thread_snapshot(JavaThread*)+0x5c > (threadService.cpp:567) > V [libjvm.so+0x12a0f64] jmm_GetThreadInfo+0x480 (management.cpp:1136) > j > sun.management.ThreadImpl.getThreadInfo1([JI[Ljava/lang/management/ThreadInfo;)V+0 > java.management@21-internal > > Curiously, this seems to be in JMX code, which is also roughly where the > failure happens. I came across this code a couple of times and couldn't > really tell if it is safe to do that outside of a safepoint. In doubt I have > to assume it is not, and maybe this is the source of the failure? WDYT? Could be. When not running a handshake or safepoint, you need to call start_processing manually on the target thread, which will ensure the oops are fixed until the next safepoint poll. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485752396
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 17:30:03 GMT, Roman Kennke wrote: >>> @rkennke Question about ZGC and LockStack::contains(): how does this work >>> with colored pointers? Don't we have to mask the color bits out somehow >>> when comparing? E.g. using `ZAddress::offset()` ? >> >> That would be a question for @fisk and/or @stefank. AFAIK, the color bits >> should be masked by ZGC barriers *before* the oops enter the synchronization >> subsystem. But I kinda suspect that we are somehow triggering a ZGC bug >> here. Maybe we require barriers when reading oops from the lock-stack too? > > Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which > is called from Thread::oops_do(). But help me here: I believe ZGC processes > this stuff concurrently, right? So there might be a window where the > lock-stack oops would be unprocessed. The lock-stack would not go under the > stack-watermark machinery. And if some code (like JVMTI deadlock detection > pause) inspects the lockstack, it might see invalid oops? Is that a plausible > scenario, or am I missing something? > > The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints call > start_processing on all threads in safepoint cleanup for non-GC safepoints. > That means the lock stack oops should have been processed when the deadlock > detection logic runs in a safepoint.
There appears to be a single code-path that inspects the lock-stack (and also the usual stack under non-fast-locking) outside of safepoints: V [libjvm.so+0x180abd4] Threads::owning_thread_from_monitor(ThreadsList*, ObjectMonitor*)+0x54 (threads.cpp:1433) V [libjvm.so+0x17a4bfc] ObjectSynchronizer::get_lock_owner(ThreadsList*, Handle)+0x9c (synchronizer.cpp:1109) V [libjvm.so+0x1802db0] ThreadSnapshot::initialize(ThreadsList*, JavaThread*)+0x270 (threadService.cpp:942) V [libjvm.so+0x1803244] ThreadDumpResult::add_thread_snapshot(JavaThread*)+0x5c (threadService.cpp:567) V [libjvm.so+0x12a0f64] jmm_GetThreadInfo+0x480 (management.cpp:1136) j sun.management.ThreadImpl.getThreadInfo1([JI[Ljava/lang/management/ThreadInfo;)V+0 java.management@21-internal Curiously, this seems to be in JMX code, which is also roughly where the failure happens. I came across this code a couple of times and couldn't really tell if it is safe to do that outside of a safepoint. In doubt I have to assume it is not, and maybe this is the source of the failure? WDYT? - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485692007
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 06:39:18 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/runtime/synchronizer.cpp line 516: > >> 514: // No room on the lock_stack so fall-through to inflate-enter. >> 515: } else { >> 516: markWord mark = obj->mark(); > > why is it `mark` here but `header` above? Oh I don't know. We are very inconsistent in our nomenclature here and use mark in some places and header in some others (e.g. OM::set_header() or the displaced_header() methods). The name mark is not really fitting and only still exists for historical reasons I believe (when one of the primary functions of the object header is to indicate GC marking?) However, the central data type is still called markWord so I changed my new code paths to use *mark as well. This probably warrants a round of codebase cleanup and consolidation later. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149630607
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 17:30:03 GMT, Roman Kennke wrote: > > > > @rkennke Question about ZGC and LockStack::contains(): how does this > > > > work with colored pointers? Don't we have to mask the color bits out > > > > somehow when comparing? E.g. using `ZAddress::offset()` ? > > > > > > > > > > > > That would be a question for @fisk and/or @stefank. AFAIK, the color bits > > > should be masked by ZGC barriers _before_ the oops enter the > > > synchronization subsystem. But I kinda suspect that we are somehow > > > triggering a ZGC bug here. Maybe we require barriers when reading oops > > > from the lock-stack too? > > > > > > Oops that are processed in Thread::oops_do should not have load barriers. > > Other oops should have load barriers. > > > > Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which > is called from Thread::oops_do(). But help me here: I believe ZGC processes > this stuff concurrently, right? So there might be a window where the > lock-stack oops would be unprocessed. The lock-stack would not go under the > stack-watermark machinery. And if some code (like JVMTI deadlock detection > pause) inspects the lockstack, it might see invalid oops? Is that a plausible > scenario, or am I missing something? The JVMTI deadlock detection runs in a safepoint, doesn't it? Safepoints call start_processing on all threads in safepoint cleanup for non-GC safepoints. That means the lock stack oops should have been processed when the deadlock detection logic runs in a safepoint. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485592271
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 06:16:14 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/runtime/arguments.cpp line 1997: > >> 1995: } >> 1996: if (UseHeavyMonitors) { >> 1997: FLAG_SET_DEFAULT(UseFastLocking, false); > > Probably should be an error if both are set on the command-line. I am not sure. UseFastLocking (or whatever we will call it in the near future) chooses between traditional and new stack-locking. We disable both by +UseHeavyMonitors. Also, +UseFastLocking is not really meant to be used by users; at least it is my intention to mainly turn it on programmatically when compact headers (Lilliput) are turned on. When also running with +UseHeavyMonitors, it doesn't really matter which stack-locking impl is selected; neither would actually be used. I'd rather remove the explicit turning-off of UseFastLocking under +UseHeavyMonitors. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1149623912
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 15:53:47 GMT, Roman Kennke wrote: > > @rkennke Question about ZGC and LockStack::contains(): how does this work > > with colored pointers? Don't we have to mask the color bits out somehow > > when comparing? E.g. using `ZAddress::offset()` ? > > > > That would be a question for @fisk and/or @stefank. AFAIK, the color bits > should be masked by ZGC barriers *before* the oops enter the synchronization > subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. > Maybe we require barriers when reading oops from the lock-stack too? Oops that are processed in Thread::oops_do should not have load barriers. Other oops should have load barriers. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485426029
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 15:53:47 GMT, Roman Kennke wrote: >> Is anybody familiar with the academic literature on this topic? I am sure I >> am not the first person which has come up with this form of locking. Maybe >> we could use a name that refers to some academic paper? > >> @rkennke Question about ZGC and LockStack::contains(): how does this work >> with colored pointers? Don't we have to mask the color bits out somehow when >> comparing? E.g. using `ZAddress::offset()` ? > > That would be a question for @fisk and/or @stefank. AFAIK, the color bits > should be masked by ZGC barriers *before* the oops enter the synchronization > subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. > Maybe we require barriers when reading oops from the lock-stack too? > Oops that are processed in Thread::oops_do should not have load barriers. > Other oops should have load barriers. Ok, good. The lockstack is processed in JavaThread::oops_do_no_frames() which is called from Thread::oops_do(). But help me here: I believe ZGC processes this stuff concurrently, right? So there might be a window where the lock-stack oops would be unprocessed. The lock-stack would not go under the stack-watermark machinery. And if some code (like JVMTI deadlock detection pause) inspects the lockstack, it might see invalid oops? Is that a plausible scenario, or am I missing something?
- PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485550661
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > Is anybody familiar with the academic literature on this topic? I am sure I > am not the first person which has come up with this form of locking. Maybe we > could use a name that refers to some academic paper? > @rkennke Question about ZGC and LockStack::contains(): how does this work > with colored pointers? Don't we have to mask the color bits out somehow when > comparing? E.g. using `ZAddress::offset()` ? That would be a question for @fisk and/or @stefank. AFAIK, the color bits should be masked by ZGC barriers *before* the oops enter the synchronization subsystem. But I kinda suspect that we are somehow triggering a ZGC bug here. Maybe we require barriers when reading oops from the lock-stack too? - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485390285
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > Is anybody familiar with the academic literature on this topic? I am sure I > am not the first person which has come up with this form of locking. Maybe we > could use a name that refers to some academic paper? @rkennke Question about ZGC and LockStack::contains(): how does this work with colored pointers? Don't we have to mask the color bits out somehow when comparing? E.g. using `ZAddress::offset()` ? - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1485293012
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Mon, 27 Mar 2023 02:46:31 GMT, David Holmes wrote: > > AFAIU, these places now could get outdated information from the contains > > function: either getting a false positive or a false negative when asking > > for a given object. But if I understand correctly, this had been the case > > before, too. Since the information about ownership may be outdated as soon > > as one read the BasicLock* address from the markword. So these APIs do only > > report a state that once had been, but never the current state. > > Sure but my concern is not stale data, it is whether the lock-stack code may > crash (or otherwise misbehave) if there are concurrent changes to it whilst > it is being queried. Lockstack is a fixed sized array of oop with a ptr-sized member indicating the current offset. It is embedded into Thread. Modifying the LockStack (push, pop, remove) is non-atomic: write the element, then update the current offset. The contains function - the only function that may be called from a thread other than the current thread (and we should assert that) - reads the current offset, then iterates the whole array up to offset, comparing each oop with the given one. The only way I see this crashing is if we read the current offset in a half-written state. Not sure if any of our platforms read/write a pointer-sized field non-atomically. All other misreads would "just" result in a time window where contains() gives the wrong answer. So we should read the offset atomically. Alternatively, have a second contains(), e.g. `contains_full()`, just for the sake of concurrent iteration, and let that one iterate the whole stack incl. unused slots. Iteration limits would be hardcoded offsets from Thread*, no need to read offset. The obvious disadvantage would be that we'd need to mark/zero out popped slots in release builds. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1484583227
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 17:11:46 GMT, Thomas Stuefe wrote: > AFAIU, these places now could get outdated information from the contains > function: either getting a false positive or a false negative when asking for > a given object. But if I understand correctly, this had been the case before, > too. Since the information about ownership may be outdated as soon as one > reads the BasicLock* address from the markword. So these APIs only report a > state that once was, but never the current state. Sure, but my concern is not stale data, it is whether the lock-stack code may crash (or otherwise misbehave) if there are concurrent changes to it whilst it is being queried. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1484401692
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > Is anybody familiar with the academic literature on this topic? I am sure I > am not the first person who has come up with this form of locking. Maybe we > could use a name that refers to some academic paper? @rkennke Would you mind integrating these changes into LockStack: https://github.com/tstuefe/jdk/tree/10907%2Blockstack-veris Mainly: wiping unused lock-stack slots on pop, more verification, and better output. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1483864128
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typically remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly many) such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear whether it is worth adding support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc. as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change makes it possible to simplify (and speed up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
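The core mechanism quoted above — CAS the two low header bits from 'unlocked' (01) to 'fast-locked' (00), and record ownership on a thread-local lock stack instead of in the header — can be sketched in a standalone form. This is illustrative only, with a heavily simplified mark word and no inflation, recursion, or ANONYMOUS_OWNER handling; all names here are hypothetical, not HotSpot code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified mark-word lock states (low two bits only).
const uintptr_t UNLOCKED    = 0b01;
const uintptr_t FAST_LOCKED = 0b00;
const uintptr_t LOCK_MASK   = 0b11;

struct ObjectSketch {
    std::atomic<uintptr_t> mark{UNLOCKED};
};

// The thread-local lock stack: ownership lives here, not in the header.
thread_local std::vector<ObjectSketch*> lock_stack;

bool fast_lock(ObjectSketch* obj) {
    uintptr_t m = obj->mark.load(std::memory_order_relaxed);
    uintptr_t unlocked = (m & ~LOCK_MASK) | UNLOCKED;
    uintptr_t locked   = (m & ~LOCK_MASK) | FAST_LOCKED;
    // CAS the low bits 01 -> 00; on success, push the object reference
    // onto the thread-local lock stack (the "reversed" stack-locking).
    if (obj->mark.compare_exchange_strong(unlocked, locked)) {
        lock_stack.push_back(obj);
        return true;
    }
    return false;  // already locked or contended: slow path in the real scheme
}

bool fast_unlock(ObjectSketch* obj) {
    if (lock_stack.empty() || lock_stack.back() != obj) {
        return false;  // not our top-of-stack entry: slow path
    }
    uintptr_t m = obj->mark.load(std::memory_order_relaxed);
    uintptr_t locked   = (m & ~LOCK_MASK) | FAST_LOCKED;
    uintptr_t unlocked = (m & ~LOCK_MASK) | UNLOCKED;
    if (obj->mark.compare_exchange_strong(locked, unlocked)) {
        lock_stack.pop_back();
        return true;
    }
    return false;  // header changed underneath us (e.g. inflated): slow path
}
```

Note how a second `fast_lock` on the same object fails rather than recursing, mirroring the description's point that recursive locking is (currently) handled by inflating to a full monitor.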
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 11:36:24 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/runtime/lockStack.cpp line 42: > >> 40: >> 41: #ifndef PRODUCT >> 42: void LockStack::validate(const char* msg) const { > > Would you also like to check there are no `nullptr` elements on stack here? Please also verify against over- and underflow, and rather than just null checks, check that every oop really is an oop. I added this to my code:

    assert((_offset <= end_offset()), "lockstack overflow: _offset %d end_offset %d", _offset, end_offset());
    assert((_offset >= start_offset()), "lockstack underflow: _offset %d start_offset %d", _offset, start_offset());
    int end = to_index(_offset);
    for (int i = 0; i < end; i++) {
      assert(oopDesc::is_oop(_base[i]), "index %i: not an oop (" PTR_FORMAT ")", i, p2i(_base[i]));
    ...

- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147841666
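The shape of the suggested validation — bounds first, then every used slot — can be shown with a standalone sketch. `ValidatingStack` and its fields are stand-ins for the real `LockStack`, and the null check stands in for the full `is_oop()` check, which needs the real runtime.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Illustrative stand-in for LockStack::validate(): check over- and
// underflow before touching elements, then check each used slot.
struct ValidatingStack {
    static const int CAPACITY = 8;
    uintptr_t _base[CAPACITY];
    int _offset = 0;  // index of the first free slot

    bool validate(const char* msg) const {
        if (_offset > CAPACITY) {  // overflow: offset past the end
            fprintf(stderr, "%s: lockstack overflow: _offset %d\n", msg, _offset);
            return false;
        }
        if (_offset < 0) {         // underflow: offset before the start
            fprintf(stderr, "%s: lockstack underflow: _offset %d\n", msg, _offset);
            return false;
        }
        for (int i = 0; i < _offset; i++) {  // each used slot must be set
            if (_base[i] == 0) {
                fprintf(stderr, "%s: index %d: null element\n", msg, i);
                return false;
            }
        }
        return true;
    }
};
```

Doing the bounds checks first matters: a corrupted offset would otherwise send the element loop reading past the array.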
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 07:00:35 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > >> The lock-stack is grown when needed. > > Could you update the description of the PR with the latest approach please - > others are unlikely to read all the comments to realize this has changed. > @dholmes-ora > > > Is this thread-safe? > > I don't think so, but would the stacklock variant > owning_thread_from_monitor_owner not suffer from the same problem? @tstuefe Yes but that code has already had its thread-safety properties determined (presumably) long ago. Checking whether an address is within a thread's stack is pretty thread-safe. The new code needs to ensure that iteration with `contains` is safe in the face of a concurrent push/pop/remove. Or it may be these functions are only called at a safepoint? If so there should be an assert in that case, so I presume that is not the case. - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482707476
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 10:10:58 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > > src/hotspot/share/runtime/synchronizer.cpp line 923: > >> 921: static bool is_lock_owned(Thread* thread, oop obj) { >> 922: assert(UseFastLocking, "only call this with fast-locking enabled"); >> 923: return thread->is_Java_thread() ? >> reinterpret_cast(thread)->lock_stack().contains(obj) : false; > > Here and later, should use `JavaThread::cast(thread)` instead of > `reinterpret_cast`? It also sometimes subsumes the asserts, as > `JT::cast` checks the type. Only JavaThreads can own monitors so this function should take a JavaThread in the first place. - PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147499577
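The suggested narrowing can be sketched as follows; all types here are illustrative stand-ins, not the HotSpot classes. Taking the narrower thread type directly removes both the runtime type check and the cast at the call site.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for the thread-local lock stack.
struct LockStackStandIn {
    uintptr_t _elems[8];
    int _top = 0;
    bool contains(uintptr_t oop) const {
        for (int i = 0; i < _top; i++) {
            if (_elems[i] == oop) return true;
        }
        return false;
    }
};

// Stand-in for JavaThread: the only thread kind that owns a lock stack.
struct JavaThreadStandIn {
    LockStackStandIn _lock_stack;
    LockStackStandIn& lock_stack() { return _lock_stack; }
};

// Before: is_lock_owned(Thread*, oop) had to test is_Java_thread() and
// cast. After: only JavaThreads can own monitors, so take the narrow type.
static bool is_lock_owned(JavaThreadStandIn* thread, uintptr_t obj) {
    return thread->lock_stack().contains(obj);
}
```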
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Fri, 24 Mar 2023 07:00:35 GMT, David Holmes wrote: >> Roman Kennke has updated the pull request incrementally with two additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2 >> - Set condition flags correctly after fast-lock call on aarch64 > >> The lock-stack is grown when needed. > > Could you update the description of the PR with the latest approach please - > others are unlikely to read all the comments to realize this has changed. @dholmes-ora > Is this thread-safe? I don't think so, but would the stacklock variant `owning_thread_from_monitor_owner` not suffer from the same problem? - PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482386684
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' 
are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. 
All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' 
are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly lots) of such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When that happens, the fast-lock gets inflated to a full >> monitor. It is not clear if it is worth to add support for recursive >> fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, and thus handing over to the contending >> thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only >> use heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. 
All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc as such is ok, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can lead to a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the >> hashcode has been installed. Fast-locked headers can be used directly, for >> monitor-locked objects we can easily reach-through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke wrote: >> This change adds a fast-locking scheme as an alternative to the current >> stack-locking implementation. It retains the advantages of stack-locking >> (namely fast locking in uncontended code-paths), while avoiding the overload >> of the mark word. That overloading causes massive problems with Lilliput, >> because it means we have to check and deal with this situation when trying >> to access the mark-word. And because of the very racy nature, this turns out >> to be very complex and would involve a variant of the inflation protocol to >> ensure that the object header is stable. (The current implementation of >> setting/fetching the i-hash provides a glimpse into the complexity). >> >> What the original stack-locking does is basically to push a stack-lock onto >> the stack which consists only of the displaced header, and CAS a pointer to >> this stack location into the object header (the lowest two header bits being >> 00 indicate 'stack-locked'). The pointer into the stack can then be used to >> identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two >> header bits to 00 to indicate 'fast-locked' but does *not* overload the >> upper bits with a stack-pointer. Instead, it pushes the object-reference to >> a thread-local lock-stack. This is a new structure which is basically a >> small array of oops that is associated with each thread. Experience shows >> that this array typcially remains very small (3-5 elements). Using this lock >> stack, it is possible to query which threads own which locks. Most >> importantly, the most common question 'does the current thread own me?' is >> very quickly answered by doing a quick scan of the array. More complex >> queries like 'which thread owns X?' 
are not performed in very >> performance-critical paths (usually in code like JVMTI or deadlock >> detection) where it is ok to do more complex operations (and we already do). >> The lock-stack is also a new set of GC roots, and would be scanned during >> thread scanning, possibly concurrently, via the normal protocols. >> >> The lock-stack is grown when needed. This means that we need to check for >> potential overflow before attempting locking. When that is the case, locking >> fast-paths would call into the runtime to grow the stack and handle the >> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check >> on method entry to avoid (possibly many) such checks at locking sites. >> >> In contrast to stack-locking, fast-locking does *not* support recursive >> locking (yet). When a recursive locking attempt happens, the fast-lock gets >> inflated to a full monitor. It is not clear whether it is worth adding >> support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked >> object, it must inflate the fast-lock to a full monitor. Normally, we need >> to know the current owning thread, and record that in the monitor, so that >> the contending thread can wait for the current owner to properly exit the >> monitor. However, fast-locking doesn't have this information. What we do >> instead is to record a special marker ANONYMOUS_OWNER. When the thread that >> currently holds the lock arrives at monitorexit, and observes >> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, >> and then properly exits the monitor, thus handing the lock over to the >> contending thread. >> >> As an alternative, I considered removing stack-locking altogether, and only >> using heavy monitors. In most workloads this did not show measurable >> regressions. However, in a few workloads, I have observed severe >> regressions. 
All of them have been using old synchronized Java collections >> (Vector, Stack), StringBuffer or similar code. The combination of two >> conditions leads to regressions without stack- or fast-locking: 1. The >> workload synchronizes on uncontended locks (e.g. single-threaded use of >> Vector or StringBuffer) and 2. The workload churns such locks. IOW, >> uncontended use of Vector, StringBuffer, etc. as such is OK, but creating >> lots of such single-use, single-threaded-locked objects leads to massive >> ObjectMonitor churn, which can have a significant performance impact. But >> alas, such code exists, and we probably don't want to punish it if we can >> avoid it. >> >> This change makes it possible to simplify (and speed up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the >> (tagged) ObjectMonitor pointer into the object header. >> - Accessing the hashcode can now always be done in the fast path, once the >> hashcode has been installed. Fast-locked headers can be used directly; for >> monitor-locked objects we can easily reach through to the displaced header. >> This is safe because Java threads participate in monitor deflation
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 23 Mar 2023 16:32:57 GMT, Roman Kennke wrote: > Is anybody familiar with the academic literature on this topic? I am sure I > am not the first person who has come up with this form of locking. Maybe we > could use a name that refers to some academic paper? Well, not to diminish this in any way, but all you are doing is moving the lock-record from the stack frame (indexed from the markword) to a heap-allocated side-table (indexed via the thread itself). The "fast-locking" is still the bit that uses the markword to indicate the locked state, and that hasn't changed. Encoding lock state in an object header has a number of names in the literature, depending on whose scheme it was: IBM had ThinLocks; the Sun Research VM (EVM) had meta-locks; etc. Hotspot doesn't really have a name for its variation. And as I said, you aren't changing that aspect but modifying what data structure is used to access the lock-records. So the property Jesper was looking for, IMO, may be something like `UseHeapLockRecords` - though that can unfortunately be parsed as using records for the HeapLock. :( I think it was mentioned somewhere above that in the Java Object Monitor prototyping work we avoided using these kinds of boolean flags by defining a single "policy" flag that could take on different values for different implementation schemes. These are simply numbered, so for example:

- policy 0: use existing/legacy locking with stack-based lock records
- policy 1: use heavyweight locks (i.e. UseHeavyMonitors)
- policy 2: use the new approach with heap-allocated lock-records

- PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1482176363
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Thu, 23 Mar 2023 13:25:52 GMT, Jesper Wilhelmsson wrote:

> UseNewLocks... Surely there must be a better name? For how long will these be the new locks? Do we rename the flag to UseOldLocks when the next locking scheme comes along? There must be some property that differentiates these locks from the older locks other than being new. Why not name the flag after that property?

The main difference is that this new approach doesn't overload the object header with a stack-pointer in a racy way. UseNonRacyLocking?

- PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1481450601
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Wed, 22 Mar 2023 09:46:04 GMT, Roman Kennke wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 670:
>>
>>> 668: get_thread (scrReg); // beware: clobbers ICCs
>>> 669: movptr(Address(boxReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), scrReg);
>>> 670: xorptr(boxReg, boxReg); // set icc.ZFlag = 1 to indicate success
>>
>> Should this be under `if (UseFastLocking)`?
>
> I don't think so, unless we also want to change all the stuff in x86_32.ad to not fetch the thread before calling into fast_unlock(). However, I think it is a nice and useful change. I could break it out of this PR and get it reviewed separately; it is a side-effect of the new locking implementation insofar as we always require a thread register, and allocate it before going into fast_lock().

I missed that it is under #ifndef LP64. Yes, it makes sense since you are now passing `thread` in a register. And why do we need `get_thread()` at line 708 if we already have it? It is still hard to follow this 32-bit code: what each register is holding, what the value `3` is, and why we don't have checks similar to the LP64 code after the CAS.

- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1145223128
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Wed, 22 Mar 2023 09:47:47 GMT, Roman Kennke wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 791:
>>
>>> 789: Compile::current()->output()->add_stub(stub);
>>> 790: jcc(Assembler::notEqual, stub->entry());
>>> 791: bind(stub->continuation());
>>
>> Why use a stub here and not inline the code? Because the branch is mostly not taken?
>
> Yes, the branch is mostly not taken. If we inlined the code, we would have to take a forward branch on the very common path to skip over the (rare) part that handles the ANON monitor owner. This would throw off static branch prediction and is discouraged by the Intel optimization guide.

Okay.

- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1145202521
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
On Wed, 22 Mar 2023 00:25:43 GMT, Vladimir Kozlov wrote:

>> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
>>
>>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>>  - Set condition flags correctly after fast-lock call on aarch64
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 670:
>
>> 668: get_thread (scrReg); // beware: clobbers ICCs
>> 669: movptr(Address(boxReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), scrReg);
>> 670: xorptr(boxReg, boxReg); // set icc.ZFlag = 1 to indicate success
>
> Should this be under `if (UseFastLocking)`?

I don't think so, unless we also want to change all the stuff in x86_32.ad to not fetch the thread before calling into fast_unlock(). However, I think it is a nice and useful change. I could break it out of this PR and get it reviewed separately; it is a side-effect of the new locking implementation insofar as we always require a thread register, and allocate it before going into fast_lock().

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 791:
>
>> 789: Compile::current()->output()->add_stub(stub);
>> 790: jcc(Assembler::notEqual, stub->entry());
>> 791: bind(stub->continuation());
>
> Why use a stub here and not inline the code? Because the branch is mostly not taken?

Yes, the branch is mostly not taken. If we inlined the code, we would have to take a forward branch on the very common path to skip over the (rare) part that handles the ANON monitor owner. This would throw off static branch prediction and is discouraged by the Intel optimization guide.

- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1144501909
- PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1144504528
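The layout argument above — keep the rare ANONYMOUS_OWNER fix-up out of the straight-line common path so the hot path takes no forward branch — can be illustrated in C++. This is a hedged sketch, not HotSpot code: the names (`monitor_exit`, `fix_anonymous_owner`) and the marker value are illustrative, and the `noinline` helper stands in for the C2 stub that the generated code jumps to only in the rare case.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative marker for "owner unknown, lock was inflated by a contender".
constexpr uintptr_t kAnonymousOwner = ~uintptr_t(0);

// Cold out-of-line path (the analogue of the C2 stub): the thread that
// actually holds the lock claims ownership before exiting the monitor.
[[gnu::noinline]] void fix_anonymous_owner(uintptr_t& owner, uintptr_t self) {
  owner = self;
}

void monitor_exit(uintptr_t& owner, uintptr_t self) {
  if (owner == kAnonymousOwner) {
    // Rare: a contending thread inflated our fast-lock; fix the owner first.
    // Keeping this in a separate noinline function lets the compiler lay the
    // common path out as straight-line code, as the stub does in the JIT.
    fix_anonymous_owner(owner, self);
  }
  assert(owner == self);  // only the owner may exit
  owner = 0;              // release the monitor
}
```

The design point is the same as in the generated assembly: the common path falls through, and only the uncommon ANONYMOUS_OWNER case jumps away and back.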
Re: RFR: 8291555: Implement alternative fast-locking scheme [v29]
This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity.)

What the original stack-locking does is basically push a stack-lock onto the stack, which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure which is basically a small array of oops associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection), where it is OK to do more complex operations (and we already do).
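The lock-stack described above can be sketched as follows. This is a minimal single-threaded simulation, not the actual HotSpot data structure: the names (`LockStack`, `push`, `pop`, `contains`) mirror the description but are illustrative, and a plain `std::vector` stands in for the on-demand-grown array of oops.

```cpp
#include <cassert>
#include <vector>

struct Oop { int dummy; };  // stand-in for a Java object reference

// Per-thread structure: a small growable array of object references.
// Experience per the PR description: typically stays at 3-5 elements.
class LockStack {
  std::vector<Oop*> _stack;
public:
  // On fast-lock: record that this thread holds the lock on o.
  void push(Oop* o) { _stack.push_back(o); }

  // On fast-unlock: balanced locking means o is the most recent entry.
  void pop(Oop* o) {
    assert(!_stack.empty() && _stack.back() == o);
    _stack.pop_back();
  }

  // The common query 'does this thread own o?' is a short linear scan.
  bool contains(const Oop* o) const {
    for (Oop* e : _stack)
      if (e == o) return true;
    return false;
  }
};
```

Because the array is tiny, the linear scan in `contains()` is cheap; queries across threads ('which thread owns X?') would scan every thread's stack, which is acceptable since they only occur off the hot path.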
The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.

The lock-stack is grown when needed. This means that we need to check for potential overflow before attempting locking. When that is the case, locking fast-paths would call into the runtime to grow the stack and handle the locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method entry to avoid (possibly many) such checks at locking sites.

In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.

One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record it in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.

As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is OK, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change enables us to simplify (and speed up!) a lot of code:

- The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
- Accessing the hashcode could now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.

Testing:
- [x] tier1 x86_64 x aarch64 x +UseFastLocking
- [x] tier2
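The header-bit transition at the heart of the scheme — CAS the lowest two mark-word bits to 00 while leaving the upper bits untouched — can be sketched like this. This is a simulation under assumed bit values (01 = unlocked/neutral, 00 = fast-locked, following the usual HotSpot low-bit tagging), not JVM code; the function names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

constexpr uintptr_t kLockMask   = 0b11;  // lowest two mark-word bits
constexpr uintptr_t kUnlocked   = 0b01;  // neutral header (assumed encoding)
constexpr uintptr_t kFastLocked = 0b00;  // fast-locked

// Try to fast-lock: CAS the low bits 01 -> 00. Unlike stack-locking, the
// upper bits (hash, age, ...) are preserved -- no stack pointer is stored.
// On success the caller would also push the object onto its lock-stack.
bool try_fast_lock(std::atomic<uintptr_t>& mark) {
  uintptr_t old_mark = mark.load(std::memory_order_relaxed);
  if ((old_mark & kLockMask) != kUnlocked)
    return false;  // already locked or inflated: take the slow path
  uintptr_t new_mark = (old_mark & ~kLockMask) | kFastLocked;
  return mark.compare_exchange_strong(old_mark, new_mark);
}

// Try to fast-unlock: CAS the low bits 00 -> 01. Failure means a contender
// inflated the lock to a monitor (recording ANONYMOUS_OWNER), so the caller
// must go through the monitor-exit slow path instead.
bool try_fast_unlock(std::atomic<uintptr_t>& mark) {
  uintptr_t old_mark = mark.load(std::memory_order_relaxed);
  if ((old_mark & kLockMask) != kFastLocked)
    return false;
  uintptr_t new_mark = (old_mark & ~kLockMask) | kUnlocked;
  return mark.compare_exchange_strong(old_mark, new_mark);
}
```

Note how the upper bits flow through both transitions unchanged, which is what makes the installed i-hash readable on the fast path regardless of lock state.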