On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke <rken...@openjdk.org> wrote:
>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity.)
>>
>> What the original stack-locking does is basically push a stack-lock onto the stack, consisting only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicating 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>>
>> This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure, basically a small array of oops associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a quick scan of the array. More complex queries like 'which thread owns X?' are not performed on very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do).
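[Editor's note: the scheme quoted above can be sketched as follows. This is a toy model under stated assumptions, not HotSpot code: `MockObject`, `LockStack`, and the constants are made-up names, the mark word is a bare `std::atomic<uintptr_t>`, and only the 01 (unlocked) / 00 (fast-locked) states are modeled.]

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uintptr_t UNLOCKED_VALUE = 0b01;  // low mark bits: unlocked (as in HotSpot)
// Fast-locked state: low mark bits are 00.

struct MockObject {
  std::atomic<uintptr_t> mark{UNLOCKED_VALUE};
};

// Per-thread stack of object references; experience says it stays tiny.
class LockStack {
  std::vector<MockObject*> _stack;
public:
  void push(MockObject* o) { _stack.push_back(o); }
  MockObject* pop() { MockObject* o = _stack.back(); _stack.pop_back(); return o; }
  // 'Does the current thread own me?' is a linear scan of a small array.
  bool contains(MockObject* o) const {
    for (MockObject* e : _stack) if (e == o) return true;
    return false;
  }
};

thread_local LockStack lock_stack;

// Try to flip the low mark bits 01 -> 00; on success, record ownership by
// pushing the object onto the thread-local lock-stack.
bool fast_lock(MockObject* o) {
  uintptr_t expected = o->mark.load() | UNLOCKED_VALUE;        // assume unlocked
  uintptr_t locked   = expected & ~(uintptr_t)0b11;            // low bits -> 00
  if (o->mark.compare_exchange_strong(expected, locked)) {
    lock_stack.push(o);
    return true;
  }
  return false;  // already locked or contended: caller takes the slow path
}

void fast_unlock(MockObject* o) {
  MockObject* top = lock_stack.pop();
  assert(top == o && "unbalanced lock/unlock");
  (void)top;
  o->mark.fetch_or(UNLOCKED_VALUE);  // restore the 'unlocked' low bits
}
```

Note how a recursive `fast_lock` on the same object simply fails the CAS (the mark is no longer in the unlocked state), matching the description that recursive fast-locking is not supported and falls back to inflation.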
>> The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
>>
>> The lock-stack is grown when needed. This means that we need to check for potential overflow before attempting to lock. When that is the case, the locking fast-paths call into the runtime to grow the stack and handle the locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method entry to avoid (possibly lots of) such checks at locking sites.
>>
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
>>
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
>>
>> As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>>
>> This change enables us to simplify (and speed up!) a lot of code:
>>
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
>> - Accessing the hashcode could now always be done in the fast-path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
>>
>> Testing:
>> - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>> - [x] Several real-world applications have been tested with this change in tandem with Lilliput without any problems so far
>>
>> ### Performance
>>
>> #### Simple Microbenchmark
>>
>> The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found [here](https://github.com/rkennke/fastlockbench). Numbers are in ns/op.
>>
>> |                 | x86_64 | aarch64 |
>> | --------------- | ------ | ------- |
>> | -UseFastLocking | 20.651 | 20.764  |
>> | +UseFastLocking | 18.896 | 18.908  |
>>
>> #### Renaissance
>>
>> | Benchmark | x86_64 stack-locking | x86_64 fast-locking | delta | aarch64 stack-locking | aarch64 fast-locking | delta |
>> | -- | -- | -- | -- | -- | -- | -- |
>> | AkkaUct | 841.884 | 836.948 | 0.59% | 1475.774 | 1465.647 | 0.69% |
>> | Reactors | 11041.427 | 11181.451 | -1.25% | 11381.751 | 11521.318 | -1.21% |
>> | Als | 1367.183 | 1359.358 | 0.58% | 1678.103 | 1688.067 | -0.59% |
>> | ChiSquare | 577.021 | 577.398 | -0.07% | 986.619 | 988.063 | -0.15% |
>> | GaussMix | 817.459 | 819.073 | -0.20% | 1154.293 | 1155.522 | -0.11% |
>> | LogRegression | 598.343 | 603.371 | -0.83% | 638.052 | 644.306 | -0.97% |
>> | MovieLens | 8248.116 | 8314.576 | -0.80% | 7569.219 | 7646.828 | -1.01% |
>> | NaiveBayes | 587.607 | 581.608 | 1.03% | 541.583 | 550.059 | -1.54% |
>> | PageRank | 3260.553 | 3263.472 | -0.09% | 4376.405 | 4381.101 | -0.11% |
>> | FjKmeans | 979.978 | 976.122 | 0.40% | 774.312 | 771.235 | 0.40% |
>> | FutureGenetic | 2187.369 | 2183.271 | 0.19% | 2685.722 | 2689.056 | -0.12% |
>> | ParMnemonics | 2434.551 | 2468.763 | -1.39% | 4278.225 | 4263.863 | 0.34% |
>> | Scrabble | 111.882 | 111.768 | 0.10% | 151.796 | 153.959 | -1.40% |
>> | RxScrabble | 210.252 | 211.38 | -0.53% | 310.116 | 315.594 | -1.74% |
>> | Dotty | 750.415 | 752.658 | -0.30% | 1033.636 | 1036.168 | -0.24% |
>> | ScalaDoku | 3072.05 | 3051.2 | 0.68% | 3711.506 | 3690.04 | 0.58% |
>> | ScalaKmeans | 211.427 | 209.957 | 0.70% | 264.38 | 265.788 | -0.53% |
>> | ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | 1088.182 | 1092.266 | -0.37% |
>> | Philosophers | 6450.124 | 6565.705 | -1.76% | 12017.964 | 11902.559 | 0.97% |
>> | FinagleChirper | 3953.623 | 3972.647 | -0.48% | 4750.751 | 4769.274 | -0.39% |
>> | FinagleHttp | 3970.526 | 4005.341 | -0.87% | 5294.125 | 5296.224 | -0.04% |

> Roman Kennke has updated the pull request incrementally with two additional commits since
the last revision:
>
> - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
> - Set condition flags correctly after fast-lock call on aarch64

Cursory review follows.

src/hotspot/cpu/aarch64/aarch64.ad line 3848:

> 3846: __ bind(slow);
> 3847: __ tst(oop, oop); // Force ZF=0 to indicate failure and take slow-path. We know that oop != null.
> 3848: __ b(no_count);

Is this a micro-optimization? I think we can simplify the code by just setting the flags here and then jumping with the usual `__ b(cont)`. This would make the move of `__ b(cont)` below unnecessary.

src/hotspot/cpu/aarch64/aarch64.ad line 3954:

> 3952: // Indicate success on completion.
> 3953: __ cmp(oop, oop);
> 3954: __ b(count);

`aarch64_enc_fast_lock` explicitly sets NE on the failure path. But this code just jumps to `no_count` without setting the flag. Does the code outside this encoding block rely on flags?

src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp line 90:

> 88: Label done;
> 89: // and mark it as unlocked
> 90: orr(hdr, hdr, markWord::unlocked_value);

Indentation:

Suggestion:

    orr(hdr, hdr, markWord::unlocked_value);

src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp line 322:

> 320: // object is already locked (xhandlers expect object to be unlocked)
> 321: CodeEmitInfo* info = state_for(x, x->state(), true);
> 322: LIR_Opr tmp = UseFastLocking ? new_register(T_INT) : LIR_OprFact::illegalOpr;

Is it really `T_INT`, not `T_ADDRESS`? I guess it follows the definition of `LIR_Opr lock` above, but maybe that one is accidentally working because lock addresses fit in 32 bits? I wonder what that `tmp` would be used for; maybe 64-bit math would break with it? (I separately notice that `syncTempOpr()` is just `rax`, so )

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 611:

> 609: bind(slow);
> 610: testptr(objReg, objReg); // force ZF=0 to indicate failure
> 611: jmp(NO_COUNT);

We set a flag on the failure (`NO_COUNT`) path. Should we set the flag on the success (`COUNT`) path as well?
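[Editor's note: the `T_INT` vs `T_ADDRESS` concern above is about a 64-bit address being stored in a 32-bit-wide temp, which only works while every relevant address happens to fit in 32 bits. A tiny stand-alone illustration of that failure mode, not HotSpot code (`narrow_slot` and `round_trips` are made-up names):]

```cpp
#include <cassert>
#include <cstdint>

// Storing a 64-bit address in a 32-bit slot silently drops the high bits,
// so it only "works" while every address handed to it fits in 32 bits.
uint32_t narrow_slot(uint64_t addr) {
  return (uint32_t)addr;  // truncating conversion
}

// Does the address survive a trip through the narrow slot?
bool round_trips(uint64_t addr) {
  return (uint64_t)narrow_slot(addr) == addr;
}
```

Low addresses round-trip, but anything with bits above bit 31 set does not, which is exactly the kind of latent bug a `T_ADDRESS`-width register would avoid.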
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 926:

> 924: mov(boxReg, tmpReg);
> 925: fast_unlock_impl(objReg, boxReg, tmpReg, NO_COUNT);
> 926: jmp(COUNT);

Do we need to care about returning proper flags here?

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9700:

> 9698: // If successful, push object to lock-stack.
> 9699: movl(tmp, Address(thread, JavaThread::lock_stack_offset_offset()));
> 9700: movptr(Address(thread, tmp, Address::times_1), obj);

Minor: I think `Address::times_1` is unnecessary in cases like this. (You might want to grep for other instances of this.)

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1714:

> 1712: // Save the test result, for recursive case, the result is zero
> 1713: __ movptr(Address(lock_reg, mark_word_offset), swap_reg);
> 1714: __ jcc(Assembler::notEqual, slow_path_lock);

Indenting:

Suggestion:

    __ jcc(Assembler::notEqual, slow_path_lock);

src/hotspot/share/runtime/lockStack.cpp line 42:

> 40:
> 41: #ifndef PRODUCT
> 42: void LockStack::validate(const char* msg) const {

Would you also like to check that there are no `nullptr` elements on the stack here?

src/hotspot/share/runtime/lockStack.hpp line 42:

> 40: // We do this instead of a simple index into the array because this allows for
> 41: // efficient addressing in generated code.
> 42: int _offset;

This field is accessed from compiled code; we had better use a strict-width datatype like `uint32_t`, or maybe even `uint16_t`, here.

src/hotspot/share/runtime/lockStack.inline.hpp line 54:

> 52: assert(to_index(_offset) > 0, "underflow, probably unbalanced push/pop");
> 53: _offset -= oopSize;
> 54: oop o = _base[to_index(_offset)];

I think `pop` and `remove` might benefit from zapping the removed elements in the stack, just to make sure we don't accidentally read them. `validate` can then check that everything beyond `end` is zapped, and that there are no zapped elements on the stack.
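[Editor's note: the zap-on-pop suggestion above could look roughly like the following. This is a hedged sketch, not the actual patch: `ZappingLockStack`, `ZAPPED_ENTRY`, `_top`, and the fixed capacity are illustrative, and plain `uintptr_t` stands in for `oop`.]

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t ZAPPED_ENTRY = 0xDEADBEEFu;  // made-up zap pattern
constexpr int CAPACITY = 8;                      // real stack grows on demand

struct ZappingLockStack {
  uintptr_t _base[CAPACITY];
  int _top = 0;

  ZappingLockStack() {
    for (int i = 0; i < CAPACITY; i++) _base[i] = ZAPPED_ENTRY;
  }

  void push(uintptr_t oop_val) {
    assert(_top < CAPACITY && "overflow");
    _base[_top++] = oop_val;
  }

  uintptr_t pop() {
    assert(_top > 0 && "underflow, probably unbalanced push/pop");
    uintptr_t o = _base[--_top];
    _base[_top] = ZAPPED_ENTRY;  // zap so a stale read is caught, not silent
    return o;
  }

  // Everything below _top must be a live, non-zapped, non-null entry;
  // everything at or beyond _top must be zapped.
  bool validate() const {
    for (int i = 0; i < _top; i++) {
      if (_base[i] == ZAPPED_ENTRY || _base[i] == 0) return false;
    }
    for (int i = _top; i < CAPACITY; i++) {
      if (_base[i] != ZAPPED_ENTRY) return false;
    }
    return true;
  }
};
```

With this invariant, `validate` catches both stale reads beyond the top and zapped or null entries accidentally left on the live part of the stack.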
src/hotspot/share/runtime/synchronizer.cpp line 315:

> 313: const markWord mark = obj->mark();
> 314:
> 315: if ((mark.is_fast_locked() && current->lock_stack().contains(oop(obj))) ||

`cast_to_oop(obj)` instead of `oop(obj)`?

src/hotspot/share/runtime/synchronizer.cpp line 923:

> 921: static bool is_lock_owned(Thread* thread, oop obj) {
> 922: assert(UseFastLocking, "only call this with fast-locking enabled");
> 923: return thread->is_Java_thread() ? reinterpret_cast<JavaThread*>(thread)->lock_stack().contains(obj) : false;

Here and later, should we use `JavaThread::cast(thread)` instead of `reinterpret_cast<JavaThread*>`? It also sometimes subsumes the asserts, as `JT::cast` checks the type.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/ObjectMonitor.java line 83:

> 81:
> 82: public boolean isOwnedAnonymous() {
> 83: return addr.getAddressAt(ownerFieldOffset).asLongValue() == 1;

This `1` should be an `ANONYMOUS_OWNER` constant, I think, so that a textual search would hit both `oop.hpp` and this file.
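[Editor's note: the `JavaThread::cast` suggestion above bundles the type check with the downcast. A minimal stand-alone analogue of that pattern, with hypothetical types (this is not HotSpot's `Thread` hierarchy, and `owns` is a stand-in for `lock_stack().contains(obj)`):]

```cpp
#include <cassert>

struct Thread {
  virtual ~Thread() = default;
  virtual bool is_Java_thread() const { return false; }
};

struct JavaThread : Thread {
  bool is_Java_thread() const override { return true; }

  // Checked cast in the style of JavaThread::cast: asserts the dynamic type
  // instead of trusting a bare reinterpret_cast at every call site.
  static JavaThread* cast(Thread* t) {
    assert(t->is_Java_thread() && "must be a JavaThread");
    return static_cast<JavaThread*>(t);
  }

  // Stand-in for lock_stack().contains(obj); always false in this sketch.
  bool owns(int /*obj*/) const { return false; }
};

// The is_lock_owned shape from the review, rewritten with the checked cast.
bool is_lock_owned(Thread* t, int obj) {
  return t->is_Java_thread() && JavaThread::cast(t)->owns(obj);
}
```

The point of the pattern is that the type assertion lives in one place (`cast`), so callers cannot forget it and the `reinterpret_cast` disappears.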
-------------

PR Review: https://git.openjdk.org/jdk/pull/10907#pullrequestreview-1356303825

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147415098
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147411462
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147416146
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147433422
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147441662
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147444139
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147452034
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147452841
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147463354
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147319821
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147469524
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147478054
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147374433
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147479959