On Fri, 28 Oct 2022 20:17:37 GMT, Roman Kennke <rken...@openjdk.org> wrote:
> This change adds a fast-locking scheme as an alternative to the current > stack-locking implementation. It retains the advantages of stack-locking > (namely fast locking in uncontended code-paths), while avoiding the overload > of the mark word. That overloading causes massive problems with Lilliput, > because it means we have to check and deal with this situation when trying to > access the mark-word. And because of the very racy nature, this turns out to > be very complex and would involve a variant of the inflation protocol to > ensure that the object header is stable. (The current implementation of > setting/fetching the i-hash provides a glimpse into the complexity). > > What the original stack-locking does is basically to push a stack-lock onto > the stack which consists only of the displaced header, and CAS a pointer to > this stack location into the object header (the lowest two header bits being > 00 indicate 'stack-locked'). The pointer into the stack can then be used to > identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper > bits with a stack-pointer. Instead, it pushes the object-reference to a > thread-local lock-stack. This is a new structure which is basically a small > array of oops that is associated with each thread. Experience shows that this > array typcially remains very small (3-5 elements). Using this lock stack, it > is possible to query which threads own which locks. Most importantly, the > most common question 'does the current thread own me?' is very quickly > answered by doing a quick scan of the array. More complex queries like 'which > thread owns X?' are not performed in very performance-critical paths (usually > in code like JVMTI or deadlock detection) where it is ok to do more complex > operations (and we already do). The lock-stack is also a new set of GC roots, > and would be scanned during thread scanning, possibly concurrently, via the > normal p rotocols. > > The lock-stack is grown when needed. This means that we need to check for > potential overflow before attempting locking. When that is the case, locking > fast-paths would call into the runtime to grow the stack and handle the > locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check > on method entry to avoid (possibly lots) of such checks at locking sites. > > In contrast to stack-locking, fast-locking does *not* support recursive > locking (yet). When that happens, the fast-lock gets inflated to a full > monitor. It is not clear if it is worth to add support for recursive > fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, > it must inflate the fast-lock to a full monitor. Normally, we need to know > the current owning thread, and record that in the monitor, so that the > contending thread can wait for the current owner to properly exit the > monitor. However, fast-locking doesn't have this information. What we do > instead is to record a special marker ANONYMOUS_OWNER. When the thread that > currently holds the lock arrives at monitorexit, and observes > ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, > and then properly exits the monitor, and thus handing over to the contending > thread. > > As an alternative, I considered to remove stack-locking altogether, and only > use heavy monitors. In most workloads this did not show measurable > regressions. However, in a few workloads, I have observed severe regressions. > All of them have been using old synchronized Java collections (Vector, > Stack), StringBuffer or similar code. The combination of two conditions leads > to regressions without stack- or fast-locking: 1. The workload synchronizes > on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and > 2. The workload churns such locks. IOW, uncontended use of Vector, > StringBuffer, etc as such is ok, but creating lots of such single-use, > single-threaded-locked objects leads to massive ObjectMonitor churn, which > can lead to a significant performance impact. But alas, such code exists, and > we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the > (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the > hashcode has been installed. Fast-locked headers can be used directly, for > monitor-locked objects we can easily reach-through to the displaced header. > This is safe because Java threads participate in monitor deflation protocol. > This would be implemented in a separate PR > > > Testing: > - [x] tier1 x86_64 x aarch64 x +UseFastLocking > - [x] tier2 x86_64 x aarch64 x +UseFastLocking > - [x] tier3 x86_64 x aarch64 x +UseFastLocking > - [x] tier4 x86_64 x aarch64 x +UseFastLocking > - [x] tier1 x86_64 x aarch64 x -UseFastLocking > - [x] tier2 x86_64 x aarch64 x -UseFastLocking > - [x] tier3 x86_64 x aarch64 x -UseFastLocking > - [x] tier4 x86_64 x aarch64 x -UseFastLocking > - [x] Several real-world applications have been tested with this change in > tandem with Lilliput without any problems, yet > > ### Performance > > #### Renaissance > > > > | x86_64 | | | | aarch64 | | > -- | -- | -- | -- | -- | -- | -- | -- > | stack-locking | fast-locking | | | stack-locking | fast-locking | > AkkaUct | 841.884 | 836.948 | 0.59% | | 1475.774 | 1465.647 | 0.69% > Reactors | 11444.511 | 11606.66 | -1.40% | | 11382.594 | 11638.036 | -2.19% > Als | 1367.183 | 1359.358 | 0.58% | | 1678.103 | 1688.067 | -0.59% > ChiSquare | 577.021 | 577.398 | -0.07% | | 986.619 | 988.063 | -0.15% > GaussMix | 817.459 | 819.073 | -0.20% | | 1154.293 | 1155.522 | -0.11% > LogRegression | 598.343 | 603.371 | -0.83% | | 638.052 | 644.306 | -0.97% > MovieLens | 8248.116 | 8314.576 | -0.80% | | 9898.1 | 10097.867 | -1.98% > NaiveBayes | 587.607 | 581.608 | 1.03% | | 541.583 | 550.059 | -1.54% > PageRank | 3260.553 | 3263.472 | -0.09% | | 4376.405 | 4381.101 | -0.11% > FjKmeans | 979.978 | 976.122 | 0.40% | | 774.312 | 771.235 | 0.40% > FutureGenetic | 2187.369 | 2183.271 | 0.19% | | 2685.722 | 2689.056 | -0.12% > ParMnemonics | 2527.228 | 2564.667 | -1.46% | | 4278.225 | 4263.863 | 0.34% > Scrabble | 111.882 | 111.768 | 0.10% | | 151.796 | 153.959 | -1.40% > RxScrabble | 210.252 | 211.38 | -0.53% | | 310.116 | 315.594 | -1.74% > Dotty | 750.415 | 752.658 | -0.30% | | 1033.636 | 1036.168 | -0.24% > ScalaDoku | 3072.05 | 3051.2 | 0.68% | | 3711.506 | 3690.04 | 0.58% > ScalaKmeans | 211.427 | 209.957 | 0.70% | | 264.38 | 265.788 | -0.53% > ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | | 1088.182 | 1092.266 | > -0.37% > Philosophers | 6450.124 | 6565.705 | -1.76% | | 12017.964 | 11902.559 | > 0.97% > FinagleChirper | 3953.623 | 3972.647 | -0.48% | | 4750.751 | 4769.274 | > -0.39% > FinagleHttp | 3970.526 | 4005.341 | -0.87% | | 5294.125 | 5296.224 | -0.04% Hi Roman, I get this error on aarch64 in random tests: # Internal Error (.../open/src/hotspot/share/runtime/lockStack.cpp:50), pid=422380, tid=422437 # assert(*loc1 != *loc2) failed: entries must be unique: pre-contains V [libjvm.so+0x12a9f48] LockStack::validate(char const*) const+0x118 (lockStack.cpp:50) V [libjvm.so+0x17c5e60] is_lock_owned(Thread*, oop)+0x70 (lockStack.inline.hpp:72) V [libjvm.so+0x17cac8c] ObjectSynchronizer::inflate(Thread*, oop, ObjectSynchronizer::InflateCause)+0x6cc (synchronizer.cpp:1323) V [libjvm.so+0x17d0648] ObjectSynchronizer::enter(Handle, BasicLock*, JavaThread*)+0x108 (synchronizer.cpp:544) V [libjvm.so+0x16a23b8] SharedRuntime::monitor_enter_helper(oopDesc*, BasicLock*, JavaThread*)+0x1e8 (sharedRuntime.cpp:2294) V [libjvm.so+0x16a2724] SharedRuntime::complete_monitor_locking_C(oopDesc*, BasicLock*, JavaThread*)+0xb4 (sharedRuntime.cpp:2301) Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::_complete_monitor_locking_Java 0x0000fffd30576414 J 9196 c2 java.util.jar.JarFile.getManEntry()Ljava/util/jar/JarEntry; java.base@21-internal (39 bytes) @ 0x0000fffd3154715c [0x0000fffd31546a80+0x00000000000006dc] ------------- PR: https://git.openjdk.org/jdk/pull/10907