On Wed, 18 Jan 2023 20:10:21 GMT, Roman Kennke <[email protected]> wrote:
>> This change adds a fast-locking scheme as an alternative to the current
>> stack-locking implementation. It retains the advantages of stack-locking
>> (namely fast locking in uncontended code-paths), while avoiding the overload
>> of the mark word. That overloading causes massive problems with Lilliput,
>> because it means we have to check and deal with this situation when trying
>> to access the mark-word. And because of the very racy nature, this turns out
>> to be very complex and would involve a variant of the inflation protocol to
>> ensure that the object header is stable. (The current implementation of
>> setting/fetching the i-hash provides a glimpse into the complexity).
>>
>> What the original stack-locking does is basically to push a stack-lock onto
>> the stack which consists only of the displaced header, and CAS a pointer to
>> this stack location into the object header (the lowest two header bits being
>> 00 indicate 'stack-locked'). The pointer into the stack can then be used to
>> identify which thread currently owns the lock.
>>
>> This change basically reverses stack-locking: It still CASes the lowest two
>> header bits to 00 to indicate 'fast-locked' but does *not* overload the
>> upper bits with a stack-pointer. Instead, it pushes the object-reference to
>> a thread-local lock-stack. This is a new structure which is basically a
>> small array of oops that is associated with each thread. Experience shows
>> that this array typcially remains very small (3-5 elements). Using this lock
>> stack, it is possible to query which threads own which locks. Most
>> importantly, the most common question 'does the current thread own me?' is
>> very quickly answered by doing a quick scan of the array. More complex
>> queries like 'which thread owns X?' are not performed in very
>> performance-critical paths (usually in code like JVMTI or deadlock
>> detection) where it is ok to do more complex operations (and we already do).
>> The lock-stack is also a new set of GC roots, and would be scanned during
>> thread scanning, possibly concurrently, via the normal
protocols.
>>
>> The lock-stack is grown when needed. This means that we need to check for
>> potential overflow before attempting locking. When that is the case, locking
>> fast-paths would call into the runtime to grow the stack and handle the
>> locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check
>> on method entry to avoid (possibly lots) of such checks at locking sites.
>>
>> In contrast to stack-locking, fast-locking does *not* support recursive
>> locking (yet). When that happens, the fast-lock gets inflated to a full
>> monitor. It is not clear if it is worth to add support for recursive
>> fast-locking.
>>
>> One trouble is that when a contending thread arrives at a fast-locked
>> object, it must inflate the fast-lock to a full monitor. Normally, we need
>> to know the current owning thread, and record that in the monitor, so that
>> the contending thread can wait for the current owner to properly exit the
>> monitor. However, fast-locking doesn't have this information. What we do
>> instead is to record a special marker ANONYMOUS_OWNER. When the thread that
>> currently holds the lock arrives at monitorexit, and observes
>> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself,
>> and then properly exits the monitor, and thus handing over to the contending
>> thread.
>>
>> As an alternative, I considered to remove stack-locking altogether, and only
>> use heavy monitors. In most workloads this did not show measurable
>> regressions. However, in a few workloads, I have observed severe
>> regressions. All of them have been using old synchronized Java collections
>> (Vector, Stack), StringBuffer or similar code. The combination of two
>> conditions leads to regressions without stack- or fast-locking: 1. The
>> workload synchronizes on uncontended locks (e.g. single-threaded use of
>> Vector or StringBuffer) and 2. The workload churns such locks. IOW,
>> uncontended use of Vector, StringBuffer, etc as such is ok, but creating
>> lots of such single-use, single-threaded-locked objects leads to massive
>> ObjectMonitor churn, which can lead to a significant performance impact. But
>> alas, such code exists, and we probably don't want to punish it if we can
>> avoid it.
>>
>> This change enables to simplify (and speed-up!) a lot of code:
>>
>> - The inflation protocol is no longer necessary: we can directly CAS the
>> (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the
>> hashcode has been installed. Fast-locked headers can be used directly, for
>> monitor-locked objects we can easily reach-through to the displaced header.
>> This is safe because Java threads participate in monitor deflation protocol.
>> This would be implemented in a separate PR
>>
>>
>> Testing:
>> - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>> - [x] Several real-world applications have been tested with this change in
>> tandem with Lilliput without any problems, yet
>>
>> ### Performance
>>
>> #### Renaissance
>>
>>
>>
>> | x86_64 | | | | aarch64 | |
>> -- | -- | -- | -- | -- | -- | -- | --
>> | stack-locking | fast-locking | | | stack-locking | fast-locking |
>> AkkaUct | 841.884 | 836.948 | 0.59% | | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11444.511 | 11606.66 | -1.40% | | 11382.594 | 11638.036 | -2.19%
>> Als | 1367.183 | 1359.358 | 0.58% | | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% | | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% | | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% | | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% | | 9898.1 | 10097.867 | -1.98%
>> NaiveBayes | 587.607 | 581.608 | 1.03% | | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% | | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% | | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% | | 2685.722 | 2689.056 |
>> -0.12%
>> ParMnemonics | 2527.228 | 2564.667 | -1.46% | | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% | | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% | | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% | | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% | | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% | | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | | 1088.182 | 1092.266 |
>> -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% | | 12017.964 | 11902.559 |
>> 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% | | 4750.751 | 4769.274 |
>> -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% | | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Revert UseFastLocking to default to off
src/hotspot/share/oops/markWord.hpp line 183:
> 181: }
> 182: markWord set_fast_locked() const {
> 183: return markWord(value() & ~lock_mask_in_place);
Perhaps add a comment above L183:
// Clear the lock_mask_in_place bits to set locked_value:
src/hotspot/share/runtime/lockStack.cpp line 34:
> 32:
> 33: LockStack::LockStack() :
> 34: _base(UseFastLocking && !UseHeavyMonitors ? NEW_C_HEAP_ARRAY(oop,
> INITIAL_CAPACITY, mtSynchronizer) : NULL),
Okay so `UseFastLocking && UseHeavyMonitors`, then we don't need the lock stack.
src/hotspot/share/runtime/lockStack.inline.hpp line 81:
> 79: }
> 80: }
> 81: validate("post-contains");
You only do the `validate("post-contains")` call when `false` is
returned. Why not also validate for the `true` branch?
src/hotspot/share/runtime/objectMonitor.cpp line 1141:
> 1139: assert(_recursions == 0, "invariant");
> 1140: set_owner_from_BasicLock(cur, current); // Convert from
> BasicLock* to Thread*.
> 1141: _recursions = 0;
Hmmm... fast-locking also has two different ways that a thread can own an
ObjectMonitor: 1) owner == thread* and 2) owner == anonymous and lock on
the owning thread's lock stack.
If `UseFastLocking` is enabled and `ObjectMonitor::exit()` is called by a
thread that
owns the ObjectMonitor in the secondary way, then this code will fail the
assert on
L1158 in non-product bits. There needs to be check to see if the current thread
owns
the fast-lock and then update the owner in the ObjectMonitor after asserting
about
recursions and resetting recursions to 0.
src/hotspot/share/runtime/synchronizer.cpp line 568:
> 566: monitor->set_owner_from_anonymous(current);
> 567: monitor->exit(current);
> 568: }
Hmmm... We're in `ObjectSynchronizer::exit()` so we should be the owner of
the ObjectMonitor so I'm not yet sure what "Another thread beat us" means.
XXX
-------------
PR: https://git.openjdk.org/jdk/pull/10907