Re: RFR: 8291555: Implement alternative fast-locking scheme [v73]

Roman Kennke Thu, 04 May 2023 22:54:59 -0700

> This change adds a fast-locking scheme as an alternative to the current 
> stack-locking implementation. It retains the advantages of stack-locking 
> (namely fast locking in uncontended code-paths), while avoiding the overload 
> of the mark word. That overloading causes massive problems with Lilliput, 
> because it means we have to check and deal with this situation when trying to 
> access the mark-word. And because of the very racy nature, this turns out to 
> be very complex and would involve a variant of the inflation protocol to 
> ensure that the object header is stable. (The current implementation of 
> setting/fetching the i-hash provides a glimpse into the complexity).
> 
> What the original stack-locking does is basically to push a stack-lock onto 
> the stack which consists only of the displaced header, and CAS a pointer to 
> this stack location into the object header (the lowest two header bits being 
> 00 indicate 'stack-locked'). The pointer into the stack can then be used to 
> identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: It still CASes the lowest two 
> header bits to 00 to indicate 'fast-locked' but does *not* overload the upper 
> bits with a stack-pointer. Instead, it pushes the object-reference to a 
> thread-local lock-stack. This is a new structure which is basically a small 
> array of oops that is associated with each thread. Experience shows that this 
> array typcially remains very small (3-5 elements). Using this lock stack, it 
> is possible to query which threads own which locks. Most importantly, the 
> most common question 'does the current thread own me?' is very quickly 
> answered by doing a quick scan of the array. More complex queries like 'which 
> thread owns X?' are not performed in very performance-critical paths (usually 
> in code like JVMTI or deadlock detection) where it is ok to do more complex 
> operations (and we already do). The lock-stack is also a new set of GC roots, 
> and would be scanned during thread scanning, possibly concurrently, via the 
> normal p
 rotocols.
> 
> The lock-stack is fixed size, currently with 8 elements. According to my 
> experiments with various workloads, this covers the vast majority of 
> workloads (in-fact, most workloads seem to never exceed 5 active locks per 
> thread at a time). We check for overflow in the fast-paths and when the 
> lock-stack is full, we take the slow-path, which would inflate the lock to a 
> monitor. That case should be very rare.
> 
> In contrast to stack-locking, fast-locking does *not* support recursive 
> locking (yet). When that happens, the fast-lock gets inflated to a full 
> monitor. It is not clear if it is worth to add support for recursive 
> fast-locking.
> 
> One trouble is that when a contending thread arrives at a fast-locked object, 
> it must inflate the fast-lock to a full monitor. Normally, we need to know 
> the current owning thread, and record that in the monitor, so that the 
> contending thread can wait for the current owner to properly exit the 
> monitor. However, fast-locking doesn't have this information. What we do 
> instead is to record a special marker ANONYMOUS_OWNER. When the thread that 
> currently holds the lock arrives at monitorexit, and observes 
> ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, 
> and then properly exits the monitor, and thus handing over to the contending 
> thread.
> 
> As an alternative, I considered to remove stack-locking altogether, and only 
> use heavy monitors. In most workloads this did not show measurable 
> regressions. However, in a few workloads, I have observed severe regressions. 
> All of them have been using old synchronized Java collections (Vector, 
> Stack), StringBuffer or similar code. The combination of two conditions leads 
> to regressions without stack- or fast-locking: 1. The workload synchronizes 
> on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 
> 2. The workload churns such locks. IOW, uncontended use of Vector, 
> StringBuffer, etc as such is ok, but creating lots of such single-use, 
> single-threaded-locked objects leads to massive ObjectMonitor churn, which 
> can lead to a significant performance impact. But alas, such code exists, and 
> we probably don't want to punish it if we can avoid it.
> 
> This change enables to simplify (and speed-up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the 
> (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now be done in the fastpath always, if the 
> hashcode has been installed. Fast-locked headers can be used directly, for 
> monitor-locked objects we can easily reach-through to the displaced header. 
> This is safe because Java threads participate in monitor deflation protocol. 
> This would be implemented in a separate PR
> 
> Also, and I might be mistaken here, this new lightweight locking would make 
> synchronized work better with Loom: Because the lock-records are no longer 
> scattered across the stack, but instead are densely packed into the 
> lock-stack, it should be easy for a vthread to save its lock-stack upon 
> unmounting and restore it when re-mounting. However, I am not sure about 
> this, and this PR does not attempt to implement that support.
> 
> Testing:
>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>  - [x] Several real-world applications have been tested with this change in 
> tandem with Lilliput without any problems, yet
> 
> ### Performance
> 
> #### Simple Microbenchmark
> 
> The microbenchmark exercises only the locking primitives for monitorenter and 
> monitorexit, without contention. The benchmark can be found 
> (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.
> 
> |  | x86_64 | aarch64 |
> | -- | -- | -- |
> | -UseFastLocking | 20.651 | 20.764 |
> | +UseFastLocking | 18.896 | 18.908 |
> 
> 
> #### Renaissance
> 
>   | x86_64 |   |   |   | aarch64 |   |  
> -- | -- | -- | -- | -- | -- | -- | --
>   | stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
> Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | -1.21%
> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%%
> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
> ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | 
> -0.37%
> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 
> 0.97%
> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | 
> -0.39%
> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%


Roman Kennke has updated the pull request incrementally with one additional 
commit since the last revision:

  Relax zapped-entry test when calling thread is not owning thread

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10907/files
  - new: https://git.openjdk.org/jdk/pull/10907/files/e06c5ef1..43cdbb53

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10907&range=72
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10907&range=71-72

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/10907.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/10907/head:pull/10907

PR: https://git.openjdk.org/jdk/pull/10907

Re: RFR: 8291555: Implement alternative fast-locking scheme [v73]

Reply via email to