Re: RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v104]

Aleksey Shipilev Wed, 13 May 2026 08:33:59 -0700

On Mon, 4 May 2026 23:03:06 GMT, Xiaolong Peng <[email protected]> wrote:


>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>> 
>> Shenandoah always allocates memory with heap lock, we have observed heavy 
>> heap lock contention on memory allocation path in performance analysis of 
>> some service in which we tried to adopt Shenandoah. This change is to 
>> propose an optimization for the code path of memory allocation to improve 
>> heap lock contention, along with the optimization, a better OOD is also done 
>> to Shenandoah memory allocation to reuse the majority of the code:
>> 
>> * ShenandoahAllocator: base class of the allocators, most of the allocation 
>> code is in this class.
>> * ShenandoahMutatorAllocator: allocator for mutator, inherit from 
>> ShenandoahAllocator, only override methods `alloc_start_index`, `verify`, 
>> `_alloc_region_count` and  `_yield_to_safepoint` to customize the allocator 
>> for mutator.
>> * ShenandoahCollectorAllocator: allocator for collector allocation in 
>> Collector partition, similar to ShenandoahMutatorAllocator, only few lines 
>> of code to customize the allocator for Collector. 
>> * ShenandoahOldCollectorAllocator:  allocator for mutator collector 
>> allocation in OldCollector partition, it doesn't inherit the logic from 
>> ShenandoahAllocator for now, the `allocate` method has been overridden to 
>> delegate to `FreeSet::allocate_for_collector` due to the special allocation 
>> considerations for `plab` in old gen. We will rewrite this part later and 
>> move the code out of `FreeSet::allocate_for_collector`
>> 
>> I'm not expecting significant performance impact for most of the cases since 
>> in most case the contention on heap lock it not high enough to cause 
>> performance issue, but in some cases it may improve the latency/performance:
>> 
>> 1. Dacapo lusearch test on EC2 host with 96 CPU cores, p90 is improved from 
>> 500+us to less than 150us, p99 from 1000+us to ~200us. 
>> 
>> java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms31G -Xmx31G 
>> -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions 
>> -XX:+UnlockDiagnosticVMOptions  -XX:-ShenandoahUncommit 
>> -XX:ShenandoahGCMode=generational  -XX:+UseTLAB -jar 
>> ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar  -n 10 lusearch  | grep "metered 
>> full smoothing"
>> 
>> 
>> Openjdk TIP:
>> 
>> ===== DaCapo tail latency, metered full smoothing: 50% 241098 usec, 90% 
>> 402356 usec, 99% 411065 usec, 99.9% 411763 usec, 99.99% 415531 usec, max 
>> 428584 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 902 usec, 9...
>
> Xiaolong Peng has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 386 commits:
> 
>  - Merge branch 'openjdk:master' into cas-alloc-1
>  - Document _collector_allocator_reserved ordering contract
>  - Tame TestCASAllocContention: fewer threads, shorter run, smaller retention
>  - Comment: pair reserve-time inflation with read-time compensation
>    
>    Cross-reference ShenandoahFreeSet::reserve_alloc_regions_internal at
>    every 'bytes_allocated - mutator_allocator_remaining' subtraction site.
>    No behavior change.
>  - Include mutator allocator remaining bytes in unsafe_max_tlab_alloc
>  - Reserve mutator alloc regions after abbreviated degenerated GC
>  - fix: new jtreg tests miss -XX:+UnlockDiagnosticVMOptions
>  - fix: sync _top before CAS in unset_active_alloc_region
>  - fix: sync _top once on CAS success and reset_age after the loop in 
> unset_active_alloc_region
>  - Add jtreg tests for CAS allocator flag combinations and contention
>  - ... and 376 more: https://git.openjdk.org/jdk/compare/ebb3d688...ce68d1fa

No, this requires significantly more work. This is way too hard to review. You 
need to drastically and mercilessly simplify.

I believe it would be significantly simpler if you can introduce "allocator" 
interface separately without any semantic changes first. _Then_ extend that 
allocator with CAS allocations support.

Additionally, I think it clashes with 
https://github.com/openjdk/jdk/pull/31047, and that one probably needs to go in 
first.

src/hotspot/share/gc/shared/plab.cpp line 40:

> 38: }
> 39: 
> 40: size_t PLAB::min_byte_size() {

There is a single use. Inline the implementation to avoid extra shared code 
changes.

src/hotspot/share/gc/shared/plab.cpp line 48:

> 46: }
> 47: 
> 48: size_t PLAB::max_byte_size() {

I don't think this is used anywhere.

src/hotspot/share/gc/shenandoah/heuristics/shenandoahAdaptiveHeuristics.cpp 
line 421:

> 419:   size_t allocated_bytes_since_last_sample = 0;
> 420: 
> 421:   {

I think this clashes with https://github.com/openjdk/jdk/pull/31047, which 
simplifies this path.

src/hotspot/share/gc/shenandoah/shenandoahAffiliation.hpp line 28:

> 26: #define SHARE_GC_SHENANDOAH_SHENANDOAHAFFILIATION_HPP
> 27: 
> 28: #include "utilities/debug.hpp"

Do you need it? If you don't use anything from the header in this file, move 
the include where it is actually needed.

-------------

Changes requested by shade (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26171#pullrequestreview-4282998779
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r3235410521
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r3235405129
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r3235449980
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r3235419377

Re: RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v104]

Reply via email to