> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1:
> Improve Application Throughput with a More Efficient Write-Barrier.
>
> The reason for posting this early is that this is a large change, and the JEP
> process is already taking very long with no end in sight, but we would like
> to have this ready by JDK 25.
>
> ### Current situation
>
> With this change, G1 reduces its post-write barrier so that it much more
> closely resembles Parallel GC's, as described in the JEP. The reason is that
> G1 lags behind Parallel/Serial GC in throughput because of its larger barrier.
>
> The main reason for the current barrier is how G1 implements concurrent
> refinement:
> * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers
> (dirty card queues - dcq) containing the locations of dirtied cards.
> Refinement threads pick up their contents to re-refine. The barrier needs to
> enqueue card locations.
> * For correctness, dirty card updates require fine-grained synchronization
> between mutator and refinement threads.
> * Finally, there is generic code to avoid dirtying cards altogether (filters),
> so that the synchronization and the enqueuing are executed as rarely as
> possible.
>
> These tasks require the current barrier to look as follows, in pseudocode,
> for an assignment `x.a = y`:
>
>
> // Filtering
> if (region(@x.a) == region(y)) goto done; // same region check
> if (y == null) goto done; // null value check
> if (card(@x.a) == young_card) goto done; // write to young gen check
> StoreLoad; // synchronize
> if (card(@x.a) == dirty_card) goto done;
>
> *card(@x.a) = dirty
>
> // Card tracking
> enqueue(card-address(@x.a)) into thread-local-dcq;
> if (thread-local-dcq is not full) goto done;
>
> call runtime to move thread-local-dcq into dcqs
>
> done:
>
>
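
The pseudocode above can be turned into a small runnable toy model. This is only a sketch under stated assumptions, not HotSpot code: addresses are modelled as plain `long` values (0 stands for null), and the class name, the 512-byte card size, the 1 MiB region size and the tiny queue capacity are all made up for illustration.

```java
import java.lang.invoke.VarHandle;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the filtered, queue-based post-write barrier. All names and constants are hypothetical. */
final class QueueBarrierSketch {
    static final int CARD_SHIFT = 9;      // assumption: 512-byte cards
    static final int REGION_SHIFT = 20;   // assumption: 1 MiB regions
    static final byte CLEAN = 0, DIRTY = 1, YOUNG = 2;
    static final int QUEUE_CAPACITY = 4;  // tiny on purpose, so the overflow path is easy to hit

    final byte[] cardTable;
    final List<Integer> localQueue = new ArrayList<>();       // stands in for the thread-local dcq
    final List<List<Integer>> globalSet = new ArrayList<>();  // stands in for the dcqs

    QueueBarrierSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
    }

    /**
     * Barrier for "x.a = y": fieldAddr models @x.a, newValue models y.
     * Returns true if the card was dirtied and enqueued, false if a filter applied.
     */
    boolean postWrite(long fieldAddr, long newValue) {
        // Filtering
        if ((fieldAddr >>> REGION_SHIFT) == (newValue >>> REGION_SHIFT)) return false; // same region
        if (newValue == 0) return false;                                               // null value
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[card] == YOUNG) return false;                                    // write to young gen
        VarHandle.fullFence();                                                         // models StoreLoad
        if (cardTable[card] == DIRTY) return false;                                    // already dirty

        cardTable[card] = DIRTY;

        // Card tracking
        localQueue.add(card);
        if (localQueue.size() < QUEUE_CAPACITY) return true;

        // Models the runtime call that moves the full thread-local dcq into the dcqs.
        globalSet.add(new ArrayList<>(localQueue));
        localQueue.clear();
        return true;
    }
}
```

Even in this simplified form, every store that passes the filters pays for a fence, a card store and a queue append, which is where the instruction count discussed next comes from.
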
> Overall, this post-write barrier alone is in the range of 40-50 total
> instructions, compared to three or four(!) for Parallel and Serial GC.
>
> The large inlined barrier not only increases the code footprint, but also
> prevents some compiler optimizations such as loop unrolling or inlining.
>
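
For contrast, an unconditional card-marking barrier of the kind Parallel and Serial GC use is roughly a shift plus a byte store. The following is a minimal sketch with hypothetical names and an assumed 512-byte card size, not the actual generated code:

```java
/** Toy model of an unconditional card-marking post-write barrier. Names and card size are assumptions. */
final class CardMarkSketch {
    static final int CARD_SHIFT = 9;  // assumption: 512-byte cards
    static final byte CLEAN = 0, DIRTY = 1;

    final byte[] cardTable;

    CardMarkSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
    }

    /** Barrier for "x.a = y": no filters, no fence, no queue, just mark the card covering @x.a. */
    void postWrite(long fieldAddr) {
        cardTable[(int) (fieldAddr >>> CARD_SHIFT)] = DIRTY;
    }

    /** Counts dirty cards, as a collector-side scan of the table would find them. */
    int dirtyCards() {
        int n = 0;
        for (byte c : cardTable) {
            if (c == DIRTY) n++;
        }
        return n;
    }
}
```

The trade-off in this model is that the cost moves from the mutator to whoever later scans the table for dirty cards.
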
> There are several papers showing that this barrier alone can decrease
> throughput by 10-20%
> ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is
> corroborated by some benchmarks (see links).
>
> The main idea for this change is to replace the fine-grained synchronization
> between refinement and mutator threads with coarse-grained synchronization
> based on atomically switching card tables. Mutators only work on the "primary"
> card table, refinement threads on a se...
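
The card-table switching idea can be illustrated with a toy model. The description above is cut off, so this is only a guess at the general shape and not the PR's implementation: mutators mark cards in whichever table is currently primary, and a switch atomically installs another, clean table and hands the old one to the processing side. All names are hypothetical, and the sketch deliberately ignores the extra synchronization a real VM would need for mutators that are in the middle of a barrier while the switch happens.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of switching between two card tables. All names are hypothetical. */
final class CardTableSwitchSketch {
    static final int CARD_SHIFT = 9;  // assumption: 512-byte cards
    static final byte CLEAN = 0, DIRTY = 1;

    private final AtomicReference<byte[]> primary;  // table that mutators currently mark
    private byte[] other;                           // touched only by the switching thread

    CardTableSwitchSketch(int cards) {
        primary = new AtomicReference<>(new byte[cards]);
        other = new byte[cards];
    }

    /** Mutator side: a plain card mark on the current primary table. */
    void postWrite(long fieldAddr) {
        primary.get()[(int) (fieldAddr >>> CARD_SHIFT)] = DIRTY;
    }

    /** Installs the other (assumed clean) table as primary and returns the former primary. */
    byte[] switchTables() {
        byte[] old = primary.getAndSet(other);
        other = old;
        return old;
    }

    /** Processing side: counts the dirty cards in a table and cleans them for reuse. */
    static int drain(byte[] table) {
        int n = 0;
        for (int i = 0; i < table.length; i++) {
            if (table[i] == DIRTY) {
                table[i] = CLEAN;
                n++;
            }
        }
        return n;
    }
}
```

In this model the mutator path contains no fence and no queue; all coordination is concentrated in `switchTables()`.
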
Thomas Schatzl has updated the pull request with a new target base due to a
merge or a rebase. The pull request now contains 82 commits:
- Merge branch 'master' into 8342382-card-table-instead-of-dcq
- * iwalulya review
  * documentation for a few PSS members
  * rename some member variables to contain _ct and _rt suffixes in
    remembered set verification
- Merge branch 'master' into 8342382-card-table-instead-of-dcq
- * therealaph suggestion for avoiding the register aliasing in
    gen_write_ref_array_post
- * walulyai review
- * walulyai review
  * tried to remove "logged card" terminology for the current "pending card"
    one
- * aph review, fix some comment
- Merge branch 'master' into 8342382-card-table-instead-of-dcq
- Merge branch 'master' into 8342382-card-table-instead-of-dcq
- * iwalulya: remove confusing comment
- ... and 72 more: https://git.openjdk.org/jdk/compare/5efaa997...b5d22d52
-------------
Changes: https://git.openjdk.org/jdk/pull/23739/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=61
Stats: 7178 lines in 113 files changed: 2606 ins; 3588 del; 984 mod
Patch: https://git.openjdk.org/jdk/pull/23739.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739
PR: https://git.openjdk.org/jdk/pull/23739