This is a feature requiested by @RogerRiggs and @cl4es .

**Idea**

Merging multiple consecutive small stores (e.g. 8 byte stores) into larger 
stores (e.g. one long store) can lead to speedup.
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would 
try to get
speedups by using
Unsafe (e.g. `Unsafe.putLongUnaligned`), or 
ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`).
They have asked if we can do such an optimization in C2, rather than in the 
Java library code, or even user code.

This patch here supports a few simple use-cases, like these:

Merge consecutive array stores, with constants. We can combine the separate 
constants into a larger constant:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395

Merge consecutive array stores, with a variable that was split (using shifts). 
We can essentially undo the
splitting (i.e. shifting and truncation), and directly store the variable:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456

The idea is that this would allow the introduction of a very simple API, 
without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):

https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472

**Details**

This draft currently implements the optimization in an additional special IGVN 
phase:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485

We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see 
`Compile::gather_nodes_for_merge_stores`).
During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all 
other optimizations) of `StoreNode::Ideal`.
We essentially try to establish a chain of mergable stores:

https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806

Mergable stores must have the same Opcode (implies they have the same element 
type and hence size).
Further, mergable stores must have the same control (or be separated by only a 
RangeCheck).
Further, they must either both store constants, or adjacent segments of a 
larger value (i.e. a larger value right-shifted by a constant offset, 
see`is_con_RShift`).
Further, we must be able to prove that the stores reference adjacent memory 
(i.e. the address is shifted by the element size).
For two mergable stores (one `use`, one `def`), the def-store should not have 
any other use than the use-store,
so that we only merge stores that are in the same basic block. With the only 
exceptions of merging through
RangeChecks (which can have MergeMem nodes on the memory path, and hence such 
MergeMem are allowed
as secondary uses of the def-node).

I made this optimization a new phase, and placed it after loop-opts for these 
reasons:
- I do not want it to interfere with loop-opts, in particular with the 
autovectorizer (SuperWord).
- I don't want it to interfere with any other memory optimizations, this should 
just improve things if nothing else worked.
- Checking if two memory addresses are adjacent is much simpler after 
loop-opts, when some of the `CastII` nodes have disappeared, and the address 
expression becomes much simpler (in particular, the constants from the integer 
index can only sink through the CastI2L after loop-opts). We could do adjacency 
checking with a more complicated algorithm, such as `VPointer` in the current 
autovectorizer.

**Performance**

TODO

**Testing**

Tier 1-6 + stress-testing.
Performance testing: no significant difference.

-------------

Commit messages:
 - fix flag initialization issue
 - merge manually
 - Merge branch 'master' into JDK-8318446
 - make sure the unsafe accesses are always recognized
 - swap number and type
 - add short and int examples to bench
 - rename array
 - add LE-API to bench
 - improved the benchmark
 - fix bad if problem
 - ... and 20 more: https://git.openjdk.org/jdk/compare/7e0a4ed6...adca9e22

Changes: https://git.openjdk.org/jdk/pull/16245/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8318446
  Stats: 2007 lines in 11 files changed: 2006 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/16245.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245

PR: https://git.openjdk.org/jdk/pull/16245

Reply via email to