The current subword (`byte`/`short`) gather load API implementation is not
well-suited for platforms that provide native vector instructions for these
operations. As **discussed in PR [1]**, we'd like to re-implement these APIs
with a **unified cross-platform** solution.
The main idea is to re-implement the API at Java-level, by performing multiple
sub-gather operations. Each sub-gather operation loads a portion of elements
using a specific index vector by calling the HotSpot intrinsic API. The partial
results are then merged using vector `slice` and `or` operations. This design
simplifies the VM compiler intrinsic implementation and better aligns with the
Vector API design principles.
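The split-and-merge idea above can be modeled with plain arrays. This is only a scalar sketch and not the code in this patch: `subGather`, `shiftUp` and `or` are hypothetical stand-ins for the intrinsified sub-gather, the vector `slice` and the vector `or`, and the assumption that each partial result sits in the low lanes of a zero-filled vector is mine.

```java
import java.util.Arrays;

class SubGatherSketch {
    // Hypothetical stand-in for one intrinsified sub-gather: loads partLen
    // elements into the low lanes of a zero-filled vector of vlen lanes.
    static byte[] subGather(byte[] src, int base, int[] indexMap,
                            int mapOffset, int partLen, int vlen) {
        byte[] part = new byte[vlen];
        for (int i = 0; i < partLen; i++) {
            part[i] = src[base + indexMap[mapOffset + i]];
        }
        return part;
    }

    // Scalar model of the slice step: move lanes up by origin, zero-fill below.
    static byte[] shiftUp(byte[] v, int origin) {
        byte[] r = new byte[v.length];
        for (int i = 0; i + origin < v.length; i++) {
            r[i + origin] = v[i];
        }
        return r;
    }

    // Scalar model of a lane-wise vector or.
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Gather vlen bytes using `parts` sub-gathers, then merge the partial results.
    static byte[] gather(byte[] src, int base, int[] indexMap, int vlen, int parts) {
        int partLen = vlen / parts;
        byte[] result = new byte[vlen];
        for (int p = 0; p < parts; p++) {
            byte[] part = subGather(src, base, indexMap, p * partLen, partLen, vlen);
            result = or(result, shiftUp(part, p * partLen));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        int[] indexMap = new int[16];
        for (int i = 0; i < indexMap.length; i++) indexMap[i] = (i * 7) % 40;
        System.out.println(Arrays.toString(gather(src, 0, indexMap, 16, 4)));
    }
}
```

Because the unused lanes of every partial result are zero, the `or` merge never disturbs lanes written by an earlier sub-gather.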
Key changes:
1. Re-implement the subword gather load API at the Java level. The HotSpot
intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector
index parameters from four (vix1-vix4) to a single parameter.
2. Adjust the compiler intrinsic implementation to support the new Java API,
including updates to the x86 backend implementation.
The performance impact varies across scenarios on x86. I tested with different
AVX levels on an x86 machine that supports AVX512. To achieve optimal
performance, I also **applied PR [2]**, which improves the performance of the
**`slice()`** API on x86. The performance changes are summarized in the table
below, where:
- "non masked" means the gather operation is not the masked gather API.
- "masked" means the gather operation is the masked gather API.
- "1 gather cases" means the gather API is implemented with a single gather
operation, e.g. loading a `Short128Vector` with `MaxVectorSize=256`.
- "2 gather cases" means the gather API is implemented with two sub-gather
operations, e.g. loading a `Short256Vector` with `MaxVectorSize=256`.
- "4 gather cases" means the gather API is implemented with four sub-gather
operations, e.g. loading a `Byte256Vector` with `MaxVectorSize=256`.
- "Un-intrinsified" means the gather operation cannot be intrinsified by
HotSpot, e.g. loading a `Byte512Vector` with `MaxVectorSize=256`.
The significant performance uplift for the un-intrinsified cases comes from the
Java-level changes, which remove the vector index generation and range checks
for them.
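The 1, 2 and 4 gather examples above are consistent with a simple count: the number of `int` indices needed for all lanes, divided by the number of `int` lanes that fit in one maximum-size vector. The sketch below reproduces that arithmetic. It assumes 32-bit indices and reads the `256` in the examples as a width in bits. The class and method names are hypothetical, and treating every species wider than the maximum vector size as un-intrinsified is my reading of the `Byte512Vector` example, not a rule stated by the patch.

```java
class GatherParts {
    static final int INDEX_BITS = 32; // assumption: int indices

    // Returns the number of sub-gathers, or 0 when this sketch treats the
    // species as un-intrinsified (vector wider than the maximum vector size).
    static int subGathers(int vectorBits, int elementBits, int maxVectorBits) {
        if (vectorBits > maxVectorBits) {
            return 0;
        }
        int lanes = vectorBits / elementBits;
        int indexLanesPerVector = maxVectorBits / INDEX_BITS;
        return Math.max(1, lanes / indexLanesPerVector);
    }

    public static void main(String[] args) {
        System.out.println("Short128: " + subGathers(128, 16, 256));
        System.out.println("Short256: " + subGathers(256, 16, 256));
        System.out.println("Byte256:  " + subGathers(256, 8, 256));
        System.out.println("Byte512:  " + subGathers(512, 8, 256));
    }
}
```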
----------------------------------------------------------------------------
               |          UseAVX=3           |          UseAVX=2           |
               |-----------------------------|-----------------------------|
               |  non masked  |    masked    |  non masked  |    masked    |
---------------|--------------|--------------|--------------|--------------|
1 gather cases | 0.99 ~ 1.06x | 0.94 ~ 1.11x | 0.94 ~ 1.00x | 0.99 ~ 1.11x |
---------------|--------------|--------------|--------------|--------------|
2 gather cases | 0.94 ~ 1.01x | 0.88 ~ 0.97x | 0.8 ~ 1.13x  | 0.82 ~ 0.93x |
---------------|--------------|--------------|--------------|--------------|
4 gather cases | 0.92 ~ 0.95x | 0.84 ~ 0.88x | 0.98 ~ 1.06x | 0.81 ~ 0.92x |
---------------|--------------|--------------|--------------|--------------|
Un-intrinsified|     N/A      |     N/A      | 1.48 ~ 1.65x | 1.1 ~ 1.53x  |
---------------|--------------|--------------|--------------|--------------|
There are performance regressions, especially for APIs that need splitting and
merging operations, and they are more significant for the masked cases. They
are caused by the additional vector/mask slice and merge operations in the
Java code, which I think are unavoidable.
Note-1: Unlike the previous implementation, this patch **disables** the gather
API intrinsification for **64-bit species** when **`MaxVectorSize=8`**, because
it would generate a 16-bit vector, which is smaller than the minimum supported
vector size of 32 bits. This limitation can be addressed in the future by
adjusting the IR pattern. However, that requires significant refactoring of the
x86 backend implementation, which is challenging for me, so I'd like to leave it
as separate work. Any help from the x86 experts would be much appreciated.
Note-2: This patch only includes the refactoring of the Java API code and the
HotSpot x86 backend implementation. A follow-up patch will add support for the
AArch64 SVE backend.
[1] https://github.com/openjdk/jdk/pull/26236
[2] https://github.com/openjdk/jdk/pull/24104
-------------
Commit messages:
- 8372136: VectorAPI: Refactor subword gather load API java implementation
Changes: https://git.openjdk.org/jdk/pull/28520/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28520&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8372136
Stats: 558 lines in 13 files changed: 383 ins; 78 del; 97 mod
Patch: https://git.openjdk.org/jdk/pull/28520.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28520/head:pull/28520
PR: https://git.openjdk.org/jdk/pull/28520