The current subword (`byte`/`short`) gather load API implementation is not 
well-suited for platforms that provide native vector instructions for these 
operations. As **discussed in PR [1]**, we'd like to re-implement these APIs 
with a **unified cross-platform** solution.

The main idea is to re-implement the API at the Java level by performing multiple 
sub-gather operations. Each sub-gather operation loads a portion of elements 
using a specific index vector by calling the HotSpot intrinsic API. The partial 
results are then merged using vector `slice` and `or` operations. This design 
simplifies the VM compiler intrinsic implementation and better aligns with the 
Vector API design principles.
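
The split-and-merge scheme described above can be illustrated with a scalar 
model. This is only a sketch using plain arrays, not the code in the patch: the 
class and method names (`SubwordGatherSketch`, `subGather`, `shiftUp`, `or`) are 
hypothetical, `subGather` stands in for the single-index-vector intrinsic call, 
and `shiftUp` stands in for the vector `slice` step that moves a partial result 
into its final lane positions.

```java
import java.util.Arrays;

final class SubwordGatherSketch {
    // Models one sub-gather: loads `part` elements through idx[idxOff .. idxOff+part)
    // into the LOW lanes of a full-width result; all other lanes stay zero.
    static byte[] subGather(byte[] src, int base, int[] idx, int idxOff, int part, int lanes) {
        byte[] v = new byte[lanes];
        for (int i = 0; i < part; i++) {
            v[i] = src[base + idx[idxOff + i]];
        }
        return v;
    }

    // Stand-in for the slice step: moves lanes up by `shift` positions, zero-filling the low lanes.
    static byte[] shiftUp(byte[] v, int shift) {
        byte[] r = new byte[v.length];
        System.arraycopy(v, 0, r, shift, v.length - shift);
        return r;
    }

    // Lane-wise bitwise OR, used to merge partial results.
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Full gather built from `parts` sub-gathers (assumes idx.length is divisible by parts).
    static byte[] gather(byte[] src, int base, int[] idx, int parts) {
        int lanes = idx.length;
        int part = lanes / parts;
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts; p++) {
            byte[] partial = subGather(src, base, idx, p * part, part, lanes);
            result = or(result, shiftUp(partial, p * part));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        int[] idx = {7, 0, 3, 3, 1, 6, 2, 5};
        // Eight byte lanes assembled from four sub-gathers of two lanes each.
        System.out.println(Arrays.toString(gather(src, 1, idx, 4)));
    }
}
```

Because each partial result is zero outside its own lanes after the shift, the 
`or` merge cannot corrupt lanes written by other sub-gathers.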

Key changes:
1. Re-implement the subword gather load API at the Java level. The HotSpot 
intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector 
index parameters from four (vix1-vix4) to a single parameter.
2. Adjust the compiler intrinsic implementation to support the new Java API, 
including updates to the x86 backend implementation.

The performance impact varies across scenarios on X86. I tested with different 
AVX levels on an X86 machine that supports AVX512. To achieve optimal 
performance, I also **applied PR [2]**, which improves the performance of the 
**`slice()`** API on X86. The performance gains are summarized below, where:

- "non masked" means the non-masked gather API is used.
- "masked" means the masked gather API is used.
- "1 gather cases" means the gather API is implemented with a single gather 
operation, e.g. loading a `Short128Vector` with `MaxVectorSize=256`.
- "2 gather cases" means the gather API is implemented with two sub-gather 
operations, e.g. loading a `Short256Vector` with `MaxVectorSize=256`.
- "4 gather cases" means the gather API is implemented with four sub-gather 
operations, e.g. loading a `Byte256Vector` with `MaxVectorSize=256`.
- "Un-intrinsified" means HotSpot cannot intrinsify the gather operation, e.g. 
loading a `Byte512Vector` with `MaxVectorSize=256`. The significant performance 
uplift comes from the Java-level changes, which remove the vector index 
generation and range checks for such cases.
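
The examples in the list above are consistent with a simple rule: each 
sub-gather consumes one vector of 32-bit `int` indexes, so the number of 
sub-gathers is the lane count divided by the number of `int` lanes that fit in 
the maximum vector size. The sketch below only encodes that inference. The 
class name, the method name, and the limit of four parts are assumptions drawn 
from the listed cases, not taken from the patch, and sizes are given in bits to 
match the examples.

```java
final class GatherPartsSketch {
    static final int INDEX_BITS = 32; // one int index per loaded element
    static final int MAX_PARTS = 4;   // assumption: inferred from the "4 gather cases" upper bound

    // Returns the number of sub-gathers, or -1 when the load would not be intrinsified
    // under the assumptions stated above.
    static int subGathers(int vectorBits, int elementBits, int maxVectorBits) {
        int lanes = vectorBits / elementBits;
        int lanesPerGather = maxVectorBits / INDEX_BITS;
        if (lanesPerGather == 0) {
            return -1;
        }
        // Round up, so a vector smaller than one index vector still needs one gather.
        int parts = (lanes + lanesPerGather - 1) / lanesPerGather;
        return parts <= MAX_PARTS ? parts : -1;
    }

    public static void main(String[] args) {
        System.out.println(subGathers(128, 16, 256)); // Short128Vector
        System.out.println(subGathers(256, 16, 256)); // Short256Vector
        System.out.println(subGathers(256, 8, 256));  // Byte256Vector
        System.out.println(subGathers(512, 8, 256));  // Byte512Vector
    }
}
```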


----------------------------------------------------------------------------
               |          UseAVX=3           |          UseAVX=2           |
               |-----------------------------|-----------------------------|
               |  non masked  |  masked      | non masked   |    masked    |
               |--------------|--------------|--------------|--------------|
1 gather cases | 0.99 ~ 1.06x | 0.94 ~ 1.11x | 0.94 ~ 1.00x | 0.99 ~ 1.11x |
---------------|--------------|--------------|--------------|--------------|
2 gather cases | 0.94 ~ 1.01x | 0.88 ~ 0.97x |  0.8 ~ 1.13x | 0.82 ~ 0.93x |
---------------|--------------|--------------|--------------|--------------|
4 gather cases | 0.92 ~ 0.95x | 0.84 ~ 0.88x | 0.98 ~ 1.06x | 0.81 ~ 0.92x |
---------------|--------------|--------------|--------------|--------------|
Un-intrinsified|     N/A      |     N/A      | 1.48 ~ 1.65x |  1.1 ~ 1.53x |
---------------|--------------|--------------|--------------|--------------|


There are performance regressions, especially for APIs that need splitting and 
merging operations, and the regressions are more significant for the masked 
cases. They are caused by the additional vector/mask slice and merging 
operations in Java code, which I think are unavoidable.

Note-1: Unlike the previous implementation, this patch **disables** the gather 
API intrinsification for **64-bit species** when **`MaxVectorSize=8`**, because 
it would generate a 16-bit vector, which is smaller than the minimum supported 
vector size of 32 bits. This limitation can be addressed in the future by 
adjusting the IR pattern. However, that requires significant refactoring of the 
X86 backend implementation, which is challenging for me, so I'd like to leave 
it as separate work. Any help from X86 experts would be much appreciated.

Note-2: This patch only includes the refactoring of the Java API code and the 
HotSpot x86 backend implementation. A follow-up patch will add support for the 
AArch64 SVE backend.

[1] https://github.com/openjdk/jdk/pull/26236
[2] https://github.com/openjdk/jdk/pull/24104

-------------

Commit messages:
 - 8372136: VectorAPI: Refactor subword gather load API java implementation

Changes: https://git.openjdk.org/jdk/pull/28520/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28520&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8372136
  Stats: 558 lines in 13 files changed: 383 ins; 78 del; 97 mod
  Patch: https://git.openjdk.org/jdk/pull/28520.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28520/head:pull/28520

PR: https://git.openjdk.org/jdk/pull/28520
