On Thu, 27 Nov 2025 01:42:07 GMT, Xiaohong Gong <[email protected]> wrote:

> The current subword (`byte`/`short`) gather load API implementation is not 
> well-suited for platforms that provide native vector instructions for these 
> operations. As **discussed in PR [1]**, we'd like to re-implement these APIs 
> with a **unified cross-platform** solution.
> 
> The main idea is to re-implement the API at Java-level, by performing 
> multiple sub-gather operations. Each sub-gather operation loads a portion of 
> elements using a specific index vector by calling the HotSpot intrinsic API. 
> The partial results are then merged using vector `slice` and `or` operations. 
> This design simplifies the VM compiler intrinsic implementation and better 
> aligns with the Vector API design principles.
> 
> Key changes:
> 1. Re-implement the subword gather load API at the Java level. The HotSpot 
> intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector 
> index parameters from four (vix1-vix4) to a single parameter.
> 2. Adjust the compiler intrinsic implementation to support the new Java API, 
> including updates to the x86 backend implementation.
> 
> The performance impact varies across different scenarios on X86. I tested the 
> performance with different AVX levels on an X86 machine that supports AVX512. 
> To achieve optimal performance, I also **applied PR [2]**, which improves the 
> performance of the **`slice()`** API on X86. The performance gains are 
> summarized below, where:
> 
> - "non masked" means the gather operation is not the masked gather API.
> - "masked" means the gather operation is the masked gather API.
> - "1 gather cases" means the gather API is implemented with a single gather 
> operation. E.g. Load `Short128Vector` with `MaxVectorSize=256`.
> - "2 gather cases" means the gather API is implemented with 2 parts of gather 
> operations. E.g. Load `Short256Vector` with `MaxVectorSize=256`.
> - "4 gather cases" means the gather API is implemented with 4 parts of gather 
> operations. E.g. Load `Byte256Vector` with `MaxVectorSize=256`.
> - "Un-intrinsified" means the gather operation cannot be intrinsified by 
> HotSpot. E.g. Load `Byte512Vector` with `MaxVectorSize=256`. 
> The significant performance uplift comes from the Java-level changes, which 
> remove the vector index generation and range checks for such cases.
> 
> 
> ----------------------------------------------------------------------------
>                |          UseAVX=3           |          UseAVX=2           |
>                |-----------------------------|-----------------------------|
>                |  non maske...

Following are the performance changes of the JMH benchmark 
`org.openjdk.bench.jdk.incubator.vector.GatherOperationsBenchmark` with 
**-XX:UseAVX=2** and **-XX:UseAVX=3**, respectively.

<img width="1768" height="370" alt="image" 
src="https://github.com/user-attachments/assets/35222e36-ff51-47b3-8012-9bd0aa53770e"
 />



<img width="1770" height="347" alt="image" 
src="https://github.com/user-attachments/assets/ac7d4e90-a52c-45a4-9dda-76bc0305f688"
 />


<img width="1540" height="312" alt="image" 
src="https://github.com/user-attachments/assets/33c57e08-320d-41ae-9c62-599dca9d8d34"
 />



<img width="1546" height="354" alt="image" 
src="https://github.com/user-attachments/assets/4365a079-f49d-4812-9fef-ad65f6aa7872"
 />
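
For readers who want a concrete picture of the "multiple sub-gather operations merged with `slice` and `or`" idea from the quoted description, here is a minimal scalar sketch in plain Java. It is only a model under stated assumptions, not the code in this PR: plain `byte[]` arrays stand in for vectors, and the names `subGather`, `shiftUp`, `or` and `gather` are hypothetical helpers invented for illustration. In particular, `shiftUp` only approximates what a lane-moving `slice` would do when positioning a partial result, and the way the parts are split evenly is an assumption.

```java
import java.util.Arrays;

// Scalar model only: byte[] arrays stand in for vectors; all helper names are hypothetical.
class SubwordGatherSketch {

    // Models one sub-gather: loads `count` elements selected by
    // indexMap[mapOffset .. mapOffset + count) into the LOW lanes of a
    // full-width result. The remaining lanes stay zero.
    static byte[] subGather(byte[] src, int base, int[] indexMap,
                            int mapOffset, int count, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < count; i++) {
            part[i] = src[base + indexMap[mapOffset + i]];
        }
        return part;
    }

    // Models moving a partial result up by `origin` lanes, zero-filling the
    // vacated low lanes (an assumed stand-in for a slice-style lane shift).
    static byte[] shiftUp(byte[] v, int origin) {
        byte[] r = new byte[v.length];
        System.arraycopy(v, 0, r, origin, v.length - origin);
        return r;
    }

    // Lane-wise bitwise OR of two equally sized "vectors".
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Full gather built from `parts` sub-gathers. Each part fills a disjoint
    // lane range, so OR-ing the shifted parts reassembles the whole result.
    static byte[] gather(byte[] src, int base, int[] indexMap, int lanes, int parts) {
        int perPart = lanes / parts; // assumes lanes is divisible by parts
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts; p++) {
            byte[] part = subGather(src, base, indexMap, p * perPart, perPart, lanes);
            result = or(result, shiftUp(part, p * perPart));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] indexMap = {7, 0, 3, 3, 1, 6, 2, 5};
        // 8 lanes assembled from 4 sub-gathers of 2 elements each.
        System.out.println(Arrays.toString(gather(src, 0, indexMap, 8, 4)));
        // prints [80, 10, 40, 40, 20, 70, 30, 60]
    }
}
```

Because every partial result is zero outside its own lane range, the `or` step never mixes bits from different parts, which is why a plain OR is enough to merge them in this model.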

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28520#issuecomment-3583900604
