Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction

Jatin Bhateja Fri, 25 Jul 2025 06:53:04 -0700

On Tue, 18 Mar 2025 20:51:46 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


> Patch optimizes the Vector.slice operation with a constant index using the 
> x86 ALIGNR instruction.
> It also adds a new hybrid call generator to facilitate lazy intrinsification 
> or else perform procedural inlining to prevent call overhead and boxing 
> penalties in case the fallback implementation expects to operate over 
> vectors. The existing vector API-based slice implementation is now the 
> fallback code that gets inlined in case intrinsification fails.
> 
> The idea here is to add infrastructure support to enable intrinsification of 
> the fast path for selected vector APIs, or else enable inlining of the 
> fallback implementation if it is based on vector APIs. Existing call 
> generators like PredictedCallGenerator, used to handle bi-morphic inlining, 
> already make use of multiple call generators to handle hit/miss scenarios 
> for a particular receiver type. The newly added hybrid call generator is 
> lazy and is called during incremental inlining optimization. It also 
> relieves the inline expander of handling slow paths, which can easily be 
> implemented on the library side (Java).
> 
> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
> 
> Performance numbers:
> 
> 
> System : 13th Gen Intel(R) Core(TM) i3-1315U
> 
> Baseline:
> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
> VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024 ...
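
For readers less familiar with the operation being optimized, here is a
minimal scalar sketch of why a constant-index slice maps onto a byte-align
instruction. It is not code from the patch: the class and method names
(SliceModel, slice, palignr128) are hypothetical, and it only models byte
lanes of a single 128-bit vector. The wider 256-bit VPALIGNR form works on
each 128-bit lane independently, which this sketch does not attempt to cover.

```java
import java.util.Arrays;

// Hypothetical scalar model, for illustration only (not the patch's code).
class SliceModel {
    // Models v1.slice(origin, v2): lanes origin..VLEN-1 of v1,
    // followed by lanes 0..origin-1 of v2.
    static byte[] slice(byte[] v1, byte[] v2, int origin) {
        int vlen = v1.length;
        if (v2.length != vlen || origin < 0 || origin > vlen) {
            throw new IllegalArgumentException("bad operands or origin");
        }
        byte[] result = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int src = i + origin;
            result[i] = src < vlen ? v1[src] : v2[src - vlen];
        }
        return result;
    }

    // Models 128-bit PALIGNR: concatenate high:low into 32 bytes,
    // shift right by imm8 bytes, keep the low 16 bytes.
    static byte[] palignr128(byte[] high, byte[] low, int imm8) {
        byte[] concat = new byte[32];
        System.arraycopy(low, 0, concat, 0, 16);
        System.arraycopy(high, 0, concat, 16, 16);
        byte[] result = new byte[16];
        for (int i = 0; i < 16; i++) {
            int src = i + imm8;
            result[i] = src < 32 ? concat[src] : 0; // bytes shifted in from beyond the top are zero
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] v1 = new byte[16];
        byte[] v2 = new byte[16];
        for (int i = 0; i < 16; i++) {
            v1[i] = (byte) i;        // lanes 0..15
            v2[i] = (byte) (16 + i); // lanes 16..31
        }
        // Prints lanes 4 through 19: the tail of v1 followed by the head of v2.
        System.out.println(Arrays.toString(slice(v1, v2, 4)));
        // Same result via the align model, with v2 as the high half.
        System.out.println(Arrays.toString(palignr128(v2, v1, 4)));
    }
}
```

In this model slice(v1, v2, origin) equals palignr128(v2, v1, origin) for
every origin from 0 to 16, which is the correspondence a constant origin
lets the compiler exploit with an immediate operand.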

Performance after AVX2 backend modifications


Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  51644.530          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  48171.079          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9662.306          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2  14358.347          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2  14619.920          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6675.824          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    818.911          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4778.321          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1612.264          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  35961.146          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  39072.170          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2  11209.685          ops/ms
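
To compare the two tables row by row, a small helper along these lines can
compute the throughput ratio (patched score divided by baseline score). The
class name SliceSpeedup is hypothetical; the scores are copied from the
tables above. The baseline shortVectorSliceWithVariableIndex row is left out
because its score is truncated in the quoted text. With Cnt = 2 and no error
column reported, the ratios should be read as indicative only.

```java
// Hypothetical helper, for illustration only: throughput ratio per benchmark.
class SliceSpeedup {
    // Ratio above 1.0 means the patched build has higher throughput.
    static double speedup(double baselineOpsPerMs, double patchedOpsPerMs) {
        return patchedOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        String[] names = {
            "byteVectorSliceWithConstantIndex1", "byteVectorSliceWithConstantIndex2",
            "byteVectorSliceWithVariableIndex",
            "intVectorSliceWithConstantIndex1", "intVectorSliceWithConstantIndex2",
            "intVectorSliceWithVariableIndex",
            "longVectorSliceWithConstantIndex1", "longVectorSliceWithConstantIndex2",
            "longVectorSliceWithVariableIndex",
            "shortVectorSliceWithConstantIndex1", "shortVectorSliceWithConstantIndex2"
        };
        // Scores in ops/ms, in the same order as the names above.
        double[] baseline = {
            9444.444, 10009.319, 9081.926,
            6085.825, 6505.378, 6204.489,
            1651.334, 1642.784, 1474.808,
            10399.394, 10502.894
        };
        double[] patched = {
            51644.530, 48171.079, 9662.306,
            14358.347, 14619.920, 6675.824,
            818.911, 4778.321, 1612.264,
            35961.146, 39072.170
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-36s %.2fx%n", names[i], speedup(baseline[i], patched[i]));
        }
    }
}
```

Any row whose ratio comes out below 1.0 is slower than the baseline run and
is worth re-measuring with more iterations before drawing conclusions.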

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3116214722
