On Fri, 27 Feb 2026 04:47:34 GMT, Jatin Bhateja <[email protected]> wrote:

>> Patch optimizes the Vector.slice operation with constant index using x86 ALIGNR 
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of 
>> fast path for selected vector APIs, else enable inlining of fall-back 
>> implementation if it's based on vector APIs. Existing call generators like 
>> PredictedCallGenerator, used to handle bi-morphic inlining, already make use 
>> of multiple call generators to handle hit/miss scenarios for a particular 
>> receiver type. The newly added hybrid call generator is lazy and called 
>> during incremental inlining optimization. It also relieves the inline 
>> expander of handling slow paths, which can easily be implemented library side 
>> (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review resolutions

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7002:

> 7000:      vpalignr(dst, xtmp, src1, origin, Assembler::AVX_256bit);
> 7001:    } else {
> 7002:      assert(origin > 16 && origin <= 32, "");

If the slice amount is exactly 32 bytes, the result is simply src2, so neither 
vperm2i128 nor vpalignr is needed.
Should this be (origin < 32)?
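
To illustrate the boundary case, here is a minimal scalar sketch of what a 
32-byte slice is expected to compute. The names `Vec32` and `slice_ref` are 
hypothetical and not part of the patch; the model assumes src1 supplies the low 
32 bytes and src2 the high 32 bytes of the 64-byte concatenation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 256-bit register holding 32 byte lanes.
using Vec32 = std::array<uint8_t, 32>;

// Scalar model of a 32-byte slice: return 32 consecutive bytes starting at
// byte `origin` of the concatenation {src2 : src1}, with src1 in the low half.
Vec32 slice_ref(const Vec32& src1, const Vec32& src2, std::size_t origin) {
  assert(origin <= 32);
  Vec32 res{};
  for (std::size_t i = 0; i < 32; i++) {
    std::size_t idx = origin + i;
    res[i] = idx < 32 ? src1[idx] : src2[idx - 32];
  }
  return res;
}
```

With this model, `slice_ref(src1, src2, 32)` returns src2 unchanged and 
`slice_ref(src1, src2, 0)` returns src1, so both endpoints are plain copies.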

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7012:

> 7010:      // Result lanes
> 7011:      // res[127:0]   = {src2[127:0]   , src1[255:127]}  >> SHIFT
> 7012:      // res[255:128] = {src2[255:128] , src2[127:0]}    >> SHIFT

Should be (bit 128 rather than 127 as the lower bound of the src1 range, and a 
shift of SHIFT - 16 rather than SHIFT):
     // res[127:0]   = {src2[127:0]   , src1[255:128]}  >> (SHIFT - 16)
     // res[255:128] = {src2[255:128] , src2[127:0]}    >> (SHIFT - 16)
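
The corrected lane comments can be checked with a small scalar model. This is 
only a sketch: `vperm2i128_0x21`, `vpalignr256`, `slice_32B_high` and 
`slice_ref` are hypothetical helper names, the permute is modelled for 
imm8 == 0x21 only, and the align model is restricted to byte shifts of at most 
16.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec32 = std::array<uint8_t, 32>;  // hypothetical 256-bit register model

// Reference: 32 bytes starting at `origin` of {src2 : src1} (src1 is low).
Vec32 slice_ref(const Vec32& src1, const Vec32& src2, std::size_t origin) {
  assert(origin <= 32);
  Vec32 res{};
  for (std::size_t i = 0; i < 32; i++) {
    std::size_t idx = origin + i;
    res[i] = idx < 32 ? src1[idx] : src2[idx - 32];
  }
  return res;
}

// VPERM2I128 with imm8 == 0x21: low lane = a[255:128], high lane = b[127:0].
Vec32 vperm2i128_0x21(const Vec32& a, const Vec32& b) {
  Vec32 r{};
  for (std::size_t i = 0; i < 16; i++) {
    r[i] = a[16 + i];
    r[16 + i] = b[i];
  }
  return r;
}

// 256-bit VPALIGNR: in each 128-bit lane, form {hi_lane : lo_lane}, shift
// right by `imm` bytes and keep the low 16 bytes.
Vec32 vpalignr256(const Vec32& hi, const Vec32& lo, std::size_t imm) {
  assert(imm <= 16);
  Vec32 r{};
  for (std::size_t lane = 0; lane < 2; lane++) {
    for (std::size_t i = 0; i < 16; i++) {
      std::size_t idx = imm + i;
      r[lane * 16 + i] = idx < 16 ? lo[lane * 16 + idx]
                                  : hi[lane * 16 + idx - 16];
    }
  }
  return r;
}

// Mirrors the two-instruction sequence for 16 < origin < 32.
Vec32 slice_32B_high(const Vec32& src1, const Vec32& src2, std::size_t origin) {
  assert(origin > 16 && origin < 32);
  Vec32 xtmp = vperm2i128_0x21(src1, src2);
  return vpalignr256(src2, xtmp, origin - 16);
}
```

Under these assumptions, `slice_32B_high` agrees with `slice_ref` for every 
origin from 17 to 31, which is consistent with a per-lane shift of SHIFT - 16 
applied to {src2[127:0], src1[255:128]} and {src2[255:128], src2[127:0]}.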

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7014:

> 7012:      // res[255:128] = {src2[255:128] , src2[127:0]}    >> SHIFT
> 7013:      vperm2i128(xtmp, src1, src2, 0x21);
> 7014:      vpalignr(dst, src2, xtmp, origin - 16, Assembler::AVX_256bit);

vector_slice_32B_op() could be implemented without using xtmp.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7026:

> 7024:     // src1 = [v1 v2 v3 v4] and src2 = [v5 v6 v7 v8]
> 7025:     // where v* represents 128 bit wide vector lanes.
> 7026:     // When SHIFT <= 16 result will be sliced out from src1 and

This should say SHIFT < 16.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7048:

> 7046:     //                  |_____________|
> 7047:      evalignd(xtmp, src2, src1, 4, vlen_enc);
> 7048:      vpalignr(dst, xtmp, src1, origin, vlen_enc);

This could be implemented without using xtmp.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7090:

> 7088:      assert(origin > 48 && origin < 64, "");
> 7089:      evalignd(xtmp, src2, src1, 12, vlen_enc);
> 7090:      vpalignr(dst, src2, xtmp, origin - 48, vlen_enc);

This could be implemented without using xtmp.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034724935
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034736426
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042052098
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034740247
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042046097
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042047184
