On Fri, 27 Feb 2026 04:47:34 GMT, Jatin Bhateja <[email protected]> wrote:
>> Patch optimizes Vector.slice operation with constant index using x86 ALIGNR
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification
>> or else perform procedural inlining to prevent call overhead and boxing
>> penalties in case the fallback implementation expects to operate over
>> vectors. The existing vector API-based slice implementation is now the
>> fallback code that gets inlined in case intrinsification fails.
>>
>> Idea here is to add infrastructure support to enable intrinsification of
>> fast path for selected vector APIs, else enable inlining of fall-back
>> implementation if it's based on vector APIs. Existing call generators like
>> PredictedCallGenerator, used to handle bi-morphic inlining, already make use
>> of multiple call generators to handle hit/miss scenarios for a particular
>> receiver type. The newly added hybrid call generator is lazy and is called
>> during incremental inlining optimization. It also relieves the inline
>> expander of handling slow paths, which can easily be implemented on the
>> library side (Java).
>>
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>>
>> Performance numbers:
>>
>>
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score  Error  Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444         ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319         ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808         ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394         ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894         ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Review resolutions
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7002:
> 7000: vpalignr(dst, xtmp, src1, origin, Assembler::AVX_256bit);
> 7001: } else {
> 7002: assert(origin > 16 && origin <= 32, "");
If the slice amount is exactly 32 bytes, the result is simply src2, so neither
vperm2i128 nor vpalignr is needed.
Should the upper bound be strict here (origin < 32)?
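To make the boundary case concrete, here is a minimal scalar sketch of the slice semantics this comment assumes: the result is the window of length N starting at byte `origin` in the concatenation of src1 and src2. The class and method names are hypothetical and not part of the patch.

```java
import java.util.Arrays;

public class SliceReference {
    // Scalar model (an assumption, not patch code):
    // res[i] = (src1 ++ src2)[origin + i] for a byte-granular origin in [0, N].
    static byte[] slice(byte[] src1, byte[] src2, int origin) {
        int n = src1.length;
        if (src2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("bad operands or origin");
        }
        byte[] res = new byte[n];
        for (int i = 0; i < n; i++) {
            int idx = origin + i;
            res[i] = idx < n ? src1[idx] : src2[idx - n];
        }
        return res;
    }

    public static void main(String[] args) {
        byte[] src1 = new byte[32];
        byte[] src2 = new byte[32];
        for (int i = 0; i < 32; i++) {
            src1[i] = (byte) i;
            src2[i] = (byte) (32 + i);
        }
        // origin == 32 selects exactly src2; origin == 0 selects exactly src1.
        System.out.println(Arrays.equals(slice(src1, src2, 32), src2));
        System.out.println(Arrays.equals(slice(src1, src2, 0), src1));
    }
}
```

Under this model a 32-byte origin degenerates to a plain copy of src2, which is why the permute and align pair looks unnecessary for that value.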
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7012:
> 7010: // Result lanes
> 7011: // res[127:0] = {src2[127:0] , src1[255:127]} >> SHIFT
> 7012: // res[255:128] = {src2[255:128] , src2[127:0]} >> SHIFT
Should be (the lower bound is 128, not 127, and the shift is SHIFT - 16):
// res[127:0] = {src2[127:0] , src1[255:128]} >> (SHIFT - 16)
// res[255:128] = {src2[255:128] , src2[127:0]} >> (SHIFT - 16)
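The corrected lane formulas can be cross-checked with a small scalar model of the two instructions. This is only a sketch: the helper names are hypothetical, vperm2i128 is modeled without its zeroing bits (imm bits 3 and 7), and vpalignr is modeled per 128-bit lane for immediates in 0..16.

```java
import java.util.Arrays;

public class Slice32Model {
    // Models VPERM2I128 on 32-byte operands: each 128-bit half of the result
    // is selected from a (selectors 0, 1) or b (selectors 2, 3).
    static byte[] vperm2i128(byte[] a, byte[] b, int imm) {
        byte[] r = new byte[32];
        for (int half = 0; half < 2; half++) {
            int sel = (imm >> (4 * half)) & 3;
            byte[] from = sel < 2 ? a : b;
            System.arraycopy(from, (sel & 1) * 16, r, half * 16, 16);
        }
        return r;
    }

    // Models VPALIGNR per 128-bit lane: {hi lane, lo lane} >> imm bytes,
    // keeping the low 16 bytes of each lane.
    static byte[] vpalignr(byte[] hi, byte[] lo, int imm) {
        byte[] r = new byte[hi.length];
        for (int lane = 0; lane < hi.length; lane += 16) {
            for (int j = 0; j < 16; j++) {
                int k = j + imm;
                r[lane + j] = k < 16 ? lo[lane + k] : hi[lane + k - 16];
            }
        }
        return r;
    }

    // Scalar reference: res[i] = (src1 ++ src2)[origin + i].
    static byte[] reference(byte[] src1, byte[] src2, int origin) {
        byte[] res = new byte[32];
        for (int i = 0; i < 32; i++) {
            int idx = origin + i;
            res[i] = idx < 32 ? src1[idx] : src2[idx - 32];
        }
        return res;
    }

    // Checks the quoted sequence for every origin strictly between 16 and 32.
    static boolean check() {
        byte[] src1 = new byte[32];
        byte[] src2 = new byte[32];
        for (int i = 0; i < 32; i++) {
            src1[i] = (byte) i;
            src2[i] = (byte) (32 + i);
        }
        for (int origin = 17; origin < 32; origin++) {
            byte[] xtmp = vperm2i128(src1, src2, 0x21);
            byte[] dst = vpalignr(src2, xtmp, origin - 16);
            if (!Arrays.equals(dst, reference(src1, src2, origin))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(check());
    }
}
```

In this model the low lane is built from {src2[127:0], src1[255:128]} and the high lane from {src2[255:128], src2[127:0]}, each shifted by origin - 16 bytes, which matches the comment text suggested above.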
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7014:
> 7012: // res[255:128] = {src2[255:128] , src2[127:0]} >> SHIFT
> 7013: vperm2i128(xtmp, src1, src2, 0x21);
> 7014: vpalignr(dst, src2, xtmp, origin - 16, Assembler::AVX_256bit);
vector_slice_32B_op() could be implemented without using xtmp.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7026:
> 7024: // src1 = [v1 v2 v3 v4] and src2 = [v5 v6 v7 v8]
> 7025: // where v* represents 128 bit wide vector lanes.
> 7026: // When SHIFT <= 16 result will be sliced out from src1 and
This should say SHIFT < 16 here, not SHIFT <= 16.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7048:
> 7046: // |_____________|
> 7047: evalignd(xtmp, src2, src1, 4, vlen_enc);
> 7048: vpalignr(dst, xtmp, src1, origin, vlen_enc);
This could be implemented without using xtmp.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 7090:
> 7088: assert(origin > 48 && origin < 64, "");
> 7089: evalignd(xtmp, src2, src1, 12, vlen_enc);
> 7090: vpalignr(dst, src2, xtmp, origin - 48, vlen_enc);
This could be implemented without using xtmp.
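Whatever form an xtmp-free sequence takes, it has to produce the same bytes as the current one. A scalar model of the two 64-byte sequences quoted above might help when validating a rewrite. It is a sketch only: the helper names are hypothetical, and the origin ranges (1..15 and 49..63) are assumptions inferred from the quoted assert and comments.

```java
import java.util.Arrays;

public class Slice64Model {
    // Models VALIGND on 64-byte operands: {hi, lo} >> imm dwords, low 64 bytes kept.
    static byte[] valignd(byte[] hi, byte[] lo, int imm) {
        byte[] r = new byte[64];
        for (int i = 0; i < 64; i++) {
            int idx = i + 4 * imm;
            r[i] = idx < 64 ? lo[idx] : hi[idx - 64];
        }
        return r;
    }

    // Models VPALIGNR per 128-bit lane: {hi lane, lo lane} >> imm bytes (imm in 0..16).
    static byte[] vpalignr(byte[] hi, byte[] lo, int imm) {
        byte[] r = new byte[hi.length];
        for (int lane = 0; lane < hi.length; lane += 16) {
            for (int j = 0; j < 16; j++) {
                int k = j + imm;
                r[lane + j] = k < 16 ? lo[lane + k] : hi[lane + k - 16];
            }
        }
        return r;
    }

    // Scalar reference: res[i] = (src1 ++ src2)[origin + i].
    static byte[] reference(byte[] src1, byte[] src2, int origin) {
        byte[] res = new byte[64];
        for (int i = 0; i < 64; i++) {
            int idx = origin + i;
            res[i] = idx < 64 ? src1[idx] : src2[idx - 64];
        }
        return res;
    }

    static byte[] iota(int start) {
        byte[] v = new byte[64];
        for (int i = 0; i < 64; i++) v[i] = (byte) (start + i);
        return v;
    }

    // Sequence quoted at lines 7047-7048, assumed to cover 0 < origin < 16.
    static boolean checkLow() {
        byte[] src1 = iota(0), src2 = iota(64);
        for (int origin = 1; origin < 16; origin++) {
            byte[] xtmp = valignd(src2, src1, 4);
            byte[] dst = vpalignr(xtmp, src1, origin);
            if (!Arrays.equals(dst, reference(src1, src2, origin))) return false;
        }
        return true;
    }

    // Sequence quoted at lines 7089-7090, for 48 < origin < 64.
    static boolean checkHigh() {
        byte[] src1 = iota(0), src2 = iota(64);
        for (int origin = 49; origin < 64; origin++) {
            byte[] xtmp = valignd(src2, src1, 12);
            byte[] dst = vpalignr(src2, xtmp, origin - 48);
            if (!Arrays.equals(dst, reference(src1, src2, origin))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(checkLow() && checkHigh());
    }
}
```

The model says nothing about register allocation or operand aliasing, so it only confirms the data movement, not whether a particular temporary can be dropped safely.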
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034724935
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034736426
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042052098
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3034740247
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042046097
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3042047184