Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v17]

Sandhya Viswanathan Tue, 07 Apr 2026 11:40:13 -0700

On Tue, 7 Apr 2026 08:44:42 GMT, Jatin Bhateja <[email protected]> wrote:


>> Patch optimizes Vector.slice operation with constant index using x86 ALIGNR
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification
>> of the fast path for selected vector APIs, or else enable inlining of the
>> fallback implementation if it's based on vector APIs. Existing call
>> generators like PredictedCallGenerator, used to handle bimorphic inlining,
>> already make use of multiple call generators to handle hit/miss scenarios
>> for a particular receiver type. The newly added hybrid call generator is
>> lazy and is called during incremental inlining. It also relieves the inline
>> expander of handling slow paths, which can easily be implemented on the
>> library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 20 commits:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Review resolutions
>  - Review comments resolution
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Update callGenerator.hpp copyright year
>  - ... and 10 more: https://git.openjdk.org/jdk/compare/0b803bd3...bde0c216

src/hotspot/cpu/x86/x86.ad line 3418:

> 3416:                          ((size_in_bits != 512) && !VM_Version::supports_avx512vl()))) {
> 3417:         return false;
> 3418:       }

This can be simplified to:

      if (UseAVX > 2 && !VM_Version::supports_avx512vlbw()) {
        return false;
      }

This works because the platforms supporting BW also support VL.

src/hotspot/cpu/x86/x86.ad line 25418:

> 25416: %}
> 25417: 
> 25418: instruct vector_slice_const_origin_16B_reg(vec dst, vec src1, vec src2, immI origin)

The instruct rules with the same register profile can be merged, so only 3
rules are needed overall:

1) With dst, src1, src2, origin profile
    The following three rules can be merged into 1:
        vector_slice_const_origin_16B_reg
        vector_slice_const_origin_GT16B_index16_reg
        vector_slice_const_origin_GT16B_index_multiple4_reg_evex
   With predicate:
        predicate((Matcher::vector_length_in_bytes(n) == 16) ||
                  ((n->in(2)->get_int() & 0x3) == 0));

2) With dst, src1, src2, origin and TEMP dst
    The following two rules can be merged into 1:
       vector_slice_const_origin_GT16B_reg
       vector_slice_const_origin_GT16B_index_LT16_OR_GT48_reg_evex
   With predicate:
        predicate(((n->in(2)->get_int() & 0x3) != 0) &&
                  ((Matcher::vector_length_in_bytes(n) == 32) ||
                   (Matcher::vector_length_in_bytes(n) == 64 &&
                    (n->in(2)->get_int() < 16 || n->in(2)->get_int() > 48))));

3) With dst, src1, src2, origin, xtmp with TEMP dst
    vector_slice_const_origin_GT16B_index_GT16_AND_LT48_reg_evex
   With predicate:
        predicate(((n->in(2)->get_int() & 0x3) != 0) &&
                  Matcher::vector_length_in_bytes(n) == 64 &&
                  n->in(2)->get_int() > 16 && n->in(2)->get_int() < 48);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3046588525
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3047080402
