On Tue, 7 Apr 2026 08:44:42 GMT, Jatin Bhateja <[email protected]> wrote:
>> Patch optimizes Vector.slice operation with constant index using x86 ALIGNR
>> instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification
>> or else perform procedural inlining to prevent call overhead and boxing
>> penalties in case the fallback implementation expects to operate over
>> vectors. The existing vector API-based slice implementation is now the
>> fallback code that gets inlined in case intrinsification fails.
>>
>> Idea here is to add infrastructure support to enable intrinsification of
>> fast path for selected vector APIs, else enable inlining of fall-back
>> implementation if it's based on vector APIs. Existing call generators like
>> PredictedCallGenerator, used to handle bi-morphic inlining, already make use
>> of multiple call generators to handle hit/miss scenarios for a particular
>> receiver type. The newly added hybrid call generator is lazy and called
>> during incremental inlining optimization. It also relieves the inline
>> expander to handle slow paths, which can easily be implemented library side
>> (Java).
>>
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>>
>> Performance numbers:
>>
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score  Error  Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444         ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319         ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378         ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784         ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808         ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394         ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894         ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request with a new target base due to a
> merge or a rebase. The pull request now contains 20 commits:
>
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Review resolutions
>  - Review comments resolution
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Review comments resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762
>  - Update callGenerator.hpp copyright year
>  - ... and 10 more: https://git.openjdk.org/jdk/compare/0b803bd3...bde0c216

src/hotspot/cpu/x86/x86.ad line 3418:

> 3416:         ((size_in_bits != 512) && !VM_Version::supports_avx512vl()))) {
> 3417:       return false;
> 3418:     }

This can be simplified to:

    if (UseAVX > 2 && !VM_Version::supports_avx512vlbw()) {
      return false;
    }

As the platforms supporting bw also support vl.

src/hotspot/cpu/x86/x86.ad line 25418:

> 25416: %}
> 25417:
> 25418: instruct vector_slice_const_origin_16B_reg(vec dst, vec src1, vec src2, immI origin)

The instruct rules with the same register profile can be merged, so overall
only 3 rules are needed:

1) With dst, src1, src2, origin profile

   The following three rules can be merged into 1:
     vector_slice_const_origin_16B_reg
     vector_slice_const_origin_GT16B_index16_reg
     vector_slice_const_origin_GT16B_index_multiple4_reg_evex

   With predicate:
     predicate((Matcher::vector_length_in_bytes(n) == 16) ||
               ((n->in(2)->get_int() & 0x3) == 0));

2) With dst, src1, src2, origin and TEMP dst

   The following two rules can be merged into 1:
     vector_slice_const_origin_GT16B_reg
     vector_slice_const_origin_GT16B_index_LT16_OR_GT48_reg_evex

   With predicate:
     predicate((n->in(2)->get_int() & 0x3) != 0 &&
               (Matcher::vector_length_in_bytes(n) == 32) ||
               (Matcher::vector_length_in_bytes(n) == 64 &&
                (n->in(2)->get_int() < 16 || n->in(2)->get_int() > 48)));

3) With dst, src1, src2, origin, xtmp with TEMP dst

     vector_slice_const_origin_GT16B_index_GT16_AND_LT48_reg_evex

   With predicate:
     predicate((n->in(2)->get_int() & 0x3) != 0 &&
               Matcher::vector_length_in_bytes(n) == 64 &&
               n->in(2)->get_int() > 16 && n->in(2)->get_int() < 48);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3046588525
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r3047080402
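
A rough way to sanity-check the three merged predicates suggested for the x86.ad slice rules is to model them outside HotSpot and confirm that each (vector length, origin) pair selects exactly one rule. The sketch below is standalone C++, not HotSpot or ADLC code. The names rule1, rule2, rule3, matching_rules and predicates_partition are hypothetical, `len` stands in for Matcher::vector_length_in_bytes(n) and `origin` for n->in(2)->get_int(). It assumes the origin ranges over 0..len, and that the second predicate is meant to group as "(origin & 0x3) != 0 && (len == 32 || (len == 64 && ...))"; with plain &&/|| precedence the 64-byte clause would also match origins that are multiples of 4.

```cpp
#include <cassert>

// Hypothetical model of the three merged predicates; not HotSpot code.
// len    ~ Matcher::vector_length_in_bytes(n)
// origin ~ n->in(2)->get_int()

// Rule 1: dst, src1, src2, origin.
static bool rule1(int len, int origin) {
  return len == 16 || (origin & 0x3) == 0;
}

// Rule 2: adds TEMP dst. The grouping of the || clause is an assumption.
static bool rule2(int len, int origin) {
  return (origin & 0x3) != 0 &&
         (len == 32 || (len == 64 && (origin < 16 || origin > 48)));
}

// Rule 3: adds xtmp and TEMP dst.
static bool rule3(int len, int origin) {
  return (origin & 0x3) != 0 && len == 64 && origin > 16 && origin < 48;
}

// Number of rules whose predicate accepts this (len, origin) pair.
static int matching_rules(int len, int origin) {
  return rule1(len, origin) + rule2(len, origin) + rule3(len, origin);
}

// True if every pair in the assumed input space matches exactly one rule.
static bool predicates_partition() {
  const int lens[] = {16, 32, 64};
  for (int len : lens) {
    for (int origin = 0; origin <= len; origin++) {
      if (matching_rules(len, origin) != 1) {
        return false;
      }
    }
  }
  return true;
}
```

Under these assumptions the origins 16 and 48, which rules 2 and 3 both exclude for 64-byte vectors, fall to rule 1 because they are multiples of 4.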
