On Tue, 18 Mar 2025 20:51:46 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
> Patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
> It also adds a new hybrid call generator to facilitate lazy intrinsification
> or else perform procedural inlining to prevent call overhead and boxing
> penalties in case the fallback implementation expects to operate over
> vectors. The existing vector API-based slice implementation is now the
> fallback code that gets inlined in case intrinsification fails.
>
> The idea here is to add infrastructure support to enable intrinsification of
> the fast path for selected vector APIs, else enable inlining of the fallback
> implementation if it's based on vector APIs. Existing call generators like
> PredictedCallGenerator, used to handle bi-morphic inlining, already make use
> of multiple call generators to handle hit/miss scenarios for a particular
> receiver type. The newly added hybrid call generator is lazy and called
> during incremental inlining optimization. It also relieves the inline
> expander of handling slow paths, which can easily be implemented library side
> (Java).
>
> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
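For readers less familiar with the operation being optimized, here is a minimal scalar sketch of the lane semantics of a two-vector slice, as I understand them from the Vector API documentation: lane i of the result comes from lane i + origin of the first vector, continuing into the second vector once the first is exhausted. This is the same "concatenate and shift" shape that ALIGNR provides at byte granularity. The class name `SliceModel` and the use of plain `int[]` arrays are illustrative assumptions only; this is not the code from the patch, nor the library fallback.

```java
import java.util.Arrays;

public class SliceModel {
    // Scalar reference model (hypothetical helper, not JDK code):
    // result[i] = a[i + origin] while that index is inside a,
    // otherwise it continues into b at index (i + origin - length).
    static int[] slice(int[] a, int[] b, int origin) {
        int n = a.length;
        if (b.length != n) {
            throw new IllegalArgumentException("vectors must have the same lane count");
        }
        if (origin < 0 || origin > n) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int j = i + origin;
            r[i] = (j < n) ? a[j] : b[j - n];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3, 4, 5, 6, 7};
        int[] b = {8, 9, 10, 11, 12, 13, 14, 15};
        // prints [3, 4, 5, 6, 7, 8, 9, 10]
        System.out.println(Arrays.toString(slice(a, b, 3)));
    }
}
```

When `origin` is a compile-time constant, the lane selection is fixed up front, which is presumably what lets the constant-index case map onto a single shift-style instruction while the variable-index case keeps the general path.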
>
> Performance numbers:
>
> System : 13th Gen Intel(R) Core(TM) i3-1315U
>
> Baseline:
> Benchmark                                                (size)   Mode  Cnt      Score  Error  Units
> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444         ops/ms
> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319         ops/ms
> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926         ops/ms
> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825         ops/ms
> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378         ops/ms
> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489         ops/ms
> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334         ops/ms
> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784         ops/ms
> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808         ops/ms
> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394         ops/ms
> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894         ops/ms
> VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024 ...

Performance after AVX2 backend modifications:

Benchmark                                                (size)   Mode  Cnt      Score  Error  Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  51644.530         ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  48171.079         ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9662.306         ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2  14358.347         ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2  14619.920         ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6675.824         ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    818.911         ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4778.321         ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1612.264         ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  35961.146         ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  39072.170         ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2  11209.685         ops/ms

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3116214722