On Wed, 21 Aug 2024 16:42:44 GMT, Jatin Bhateja <[email protected]> wrote:
>> Hi All,
>>
>> As per the discussion on the panama-dev mailing list[1], this patch adds
>> support for the following new two-vector permutation API.
>>
>>
>> Declaration:-
>> Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
>>
>>
>> Semantics:-
>> Using the index values stored in the lanes of "this" vector, assemble the
>> values stored in the first (v1) and second (v2) vector arguments. Thus, the
>> first and second vectors serve as a table whose elements are selected by the
>> index vector. The API is applicable to all integral and floating-point
>> types. The result of this operation is semantically equivalent to the
>> expression v1.rearrange(this.toShuffle(), v2). Values held in the index
>> vector lanes must lie within the valid two-vector index range [0, 2*VLEN),
>> else an IndexOutOfBoundsException is thrown.
>>
>> Summary of changes:
>> - Java side implementation of new selectFrom API.
>> - C2 compiler IR and inline expander changes.
>> - In the absence of a direct two-vector permutation instruction in the
>> target ISA, a lowering transformation dismantles the new IR into constituent
>> IR supported by the target platform.
>> - Optimized x86 backend implementation for AVX512 and legacy targets.
>> - Functional tests covering the new API.
>>
>> JMH micro included with this patch shows around 10-15x gain over existing
>> rearrange API :-
>> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>>
>>
>> Benchmark                                      (size)  Mode   Cnt       Score  Error  Units
>> SelectFromBenchmark.rearrangeFromByteVector      1024  thrpt    2    2041.762         ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector      2048  thrpt    2    1028.550         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector       1024  thrpt    2     962.605         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector       2048  thrpt    2     479.004         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector      1024  thrpt    2     359.758         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector      2048  thrpt    2     178.192         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector     1024  thrpt    2    1463.459         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector     2048  thrpt    2     727.556         ops/ms
>> SelectFromBenchmark.selectFromByteVector         1024  thrpt    2   33254.830         ops/ms
>> SelectFromBenchmark.selectFromByteVector         2048  thrpt    2   17313.174         ops/ms
>> SelectFromBenchmark.selectFromIntVector          1024  thrpt    2   10756.804         ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Pass explicit wrap argument to selectFrom API with default value set to
> true.
Hi @rose00 , @sviswa7 , @PaulSandoz ,
As suggested, the new selectFrom API now takes an explicit 'wrap' argument.
The performance numbers of the modified JMH micro included with the patch
follow.
Baseline:-
Benchmark                                      (size)  Mode   Cnt       Score  Error  Units
SelectFromBenchmark.rearrangeFromByteVector      4096  thrpt    2    5849.771         ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector    4096  thrpt    2     430.712         ops/ms
SelectFromBenchmark.rearrangeFromFloatVector     4096  thrpt    2     942.737         ops/ms
SelectFromBenchmark.rearrangeFromIntVector       4096  thrpt    2    1057.695         ops/ms
SelectFromBenchmark.rearrangeFromLongVector      4096  thrpt    2     616.360         ops/ms
SelectFromBenchmark.rearrangeFromShortVector     4096  thrpt    2    2146.465         ops/ms
With Patch:-
Benchmark                                      (size)  Mode   Cnt       Score  Error  Units
SelectFromBenchmark.selectFromByteVector         4096  thrpt    2    9543.775         ops/ms
SelectFromBenchmark.selectFromDoubleVector       4096  thrpt    2     558.195         ops/ms
SelectFromBenchmark.selectFromFloatVector        4096  thrpt    2    1325.059         ops/ms
SelectFromBenchmark.selectFromIntVector          4096  thrpt    2    1418.748         ops/ms
SelectFromBenchmark.selectFromLongVector         4096  thrpt    2     687.231         ops/ms
SelectFromBenchmark.selectFromShortVector        4096  thrpt    2    4782.395         ops/ms
With WIP wrap index acceleration PR#20634:
Benchmark                                      (size)  Mode   Cnt       Score  Error  Units
SelectFromBenchmark.rearrangeFromByteVector      4096  thrpt    2    7602.645         ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector    4096  thrpt    2     441.684         ops/ms
SelectFromBenchmark.rearrangeFromFloatVector     4096  thrpt    2     926.112         ops/ms
SelectFromBenchmark.rearrangeFromIntVector       4096  thrpt    2    1061.695         ops/ms
SelectFromBenchmark.rearrangeFromLongVector      4096  thrpt    2     644.058         ops/ms
SelectFromBenchmark.rearrangeFromShortVector     4096  thrpt    2    2777.735         ops/ms
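
For anyone following the semantics quoted above, here is a minimal scalar
sketch of what the two-vector selectFrom computes. It is plain Java over int
arrays, not the Vector API and not the code in the patch. The names
SelectFromModel, selectFrom and selectFromLowered are hypothetical. The meaning
of 'wrap' is an assumption here: indices are reduced modulo 2*VLEN instead of
being range-checked. The second method shows one plausible way to express the
operation with two single-table lookups and a blend, in the spirit of the
lowering mentioned in the summary; the actual IR sequence may differ.

```java
import java.util.Arrays;

/** Scalar model of the two-vector selectFrom semantics (illustration only). */
final class SelectFromModel {

    /**
     * Models index.selectFrom(v1, v2): lane i of the result is read from the
     * 2*VLEN-entry table formed by v1 followed by v2.
     * ASSUMPTION: wrap == true reduces each index modulo 2*VLEN;
     * wrap == false throws for indices outside [0, 2*VLEN).
     */
    static int[] selectFrom(int[] index, int[] v1, int[] v2, boolean wrap) {
        int vlen = v1.length;
        if (v2.length != vlen || index.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            if (wrap) {
                idx = Math.floorMod(idx, 2 * vlen);
            } else if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + idx);
            }
            // Indices [0, VLEN) select from v1, [VLEN, 2*VLEN) from v2.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    /**
     * Hypothetical decomposition for targets without a two-table permute:
     * look up the same lane in v1 and in v2, then blend on "index >= VLEN".
     * Indices are wrapped, matching selectFrom(..., true).
     */
    static int[] selectFromLowered(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = Math.floorMod(index[i], 2 * vlen);
            int lane = idx % vlen;          // single-vector index
            int fromV1 = v1[lane];          // single-table lookup in v1
            int fromV2 = v2[lane];          // single-table lookup in v2
            result[i] = idx < vlen ? fromV1 : fromV2;  // blend
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {0, 5, 3, 6};
        // prints [10, 21, 13, 22]
        System.out.println(Arrays.toString(selectFrom(index, v1, v2, false)));
    }
}
```

With VLEN = 4, index 5 picks v2[1] and index 6 picks v2[2], which is why the
second and fourth result lanes come from the second table.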
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302541724