On Thu, 2 Feb 2023 18:07:59 GMT, Jatin Bhateja <[email protected]> wrote:
>> Hi, I modified a version by using the old implementation for the tail loop
>> instead of adding the new intrinsics. The code looks like:
>>
>> public VectorMask<E> indexInRange(int offset, int limit) {
>> int vlength = length();
>> if (offset >= 0 && limit - offset >= length()) {
>> return this;
>> } else if (offset >= limit) {
>> return vectorSpecies().maskAll(false);
>> }
>>
>> Vector<E> iota = vectorSpecies().zero().addIndex(1);
>> VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
>> return this.andNot(badMask);
>> }
>>
>> And I tested the performance of the new added benchmarks with different
>> vector size on NEON/SVE and x86 avx2/avx512 architectures. The results show
>> that the performance of changed version is not better than the current
>> version, if the array size is not aligned with the vector size, especially
>> for the double/long type with larger size.
>>
>> Here are some raw data with NEON:
>>
>> Benchmark (size) Mode Cnt current
>> modified Units
>> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 2654.919
>> 2584.423 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 1830.364
>> 1802.876 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 1058.548
>> 1073.742 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 594.920
>> 593.832 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 308.678
>> 149.279 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 160.639
>> 74.579 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 1097.567
>> 1104.008 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 617.845
>> 606.886 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 315.978
>> 152.046 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 1165.279
>> 1205.486 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 633.648
>> 631.672 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 315.370
>> 154.144 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 639.840
>> 633.623 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 312.267
>> 152.788 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 163.028
>> 78.150 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 1834.318
>> 1800.318 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 1105.695
>> 1094.347 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 602.442
>> 599.827 ops/ms
>>
>>
>> And the data with SVE 256-bit vector size:
>>
>> Benchmark (size) Mode Cnt current
>> modified Units
>> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 23772.370
>> 22921.113 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 18930.388
>> 17920.910 ops/ms
>> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 13528.610
>> 13282.504 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 7850.522
>> 7975.720 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4281.749
>> 4373.926 ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 2160.001
>> 604.458 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 13594.943
>> 13306.904 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 8163.134
>> 7912.343 ops/ms
>> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 4335.529
>> 4198.555 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 22106.880
>> 20348.266 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 11711.588
>> 10958.299 ops/ms
>> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 5501.034
>> 5358.441 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 9832.578
>> 9829.398 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 4979.326
>> 4947.166 ops/ms
>> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 2269.131
>> 614.204 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 19865.866
>> 19297.628 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 14005.214
>> 13592.407 ops/ms
>> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 8766.450
>> 8531.675 ops/ms
>>
>> As a conclusion, I prefer to keep the current version. WDYT?
>
> Yes, I agree, I did collect performance data with your benchmarks at various
> AVX levels and see significant gains. Cause of performance variation b/w
> integer and floating point cases is due to below limitation.
> https://github.com/openjdk/jdk/pull/12064#discussion_r1094101761
> Which can be addressed in a separate PR.
>
> FTR here are performance numbers at UseAVX=3
>
>
> Benchmark (size) Mode Cnt Score
> Error Units
> IndexInRangeBenchmark.byteIndexInRange 1024 thrpt 2 74983.406
> ops/ms
> IndexInRangeBenchmark.byteIndexInRange 1027 thrpt 2 19156.962
> ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024 thrpt 2 11368.179
> ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027 thrpt 2 2165.207
> ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1024 thrpt 2 18736.787
> ops/ms
> IndexInRangeBenchmark.floatIndexInRange 1027 thrpt 2 3798.996
> ops/ms
> IndexInRangeBenchmark.intIndexInRange 1024 thrpt 2 18797.863
> ops/ms
> IndexInRangeBenchmark.intIndexInRange 1027 thrpt 2 5455.317
> ops/ms
> IndexInRangeBenchmark.longIndexInRange 1024 thrpt 2 11866.493
> ops/ms
> IndexInRangeBenchmark.longIndexInRange 1027 thrpt 2 2227.896
> ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1024 thrpt 2 46921.520
> ops/ms
> IndexInRangeBenchmark.shortIndexInRange 1027 thrpt 2 8532.394
> ops/ms
Thanks for the performance testing on x86 systems! Agree that a separate PR is
fine, and I will address it once this PR merged.
-------------
PR: https://git.openjdk.org/jdk/pull/12064