On Thu, 2 Feb 2023 08:15:14 GMT, Xiaohong Gong <[email protected]> wrote:
>> While you talked about Java side changes, I found another opportunity for
>> optimization in checkIndex0 implementation, in the following code snippet
>> from checkIndex0 method, indexLimit is guaranteed to be a +ve value.
>>
>>
>> int indexLimit = Math.max(0, Math.min(length - offset, vlength));
>> VectorMask<E> badMask =
>> iota.compare(GE, iota.broadcast(indexLimit));
>>
>> For float/double iota vectors, subsequent broadcast operation accepting long
>> argument checks for precision loss before broadcasting floating point value.
>>
>>
>>
>> long longToElementBits(long value) {
>> // Do the conversion, and then test it for failure.
>> float e = (float) value;
>> if ((long) e != value) {
>> throw badElementBits(value, e);
>> }
>> return toBits(e);
>> }`
>>
>>
>> This can be saved by doing a prior casting of indexLimit to floating point
>> value before calling broadcast, effectively we may need to move checkIndex0
>> to template generated abstract vector classes.
>
> Hi, I modified a version by using the old implementation for the tail loop
> instead of adding the new intrinsics. The code looks like:
>
> public VectorMask<E> indexInRange(int offset, int limit) {
> int vlength = length();
> if (offset >= 0 && limit - offset >= length()) {
> return this;
> } else if (offset >= limit) {
> return vectorSpecies().maskAll(false);
> }
>
> Vector<E> iota = vectorSpecies().zero().addIndex(1);
> VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
> return this.andNot(badMask);
> }
>
> And I tested the performance of the new added benchmarks with different
> vector size on NEON/SVE and x86 avx2/avx512 architectures. The results show
> that the performance of changed version is not better than the current
> version, if the array size is not aligned with the vector size, especially
> for the double/long type with larger size.
>
> Here are some raw data with NEON:
>
> Benchmark (size) Mode Cnt current
> modified Units
> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 2654.919
> 2584.423 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 1830.364
> 1802.876 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 1058.548
> 1073.742 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 594.920
> 593.832 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 308.678
> 149.279 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 160.639
> 74.579 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 1097.567
> 1104.008 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 617.845
> 606.886 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 315.978
> 152.046 ops/ms
> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 1165.279
> 1205.486 ops/ms
> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 633.648
> 631.672 ops/ms
> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 315.370
> 154.144 ops/ms
> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 639.840
> 633.623 ops/ms
> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 312.267
> 152.788 ops/ms
> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 163.028
> 78.150 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 1834.318
> 1800.318 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 1105.695
> 1094.347 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 602.442
> 599.827 ops/ms
>
>
> And the data with SVE 256-bit vector size:
>
> Benchmark (size) Mode Cnt current
> modified Units
> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 23772.370
> 22921.113 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 18930.388
> 17920.910 ops/ms
> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 13528.610
> 13282.504 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 7850.522
> 7975.720 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4281.749
> 4373.926 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 2160.001
> 604.458 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 13594.943
> 13306.904 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 8163.134
> 7912.343 ops/ms
> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 4335.529
> 4198.555 ops/ms
> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 22106.880
> 20348.266 ops/ms
> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 11711.588
> 10958.299 ops/ms
> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 5501.034
> 5358.441 ops/ms
> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 9832.578
> 9829.398 ops/ms
> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 4979.326
> 4947.166 ops/ms
> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 2269.131
> 614.204 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 19865.866
> 19297.628 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 14005.214
> 13592.407 ops/ms
> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 8766.450
> 8531.675 ops/ms
>
> As a conclusion, I prefer to keep the current version. WDYT?
Yes, I agree, I did collect performance data with your benchmarks at various
AVX levels and see significant gains.
-------------
PR: https://git.openjdk.org/jdk/pull/12064