On Wed, 1 Mar 2023 09:10:47 GMT, Eirik Bjorsnos <[email protected]> wrote:
>> This PR suggests we add a vectorized equalsIgnoreCase benchmark to the set
>> of benchmarks in `org.openjdk.bench.jdk.incubator.vector`. This benchmark
>> serves as an example of how vectorization can be useful also in the area of
>> text processing. It takes advantage of the fact that ASCII and Latin-1 were
>> designed to optimize case-twiddling operations.
>>
>> The code came about during the work on #12632, where vectorization was
>> deemed out of scope.
>>
>> Benchmark results:
>>
>>
>> Benchmark (size) Mode Cnt Score Error
>> Units
>> EqualsIgnoreCaseBenchmark.scalar 16 avgt 15 20.671 ± 0.718
>> ns/op
>> EqualsIgnoreCaseBenchmark.scalar 32 avgt 15 46.155 ± 3.258
>> ns/op
>> EqualsIgnoreCaseBenchmark.scalar 64 avgt 15 68.248 ± 1.767
>> ns/op
>> EqualsIgnoreCaseBenchmark.scalar 128 avgt 15 148.948 ± 0.890
>> ns/op
>> EqualsIgnoreCaseBenchmark.scalar 1024 avgt 15 1090.708 ± 7.540
>> ns/op
>> EqualsIgnoreCaseBenchmark.vectorized 16 avgt 15 21.872 ± 0.232
>> ns/op
>> EqualsIgnoreCaseBenchmark.vectorized 32 avgt 15 11.378 ± 0.097
>> ns/op
>> EqualsIgnoreCaseBenchmark.vectorized 64 avgt 15 13.703 ± 0.135
>> ns/op
>> EqualsIgnoreCaseBenchmark.vectorized 128 avgt 15 21.632 ± 0.735
>> ns/op
>> EqualsIgnoreCaseBenchmark.vectorized 1024 avgt 15 105.509 ± 7.493
>> ns/op
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional
> commit since the last revision:
>
> The equal.allTrue check early if the loop does not cover cases where some
> bytes are equal, but not all. Reverting this change.
We can narrow down this performance difference to the following observations:
- `compare(LT, (byte) 0xDF)` is faster than `compare(LE, (byte) 0xDE)`
- `compare(GT, (byte) 0XBF)` is faster than `compare(GE, (byte) 0XC0)`
- `compare(EQ, (byte) 0xD7).not()` is faster than `compare(NE, (byte) 0xD7)`
LT, GT, EQ:
Benchmark (size) Mode Cnt Score Error Units
EqualsIgnoreCaseBenchmark.vectorized 1024 avgt 15 91.796 ± 1.083 ns/op
LE, GE, NE:
Benchmark (size) Mode Cnt Score Error Units
EqualsIgnoreCaseBenchmark.vectorized 1024 avgt 15 118.898 ± 5.480 ns/op
Code:
VectorMask<Byte> asciiLetter = upperA.compare(GT, (byte)
'@').and(upperA.compare(LT, (byte) '['));
VectorMask<Byte> lat1Letter = upperA
.compare(LT, (byte) 0xDF) // <= Thorn
.and(upperA.compare(GT, (byte) 0XBF)) // >= A-grave
.and(upperA.compare(EQ, (byte) 0xD7).not()); // Excluding
multiplication
vs.
VectorMask<Byte> asciiLetter = upperA.compare(GE, (byte)
'A').and(upperA.compare(LE, (byte) 'Z'));
VectorMask<Byte> lat1Letter = upperA
.compare(LE, (byte) 0xDE) // <= Thorn
.and(upperA.compare(GE, (byte) 0XC0)) // >= A-grave
.and(upperA.compare(NE, (byte) 0xD7)); // Excluding multiplication
-------------
PR: https://git.openjdk.org/jdk/pull/12790