Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2023-01-16 Thread Claes Redestad
On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov  wrote:

>>> Also, I'd like to note that C2 auto-vectorization support is not too far 
>>> away from being able to optimize hash code computations. At some point, I 
>>> was able to achieve some promising results with modest tweaking of 
>>> SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt 
>>> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/
>> 
>> Intriguing. How far off is this - and do you think it'll be able to match 
>> the efficiency we see here with a memoized coefficient table etc?
>> 
>> If we turn this intrinsic into a stub we might also be able to reuse the 
>> optimization in other places, including from within the VM (calculating 
>> String hashCodes happen in a couple of places, including String 
>> deduplication). So I think there are still a few compelling reasons to go 
>> the manual route and continue on this path.
>
>> How far off is this ...?
> 
> Back then it looked way too constrained (tight constraints on code shapes). 
> But I considered it as a generally applicable optimization. 
> 
>>  ... do you think it'll be able to match the efficiency we see here with a 
>> memoized coefficient table etc?
> 
> Yes, it is able to build the constant table at runtime when folding 
> multiplications of constant coefficients produced during loop unrolling and 
> then packing scalars into a constant vector.
> 
> Moreover, briefly looking at the code shape, the vectorizer would produce a 
> more optimal loop shape (pre-loop would align vector accesses and would use 
> 512-bit vectors when available; vector post-loop could help as well).

I've opted to include the changes spurred by @iwanowww's comments since they 
led to a number of revisions to the intrinsified method API, and it would be 
strange to introduce an intrinsified method just to change the API drastically 
in a follow-up. Basically `ArraysSupport.vectorizedHashCode` has been changed 
to take an offset + length, an initial value and the logical basic type of the 
array elements, which means any necessary scaling of index and length needs to 
be taken care of before calling the intrinsic. This makes the implementation 
more flexible at no measurable performance cost. Overall the refactoring might 
have reduced complexity a bit.
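
For illustration, a caller of the reworked entry point might look like this 
(a sketch only; the exact parameter names, order and basic-type constants are 
assumptions based on the description above):

    // StringUTF16.hashCode: the value array stores two bytes per char, so
    // the caller pre-scales the logical length to value.length >> 1 and
    // passes the logical element type (char) plus the String seed of 0:
    public static int hashCode(byte[] value) {
        return ArraysSupport.vectorizedHashCode(value, 0, value.length >> 1,
                                                0, ArraysSupport.T_CHAR);
    }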

Reviewers might observe that nothing currently passes anything but `0` and 
`length` to `vectorizedHashCode` outside of the simple sanity test I've added, 
but I've verified this feature can be used to some effect elsewhere in the 
JDK, e.g.: 
https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:zipcoder-hashcode?expand=1
(which improves the speed of opening `ZipFile` by a small percentage in 
microbenchmarks).
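
To illustrate the non-zero-offset case this enables, a ZipCoder-style caller 
could hash a subrange of a shared byte array in place (hypothetical usage; 
`bytes`, `off` and `len` are stand-ins, and `T_BOOLEAN` as the unsigned-byte 
type follows the convention discussed later in this thread):

    // Hash bytes [off, off + len) with unsigned-byte semantics and the
    // String seed of 0, without copying the subrange out first:
    int h = ArraysSupport.vectorizedHashCode(bytes, off, len, 0,
                                             ArraysSupport.T_BOOLEAN);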

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2023-01-09 Thread Claes Redestad
On Thu, 22 Dec 2022 13:10:02 GMT, Claes Redestad  wrote:

>> @cl4es Thanks for passing the constant node through, the code looks much 
>> cleaner now. The attached patch should handle the signed bytes/shorts as 
>> well. Please take a look. 
>> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch)
>
> I ran tests and some quick microbenchmarking to validate @sviswa7's patch to 
> activate vectorization for `short` and `byte` arrays and it looks good:
> 
> Before:
> 
> Benchmark                    (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes          10000  avgt    5  7845.586 ± 23.440  ns/op
> ArraysHashCode.chars          10000  avgt    5  1203.163 ± 11.995  ns/op
> ArraysHashCode.ints           10000  avgt    5  1131.915 ±  7.843  ns/op
> ArraysHashCode.multibytes     10000  avgt    5  4136.487 ±  5.790  ns/op
> ArraysHashCode.multichars     10000  avgt    5   671.328 ± 17.629  ns/op
> ArraysHashCode.multiints      10000  avgt    5   699.051 ±  8.135  ns/op
> ArraysHashCode.multishorts    10000  avgt    5  4139.300 ± 10.633  ns/op
> ArraysHashCode.shorts         10000  avgt    5  7844.019 ± 26.071  ns/op
> 
> After:
> 
> Benchmark                    (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes          10000  avgt    5  1193.208 ±  1.965  ns/op
> ArraysHashCode.chars          10000  avgt    5  1193.311 ±  5.941  ns/op
> ArraysHashCode.ints           10000  avgt    5  1132.592 ± 10.410  ns/op
> ArraysHashCode.multibytes     10000  avgt    5   657.343 ± 25.343  ns/op
> ArraysHashCode.multichars     10000  avgt    5   672.668 ±  5.229  ns/op
> ArraysHashCode.multiints      10000  avgt    5   697.143 ±  3.929  ns/op
> ArraysHashCode.multishorts    10000  avgt    5   666.738 ± 12.236  ns/op
> ArraysHashCode.shorts         10000  avgt    5  1193.563 ±  5.449  ns/op

> @cl4es There seems to be a failure on the windows-x64 platform pre-submit 
> tests. Could you please take a look?

It looks like the 
`as_Address(ExternalAddress(StubRoutines::x86::arrays_hashcode_powers_of_31() + 
...)` trick is running into a reachability issue on Windows, hitting the 
`assert(reachable(adr), "must be");` in `macroAssembler_x86.cpp`. It might be 
related to ASLR or some quirk of the VS compiler. I'll investigate.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2023-01-06 Thread Sandhya Viswanathan
On Thu, 22 Dec 2022 13:10:02 GMT, Claes Redestad  wrote:

>> @cl4es Thanks for passing the constant node through, the code looks much 
>> cleaner now. The attached patch should handle the signed bytes/shorts as 
>> well. Please take a look. 
>> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch)
>
> I ran tests and some quick microbenchmarking to validate @sviswa7's patch to 
> activate vectorization for `short` and `byte` arrays and it looks good:
> 
> Before:
> 
> Benchmark                    (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes          10000  avgt    5  7845.586 ± 23.440  ns/op
> ArraysHashCode.chars          10000  avgt    5  1203.163 ± 11.995  ns/op
> ArraysHashCode.ints           10000  avgt    5  1131.915 ±  7.843  ns/op
> ArraysHashCode.multibytes     10000  avgt    5  4136.487 ±  5.790  ns/op
> ArraysHashCode.multichars     10000  avgt    5   671.328 ± 17.629  ns/op
> ArraysHashCode.multiints      10000  avgt    5   699.051 ±  8.135  ns/op
> ArraysHashCode.multishorts    10000  avgt    5  4139.300 ± 10.633  ns/op
> ArraysHashCode.shorts         10000  avgt    5  7844.019 ± 26.071  ns/op
> 
> After:
> 
> Benchmark                    (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes          10000  avgt    5  1193.208 ±  1.965  ns/op
> ArraysHashCode.chars          10000  avgt    5  1193.311 ±  5.941  ns/op
> ArraysHashCode.ints           10000  avgt    5  1132.592 ± 10.410  ns/op
> ArraysHashCode.multibytes     10000  avgt    5   657.343 ± 25.343  ns/op
> ArraysHashCode.multichars    10000  avgt    5   672.668 ±  5.229  ns/op
> ArraysHashCode.multiints      10000  avgt    5   697.143 ±  3.929  ns/op
> ArraysHashCode.multishorts    10000  avgt    5   666.738 ± 12.236  ns/op
> ArraysHashCode.shorts         10000  avgt    5  1193.563 ±  5.449  ns/op

@cl4es There seems to be a failure on the windows-x64 platform pre-submit 
tests. Could you please take a look?

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-22 Thread Claes Redestad
On Wed, 21 Dec 2022 00:11:34 GMT, Sandhya Viswanathan wrote:

>> Passing the constant node through as an input as suggested by @iwanowww and 
>> @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a 
>> significant chunk of code and a little bit of complexity from the proposed 
>> patch.
>
> @cl4es Thanks for passing the constant node through, the code looks much 
> cleaner now. The attached patch should handle the signed bytes/shorts as 
> well. Please take a look. 
> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch)

I ran tests and some quick microbenchmarking to validate @sviswa7's patch to 
activate vectorization for `short` and `byte` arrays and it looks good:

Before:

Benchmark                    (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes          10000  avgt    5  7845.586 ± 23.440  ns/op
ArraysHashCode.chars          10000  avgt    5  1203.163 ± 11.995  ns/op
ArraysHashCode.ints           10000  avgt    5  1131.915 ±  7.843  ns/op
ArraysHashCode.multibytes     10000  avgt    5  4136.487 ±  5.790  ns/op
ArraysHashCode.multichars     10000  avgt    5   671.328 ± 17.629  ns/op
ArraysHashCode.multiints      10000  avgt    5   699.051 ±  8.135  ns/op
ArraysHashCode.multishorts    10000  avgt    5  4139.300 ± 10.633  ns/op
ArraysHashCode.shorts         10000  avgt    5  7844.019 ± 26.071  ns/op

After:

Benchmark                    (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes          10000  avgt    5  1193.208 ±  1.965  ns/op
ArraysHashCode.chars          10000  avgt    5  1193.311 ±  5.941  ns/op
ArraysHashCode.ints           10000  avgt    5  1132.592 ± 10.410  ns/op
ArraysHashCode.multibytes     10000  avgt    5   657.343 ± 25.343  ns/op
ArraysHashCode.multichars     10000  avgt    5   672.668 ±  5.229  ns/op
ArraysHashCode.multiints      10000  avgt    5   697.143 ±  3.929  ns/op
ArraysHashCode.multishorts    10000  avgt    5   666.738 ± 12.236  ns/op
ArraysHashCode.shorts         10000  avgt    5  1193.563 ±  5.449  ns/op

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-20 Thread Sandhya Viswanathan
On Tue, 20 Dec 2022 19:52:34 GMT, Claes Redestad  wrote:

>> src/java.base/share/classes/java/lang/StringUTF16.java line 418:
>> 
>>> 416: return 0;
>>> 417: } else {
>>> 418: return ArraysSupport.vectorizedHashCode(value, 
>>> ArraysSupport.UTF16);
>> 
>> Special case for 1 missing here.
>
> Intentionally left out. Array length is always even for `UTF16` arrays. We 
> could add a case for `2` that'd return `getChar(bytes, 0)`, but I didn't 
> see much of a win when I tested this.

I do see a 1.5x gain with this special case added:

    return switch (value.length) {
        case 0 -> 0;
        case 2 -> getChar(value, 0);
        default -> ArraysSupport.vectorizedHashCode(value,
                                                    ArraysSupport.UTF16);
    };

before: 0.987 ns/op
after:  0.640 ns/op

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-20 Thread Sandhya Viswanathan
On Tue, 20 Dec 2022 21:11:18 GMT, Claes Redestad  wrote:

>>> How far off is this ...?
>> 
>> Back then it looked way too constrained (tight constraints on code shapes). 
>> But I considered it as a generally applicable optimization. 
>> 
>>>  ... do you think it'll be able to match the efficiency we see here with a 
>>> memoized coefficient table etc?
>> 
>> Yes, it is able to build the constant table at runtime when folding 
>> multiplications of constant coefficients produced during loop unrolling and 
>> then packing scalars into a constant vector.
>> 
>> Moreover, briefly looking at the code shape, the vectorizer would produce a 
>> more optimal loop shape (pre-loop would align vector accesses and would use 
>> 512-bit vectors when available; vector post-loop could help as well).
>
> Passing the constant node through as an input as suggested by @iwanowww and 
> @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a 
> significant chunk of code and a little bit of complexity from the proposed 
> patch.

@cl4es Thanks for passing the constant node through, the code looks much 
cleaner now. The attached patch should handle the signed bytes/shorts as well. 
Please take a look. 
[signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch)

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-20 Thread Claes Redestad
On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov  wrote:

>>> Also, I'd like to note that C2 auto-vectorization support is not too far 
>>> away from being able to optimize hash code computations. At some point, I 
>>> was able to achieve some promising results with modest tweaking of 
>>> SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt 
>>> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/
>> 
>> Intriguing. How far off is this - and do you think it'll be able to match 
>> the efficiency we see here with a memoized coefficient table etc?
>> 
>> If we turn this intrinsic into a stub we might also be able to reuse the 
>> optimization in other places, including from within the VM (calculating 
>> String hashCodes happen in a couple of places, including String 
>> deduplication). So I think there are still a few compelling reasons to go 
>> the manual route and continue on this path.
>
>> How far off is this ...?
> 
> Back then it looked way too constrained (tight constraints on code shapes). 
> But I considered it as a generally applicable optimization. 
> 
>>  ... do you think it'll be able to match the efficiency we see here with a 
>> memoized coefficient table etc?
> 
> Yes, it is able to build the constant table at runtime when folding 
> multiplications of constant coefficients produced during loop unrolling and 
> then packing scalars into a constant vector.
> 
> Moreover, briefly looking at the code shape, the vectorizer would produce a 
> more optimal loop shape (pre-loop would align vector accesses and would use 
> 512-bit vectors when available; vector post-loop could help as well).

Passing the constant node through as an input as suggested by @iwanowww and 
@sviswa7 meant we could eliminate most of the `instruct` blocks, removing a 
significant chunk of code and a little bit of complexity from the proposed 
patch.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-20 Thread Claes Redestad
On Fri, 16 Dec 2022 23:00:53 GMT, Sandhya Viswanathan wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/cpu/x86/stubRoutines_x86.cpp line 230:
> 
>> 228: #endif // _LP64
>> 229: 
>> 230: jint StubRoutines::x86::_arrays_hashcode_powers_of_31[] =
> 
> This should be declared only for LP64.

Hmm, I guess the same goes for all the new `arrays_hashcode` methods in 
`c2_MacroAssembler_x86`, since we only wire this up properly on 64-bit. I'll 
make a pass to put `_LP64` guards around all the new methods.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-20 Thread Claes Redestad
On Fri, 16 Dec 2022 22:58:23 GMT, Sandhya Viswanathan wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/cpu/x86/vm_version_x86.cpp line 1671:
> 
>> 1669:   }
>> 1670:   if (UseAVX >= 2) {
>> 1671: FLAG_SET_ERGO_IF_DEFAULT(UseVectorizedHashCodeIntrinsic, true);
> 
> This could be just FLAG_SET_DEFAULT instead of FLAG_SET_ERGO_IF_DEFAULT.

Right, it seems HW-dependent intrinsics generally don't mark that they've been 
enabled ergonomically, but are rather just turned on by default when support 
is available.

> src/java.base/share/classes/java/lang/StringUTF16.java line 418:
> 
>> 416: return 0;
>> 417: } else {
>> 418: return ArraysSupport.vectorizedHashCode(value, 
>> ArraysSupport.UTF16);
> 
> Special case for 1 missing here.

Intentionally left out. Array length is always even for `UTF16` arrays. We 
could add a case for `2` that'd return `getChar(bytes, 0)`, but I didn't see 
much of a win when I tested this.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-16 Thread Sandhya Viswanathan
On Sun, 13 Nov 2022 20:57:44 GMT, Claes Redestad  wrote:

>> src/hotspot/cpu/x86/x86_64.ad line 12073:
>> 
>>> 12071:  legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, 
>>> rRegI tmp3, rFlagsReg cr)
>>> 12072: %{
>>> 12073:   predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == 
>>> VectorizedHashCodeNode::LATIN1);
>> 
>> If you represent `VectorizedHashCodeNode::mode()` as an input, it would 
>> allow to abstract over supported modes and come up with a single AD 
>> instruction. Take a look at `VectorMaskCmp` for an example (not a perfect 
>> one though since it has both _predicate member and constant input which is 
>> redundant).
>
> Thanks for the pointer, I'll check it out!

I agree with Vladimir, adding mode as another input will help. Please take a 
look at the RoundDoubleModeNode.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-16 Thread Sandhya Viswanathan
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad  wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checks in the intrinsic. Mainly do the 
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes        10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars         1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars        10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars       100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars     10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints         10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints        100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints      10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts        1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts       10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts      100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes        10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars         1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars        10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars       100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars     10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints         10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints        100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints      10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts        1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts       10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts      100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the ones currently vectorized). 
>> I aim to fix `byte` and `short` cases before integrating, though it might be 
>> acceptable to hand that off as 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-15 Thread Ismael Juma
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad  wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checks in the intrinsic. Mainly do the 
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes        10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars         1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars        10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars       100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars     10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints         10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints        100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints      10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts        1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts       10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts      100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes        10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars         1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars        10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars       100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars     10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints         10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints        100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints      10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts        1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts       10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts      100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the ones currently vectorized). 
>> I aim to fix `byte` and `short` cases before integrating, though it might be 
>> acceptable to hand that off as 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-12-15 Thread Ludovic Henry
On Wed, 16 Nov 2022 18:18:55 GMT, Claes Redestad  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> I'm getting pulled into other tasks and would request for this to be either 
> accepted as-is, rejected or picked up by someone else to rewrite it to 
> something that can be accepted.
> 
> Obviously I'm biased towards acceptance: While imperfect, it provides 
> improved testing - both functional and performance-wise - and establishes a 
> significantly improved benchmark for more future-proof solutions to beat. 
> There are many ways to iteratively improve upon this solution, some of which 
> would even simplify the implementation. But in the face of upcoming changes 
> that might allow C2 to optimize these kinds of loops without intrinsic 
> support I am not sure spending more time on perfecting the current patch is 
> worth our while.
> 
> Rejecting it might be the reasonable thing to do, too, especially if the C2 
> loop optimizations @iwanowww points out might be coming around sooner rather 
> than later. Even if that's not coming soon, the PR at hand adds a chunk of 
> complexity for the compiler team to maintain.

@cl4es @iwanowww is that change still good to go forward? What else would you 
like to see for it to be merged? Thanks!

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-21 Thread Sandhya Viswanathan
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad  wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checks in the intrinsic. Mainly do the 
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes        10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars         1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars        10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars       100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars     10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints         10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints        100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints      10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts        1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts       10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts      100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes        10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars         1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars        10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars       100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars     10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints         10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints        100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints      10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts        1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts       10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts      100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the ones currently vectorized). 
>> I aim to fix `byte` and `short` cases before integrating, though it might be 
>> acceptable to hand that off as 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-16 Thread Claes Redestad
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad  wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checks in the intrinsic. Mainly do the 
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes        10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars         1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars        10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars       100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars     10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints         10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints        100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints      10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts        1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts       10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts      100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes        10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars         1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars        10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars       100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars     10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints         10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints        100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints      10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts        1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts       10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts      100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the ones currently vectorized). 
>> I aim to fix `byte` and `short` cases before integrating, though it might be 
>> acceptable to hand that off as 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-14 Thread Vladimir Ivanov
On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad  wrote:

> How far off is this ...?

Back then it looked way too constrained (tight constraints on code shapes). But 
I considered it as a generally applicable optimization. 

>  ... do you think it'll be able to match the efficiency we see here with a 
> memoized coefficient table etc?

Yes, it is able to build the constant table at runtime when folding 
multiplications of constant coefficients produced during loop unrolling and 
then packing scalars into a constant vector.

Moreover, briefly looking at the code shape, the vectorizer would produce a 
more optimal loop shape (pre-loop would align vector accesses and would use 
512-bit vectors when available; vector post-loop could help as well).

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-14 Thread Vladimir Ivanov
On Sun, 13 Nov 2022 19:50:46 GMT, Claes Redestad  wrote:

> ... several challenges were brought to the table, including how to deal 
> with all the different contingencies that might result from a safepoint, 
> including deoptimization.

FTR if the intrinsic is represented as a stand-alone stub, there's no need to 
care about deoptimization. (In such cases, deopts happen on return from the 
stub.) It wouldn't be allowed to be a leaf call anymore, but a safepoint check 
and an OOP map would do the job.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-14 Thread Vladimir Ivanov
On Sun, 13 Nov 2022 21:01:21 GMT, Claes Redestad  wrote:

>> src/hotspot/share/opto/intrinsicnode.hpp line 175:
>> 
>>> 173:   // as well as adjusting for special treatment of various encoding of 
>>> String
>>> 174:   // arrays. Must correspond to declared constants in 
>>> jdk.internal.util.ArraysSupport
>>> 175:   typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, 
>>> SHORT = 4, INT = 5 } HashMode;
>> 
>> I question the need for `LATIN1` and `UTF16` modes. If you lift some of 
>> input adjustments (initial value and input size) into JDK, it becomes 
>> indistinguishable from `BYTE`/`CHAR`.  Then you can reuse existing constants 
>> for basic types.
>
> UTF16 can easily be replaced with CHAR by lifting up the shift as you say, 
> but LATIN1 needs to be distinguished from BYTE since the former needs 
> unsigned semantics. Modeling in a signed/unsigned input is possible, but I 
> figured we might as well call it UNSIGNED_BYTE and decouple it logically from 
> String::LATIN1.

FTR `T_BOOLEAN` effectively represents unsigned byte.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-14 Thread Ludovic Henry
On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad  wrote:

> Also, I'd like to note that C2 auto-vectorization support is not too far away 
> from being able to optimize hash code computations. At some point, I was able 
> to achieve some promising results with modest tweaking of SuperWord pass:
> https://github.com/iwanowww/jdk/blob/superword/notes.txt
> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/

That would be extremely helpful not just for this case but for many other cases 
that today require the Vector API or handrolled intrinsics. For cases that 
would be great to support, a good guide is the [gcc autovectorization 
support](https://gcc.gnu.org/projects/tree-ssa/vectorization.html) given they 
use SLP as well.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Tobias Hartmann
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad  wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify 
>> polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
>> method. To make this work I've harmonized how they are invoked so that 
>> there's less special handling and checks in the intrinsic. Mainly do the 
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the 
>> factor and start values which are now hard-coded (always 31, and a start 
>> value of either one for `Arrays` or zero for `String`). It seems somewhat 
>> premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all 
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
>> few trivial method calls which increase the call stack depth, so surprises 
>> cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look 
>> like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as 
>> ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes        10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars         1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars        10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars       100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars     10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints         10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints        100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints      10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts        1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts       10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts      100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark               (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes         1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes        10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes       100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes     10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars         1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars        10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars       100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars     10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints          1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints         10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints        100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints      10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts        1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts       10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts      100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts    10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and 
>> faster on large inputs for `char` and `int` (the ones currently vectorized). 
>> I aim to fix `byte` and `short` cases before integrating, though it might be 
>> acceptable to hand that off as 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 02:08:19 GMT, Vladimir Ivanov  wrote:

> Also, I'd like to note that C2 auto-vectorization support is not too far away 
> from being able to optimize hash code computations. At some point, I was able 
> to achieve some promising results with modest tweaking of SuperWord pass: 
> https://github.com/iwanowww/jdk/blob/superword/notes.txt 
> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/

Intriguing. How far off is this - and do you think it'll be able to match the 
efficiency we see here with a memoized coefficient table etc?

If we turn this intrinsic into a stub we might also be able to reuse the 
optimization in other places, including from within the VM (calculating String 
hashCodes happen in a couple of places, including String deduplication). So I 
think there are still a few compelling reasons to go the manual route and 
continue on this path.
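
For context, the hand-unrolling that the memoized coefficient table 
generalizes looks roughly like this (a sketch, not the committed code): each 
4-element step folds the per-element multipliers 31^3, 31^2, 31 and 1 into 
constants, which is what the intrinsic memoizes in a powers-of-31 table and 
widens into vector registers:

    static int hash(int[] a) {
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            // 31^4 * h + 31^3 * a[i] + 31^2 * a[i+1] + 31 * a[i+2] + a[i+3]
            h = 923521 * h + 29791 * a[i] + 961 * a[i + 1]
                + 31 * a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) {  // scalar tail for the remainder
            h = 31 * h + a[i];
        }
        return h;
    }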

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 01:35:39 GMT, Vladimir Ivanov  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/java.base/share/classes/jdk/internal/util/ArraysSupport.java line 185:
> 
>> 183:  */
>> 184: @IntrinsicCandidate
>> 185: public static int vectorizedHashCode(Object array, byte mode) {
> 
> The intrinsic can be generalized by:
> 1. expanding the `array` input into `base`, `offset`, and `length`. It will 
> make it applicable to any type of data source (on-heap/off-heap 
> `ByteBuffer`s, `MemorySegment`s). 
> 2. passing initial value as a parameter.
> 
> Basically, hash code computation can be represented as a reduction: 
> `reduce(initial_val, (acc, v) -> 31 * acc + v, data)`. You hardcode the 
> operation, but can make the rest variable. 
> 
> (Even the operation can be slightly generalized if you make 31 variable and 
> then precompute the table at runtime. But right now I don't see much value in 
> investing into that.)

I've been thinking of generalizing it along those lines as a possible 
follow-up: get the base operation on entire arrays in, then generalize 
carefully while ensuring that doesn't add too much complexity, introduce 
unforeseen overheads, etc.
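
Spelled out in Java, the reduction described above is simply the following 
(reference semantics only, with the factor 31 hardcoded as in the current 
patch):

    // reduce(initial_val, (acc, v) -> 31 * acc + v, data)
    static int polyHash(int initialValue, int[] data) {
        int acc = initialValue;
        for (int v : data) {
            acc = 31 * acc + v;
        }
        return acc;
    }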

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 01:28:51 GMT, Vladimir Ivanov  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/share/opto/intrinsicnode.hpp line 175:
> 
>> 173:   // as well as adjusting for special treatment of various encoding of 
>> String
>> 174:   // arrays. Must correspond to declared constants in 
>> jdk.internal.util.ArraysSupport
>> 175:   typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, 
>> SHORT = 4, INT = 5 } HashMode;
> 
> I question the need for `LATIN1` and `UTF16` modes. If you lift some of input 
> adjustments (initial value and input size) into JDK, it becomes 
> indistinguishable from `BYTE`/`CHAR`.  Then you can reuse existing constants 
> for basic types.

UTF16 can easily be replaced with CHAR by lifting up the shift as you say, but 
LATIN1 needs to be distinguished from BYTE since the former needs unsigned 
semantics. Modeling in a signed/unsigned input is possible, but I figured we 
might as well call it UNSIGNED_BYTE and decouple it logically from 
String::LATIN1.
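
The semantic difference is easy to spell out in Java (illustrative only): 
LATIN1 hashing widens each byte unsigned, while `Arrays.hashCode(byte[])` 
widens it signed:

    static int hashLatin1(byte[] a) {   // LATIN1: unsigned, String seed 0
        int h = 0;
        for (byte b : a) h = 31 * h + (b & 0xff);
        return h;
    }

    static int hashBytes(byte[] a) {    // BYTE: signed, Arrays seed 1
        int h = 1;
        for (byte b : a) h = 31 * h + b;
        return h;
    }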

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 01:10:50 GMT, Vladimir Ivanov  wrote:

>> src/hotspot/cpu/x86/x86_64.ad line 12081:
>> 
>>> 12079:   format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result   // KILL 
>>> all" %}
>>> 12080:   ins_encode %{
>>> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, 
>>> $result$$Register,
>> 
>> What's the motivation to keep the stub code inlined instead of calling into 
>> a stand-alone pre-generated version of the stub?
>
> Also, switching to stand-alone stubs would enable us to compose a generic 
> stub version (as we do in `StubGenerator::generate_generic_copy()` for 
> arraycopy stubs).  But it would be even better to do the dispatching on JDK 
> side and always pass a constant into the intrinsic.

There is no single reason this code evolved the way it did. @luhenry worked on 
it initially and was guided towards intrinsifying what was originally a 
JDK-level unrolling. Then I took over and have tried to find a path of least 
resistance from there. @luhenry and I have discussed rewriting part or all of 
this as a stub, for various reasons. I've been scoping that out, but with no 
experience writing stub versions I figured perhaps this could be done in a 
follow-up. If you think there's a compelling enough reason to rewrite this as 
a stub up front I can try and find the time to do so.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 01:06:27 GMT, Vladimir Ivanov  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/cpu/x86/x86_64.ad line 12073:
> 
>> 12071:  legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, 
>> rRegI tmp3, rFlagsReg cr)
>> 12072: %{
>> 12073:   predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == 
>> VectorizedHashCodeNode::LATIN1);
> 
> If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow 
> to abstract over supported modes and come up with a single AD instruction. 
> Take a look at `VectorMaskCmp` for an example (not a perfect one though since 
> it has both _predicate member and constant input which is redundant).

Thanks for the pointer, I'll check it out!

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-13 Thread Claes Redestad
On Sat, 12 Nov 2022 15:27:09 GMT, Piotr Tarsa  wrote:

> Out of curiosity: how does this intrinsic affect time-to-safepoint? Does it 
> matter? I don't see any safepoint poll, but then I don't precisely know how 
> safepoints work, so I could be missing something. Theoretically, with 2^31 
> elements count limit in Java, the whole computation is always a fraction of a 
> second, but maybe it would matter with e.g. ZGC, which could ask for a 
> safepoint while the thread is hashing an array with 2 billion ints.

This intrinsic - like several others before it - does not add safepoint checks. 
There's at least one RFE filed to address this deficiency, and hopefully we can 
come up with a shared strategy to interleave safepoint checks in the various 
intrinsics that operate over Strings and arrays: 
https://bugs.openjdk.org/browse/JDK-8233300

When I brought this up in an internal discussion with @TobiHartmann and @fisk 
last week, several challenges were brought to the table, including how to deal 
with all the different contingencies that might result from a safepoint, 
including deoptimization. I think enhancing these intrinsics to poll for 
safepoints is important to tackle tail-end latencies for extremely 
latency-sensitive applications. In the meantime those applications could 
(should?) turn off such intrinsics, avoid huge arrays altogether, or both.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-11 Thread Vladimir Ivanov
On Sat, 12 Nov 2022 00:55:56 GMT, Vladimir Ivanov  wrote:

>> Claes Redestad has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/cpu/x86/x86_64.ad line 12081:
> 
>> 12079:   format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result   // KILL all" %}
>> 12080:   ins_encode %{
>> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register,
> 
> What's the motivation to keep the stub code inlined instead of calling into a 
> stand-alone pre-generated version of the stub?

Also, switching to stand-alone stubs would enable us to compose a generic stub 
version (as we do in `StubGenerator::generate_generic_copy()` for arraycopy 
stubs). But it would be even better to do the dispatching on the JDK side and 
always pass a constant into the intrinsic.

-

PR: https://git.openjdk.org/jdk/pull/10847


Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

2022-11-11 Thread Claes Redestad
> Continuing the work initiated by @luhenry to unroll and then intrinsify 
> polynomial hash loops.
> 
> I've rewired the library changes to route via a single `@IntrinsicCandidate` 
> method. To make this work I've harmonized how they are invoked so that 
> there's less special handling and fewer checks in the intrinsic - mainly by 
> doing the null-check outside of the intrinsic for the `Arrays.hashCode` cases.
> 
> Having a centralized entry point means it'll be easier to parameterize the 
> factor and start values, which are currently hard-coded (always 31, and a 
> start value of either one for `Arrays` or zero for `String`). It seems 
> somewhat premature to parameterize this up front.
> 
> The current implementation is performance-neutral on microbenchmarks on all 
> tested platforms (x64, aarch64) when the intrinsic is not enabled. We do add a 
> few trivial method calls which increase the call stack depth, so surprises 
> cannot be ruled out on complex workloads.
> 
> With the most recent fixes the x64 intrinsic results on my workstation look 
> like this:
> 
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
> 
> Baseline:
> 
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
> 
> I.e. no measurable overhead compared to baseline even for `size == 1`.
> 
> The vectorized code now nominally works for all unsigned cases as well as 
> ints, though more testing would be good.
> 
> Benchmark for `Arrays.hashCode`:
> 
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
> 
> Baseline:
> 
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
> 
> 
> As we can see, the `Arrays` intrinsics are faster for small inputs, and 
> faster on large inputs for `char` and `int` (the ones currently vectorized). 
> I aim to fix the `byte` and `short` cases before integrating, though it might 
> be acceptable to hand those off as follow-ups so as not to further delay 
> integrating this enhancement.
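
For readers who want to see the shape of the transformation, here is a minimal 
sketch of the kind of hand-unrolling described above (illustrative only, not 
the actual JDK code), applied to the Latin-1 case where bytes are hashed as 
unsigned values:

    // Illustrative 4-way unrolling of h = 31*h + e, using the identity
    // h' = 31^4*h + 31^3*e0 + 31^2*e1 + 31*e2 + e3. The constants are exact
    // powers of 31; unrolling exposes independent multiplies that a JIT or a
    // hand-written intrinsic can turn into vector operations.
    static int hashUnrolled4(byte[] a) {
        int h = 0; // String.hashCode starts from 0; Arrays.hashCode starts from 1
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            h = 923521 * h                    // 31^4
              +  29791 * (a[i]     & 0xff)    // 31^3
              +    961 * (a[i + 1] & 0xff)    // 31^2
              +     31 * (a[i + 2] & 0xff)
              +          (a[i + 3] & 0xff);
        }
        for (; i < a.length; i++) {           // scalar tail for the remainder
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }

A vectorizing compiler (or the hand-written intrinsic) can then pack the 31^k 
multipliers into a constant vector and process several lanes per iteration.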

Claes Redestad has updated the pull request incrementally with one additional 
commit since the last revision:

  Missing & 0xff in StringLatin1::hashCode