Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov wrote: >>> Also, I'd like to note that C2 auto-vectorization support is not too far >>> away from being able to optimize hash code computations. At some point, I >>> was able to achieve some promising results with modest tweaking of >>> SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt >>> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ >> >> Intriguing. How far off is this - and do you think it'll be able to match >> the efficiency we see here with a memoized coefficient table etc? >> >> If we turn this intrinsic into a stub we might also be able to reuse the >> optimization in other places, including from within the VM (calculating >> String hashCodes happen in a couple of places, including String >> deduplication). So I think there are still a few compelling reasons to go >> the manual route and continue on this path. > >> How far off is this ...? > > Back then it looked way too constrained (tight constraints on code shapes). > But I considered it as a generally applicable optimization. > >> ... do you think it'll be able to match the efficiency we see here with a >> memoized coefficient table etc? > > Yes, it is able to build the constant table at runtime when folding > multiplications of constant coefficients produced during loop unrolling and > then packing scalars into a constant vector. > > Moreover, briefly looking at the code shape, the vectorizer would produce a > more optimal loop shape (pre-loop would align vector accesses and would use > 512-bit vectors when available; vector post-loop could help as well). I've opted to include the changes spurred by @iwanowww's comments since they led to a number of revisions to the intrinsified method API, and it would be strange to introduce an intrinsified method just to change the API drastically in a follow-up. Basically `ArraysSupport.vectorizedHashCode` has been changed to take an offset + length, an initial value and the logical basic type of the array elements. This means any necessary scaling of index and length needs to be taken care of before calling the intrinsic. This makes the implementation more flexible at no measurable performance cost. Overall the refactoring might have reduced complexity a bit. Reviewers might observe that nothing is currently passing anything but `0` and `length` to `vectorizedHashCode` outside of the simple sanity test I've added, but I've verified this feature can be used to some effect elsewhere in the JDK, e.g.: https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:zipcoder-hashcode?expand=1 (which improves speed of opening `ZipFile` by a small percentage in microbenchmarks). - PR: https://git.openjdk.org/jdk/pull/10847
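To make the revised contract concrete, here is a minimal, pure-Java sketch of the semantics described above (the parameter names and the fact that only the unsigned-byte/Latin-1 flavor is shown are assumptions for illustration; the real method additionally takes the logical element type and is intrinsified by C2):

```java
// Plain-Java reference of the described contract: fold `length` elements of
// `array`, starting at `fromIndex`, into `initialValue` with the usual
// 31 * h + v step. Callers do any index/length scaling themselves
// (e.g. halving a UTF-16 byte[] length) before calling the intrinsic.
static int hashLatin1(byte[] array, int fromIndex, int length, int initialValue) {
    int h = initialValue;
    for (int i = fromIndex; i < fromIndex + length; i++) {
        h = 31 * h + (array[i] & 0xff); // unsigned widening for Latin-1 data
    }
    return h;
}
```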
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Thu, 22 Dec 2022 13:10:02 GMT, Claes Redestad wrote: >> @cl4es Thanks for passing the constant node through, the code looks much >> cleaner now. The attached patch should handle the signed bytes/shorts as >> well. Please take a look. >> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch) > > I ran tests and some quick microbenchmarking to validate @sviswa7's patch to > activate vectorization for `short` and `byte` arrays and it looks good: > > Before: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 7845.586 ± 23.440 ns/op > ArraysHashCode.chars 1 avgt 5 1203.163 ± 11.995 ns/op > ArraysHashCode.ints 1 avgt 5 1131.915 ± 7.843 ns/op > ArraysHashCode.multibytes 1 avgt 5 4136.487 ± 5.790 ns/op > ArraysHashCode.multichars 1 avgt 5 671.328 ± 17.629 ns/op > ArraysHashCode.multiints 1 avgt 5 699.051 ± 8.135 ns/op > ArraysHashCode.multishorts 1 avgt 5 4139.300 ± 10.633 ns/op > ArraysHashCode.shorts 1 avgt 5 7844.019 ± 26.071 ns/op > > > After: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1193.208 ± 1.965 ns/op > ArraysHashCode.chars 1 avgt 5 1193.311 ± 5.941 ns/op > ArraysHashCode.ints 1 avgt 5 1132.592 ± 10.410 ns/op > ArraysHashCode.multibytes 1 avgt 5 657.343 ± 25.343 ns/op > ArraysHashCode.multichars 1 avgt 5 672.668 ± 5.229 ns/op > ArraysHashCode.multiints 1 avgt 5 697.143 ± 3.929 ns/op > ArraysHashCode.multishorts 1 avgt 5 666.738 ± 12.236 ns/op > ArraysHashCode.shorts 1 avgt 5 1193.563 ± 5.449 ns/op > @cl4es There seems to be a failure in the windows-x64 platform pre-submit tests. > Could you please take a look? It looks like the `as_Address(ExternalAddress(StubRoutines::x86::arrays_hashcode_powers_of_31() + ...))` trick is running into some reachability issue on Windows, hitting the `assert(reachable(adr), "must be");` in `macroAssembler_x86.cpp`. Might be related to ASLR or some quirk of the VS compiler. I'll investigate. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Thu, 22 Dec 2022 13:10:02 GMT, Claes Redestad wrote: >> @cl4es Thanks for passing the constant node through, the code looks much >> cleaner now. The attached patch should handle the signed bytes/shorts as >> well. Please take a look. >> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch) > > I ran tests and some quick microbenchmarking to validate @sviswa7's patch to > activate vectorization for `short` and `byte` arrays and it looks good: > > Before: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 7845.586 ± 23.440 ns/op > ArraysHashCode.chars 1 avgt 5 1203.163 ± 11.995 ns/op > ArraysHashCode.ints 1 avgt 5 1131.915 ± 7.843 ns/op > ArraysHashCode.multibytes 1 avgt 5 4136.487 ± 5.790 ns/op > ArraysHashCode.multichars 1 avgt 5 671.328 ± 17.629 ns/op > ArraysHashCode.multiints 1 avgt 5 699.051 ± 8.135 ns/op > ArraysHashCode.multishorts 1 avgt 5 4139.300 ± 10.633 ns/op > ArraysHashCode.shorts 1 avgt 5 7844.019 ± 26.071 ns/op > > > After: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1193.208 ± 1.965 ns/op > ArraysHashCode.chars 1 avgt 5 1193.311 ± 5.941 ns/op > ArraysHashCode.ints 1 avgt 5 1132.592 ± 10.410 ns/op > ArraysHashCode.multibytes 1 avgt 5 657.343 ± 25.343 ns/op > ArraysHashCode.multichars 1 avgt 5 672.668 ± 5.229 ns/op > ArraysHashCode.multiints 1 avgt 5 697.143 ± 3.929 ns/op > ArraysHashCode.multishorts 1 avgt 5 666.738 ± 12.236 ns/op > ArraysHashCode.shorts 1 avgt 5 1193.563 ± 5.449 ns/op @cl4es There seems to be a failure in the windows-x64 platform pre-submit tests. Could you please take a look? - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Wed, 21 Dec 2022 00:11:34 GMT, Sandhya Viswanathan wrote: >> Passing the constant node through as an input as suggested by @iwanowww and >> @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a >> significant chunk of code and a little bit of complexity from the proposed >> patch. > > @cl4es Thanks for passing the constant node through, the code looks much > cleaner now. The attached patch should handle the signed bytes/shorts as > well. Please take a look. > [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch) I ran tests and some quick microbenchmarking to validate @sviswa7's patch to activate vectorization for `short` and `byte` arrays and it looks good:

Before:

    Benchmark                   (size)  Mode  Cnt     Score    Error  Units
    ArraysHashCode.bytes             1  avgt    5  7845.586 ± 23.440  ns/op
    ArraysHashCode.chars             1  avgt    5  1203.163 ± 11.995  ns/op
    ArraysHashCode.ints              1  avgt    5  1131.915 ±  7.843  ns/op
    ArraysHashCode.multibytes        1  avgt    5  4136.487 ±  5.790  ns/op
    ArraysHashCode.multichars        1  avgt    5   671.328 ± 17.629  ns/op
    ArraysHashCode.multiints         1  avgt    5   699.051 ±  8.135  ns/op
    ArraysHashCode.multishorts       1  avgt    5  4139.300 ± 10.633  ns/op
    ArraysHashCode.shorts            1  avgt    5  7844.019 ± 26.071  ns/op

After:

    Benchmark                   (size)  Mode  Cnt     Score    Error  Units
    ArraysHashCode.bytes             1  avgt    5  1193.208 ±  1.965  ns/op
    ArraysHashCode.chars             1  avgt    5  1193.311 ±  5.941  ns/op
    ArraysHashCode.ints              1  avgt    5  1132.592 ± 10.410  ns/op
    ArraysHashCode.multibytes        1  avgt    5   657.343 ± 25.343  ns/op
    ArraysHashCode.multichars        1  avgt    5   672.668 ±  5.229  ns/op
    ArraysHashCode.multiints         1  avgt    5   697.143 ±  3.929  ns/op
    ArraysHashCode.multishorts       1  avgt    5   666.738 ± 12.236  ns/op
    ArraysHashCode.shorts            1  avgt    5  1193.563 ±  5.449  ns/op

- PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Tue, 20 Dec 2022 19:52:34 GMT, Claes Redestad wrote: >> src/java.base/share/classes/java/lang/StringUTF16.java line 418: >> >>> 416: return 0; >>> 417: } else { >>> 418: return ArraysSupport.vectorizedHashCode(value, >>> ArraysSupport.UTF16); >> >> Special case for 1 missing here. > > Intentionally left out. Array length is always even for `UTF16` arrays, but > we could add a case for `2` that'd return `getChar(bytes, 0)` but I didn't > see much of a win when I tested this. I do see a 1.5x gain with this special case added:

    return switch (value.length) {
        case 0  -> 0;
        case 2  -> getChar(value, 0);
        default -> ArraysSupport.vectorizedHashCode(value, ArraysSupport.UTF16);
    };

before: 0.987 ns/op
after: 0.640 ns/op

- PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Tue, 20 Dec 2022 21:11:18 GMT, Claes Redestad wrote: >>> How far off is this ...? >> >> Back then it looked way too constrained (tight constraints on code shapes). >> But I considered it as a generally applicable optimization. >> >>> ... do you think it'll be able to match the efficiency we see here with a >>> memoized coefficient table etc? >> >> Yes, it is able to build the constant table at runtime when folding >> multiplications of constant coefficients produced during loop unrolling and >> then packing scalars into a constant vector. >> >> Moreover, briefly looking at the code shape, the vectorizer would produce a >> more optimal loop shape (pre-loop would align vector accesses and would use >> 512-bit vectors when available; vector post-loop could help as well). > > Passing the constant node through as an input as suggested by @iwanowww and > @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a > significant chunk of code and a little bit of complexity from the proposed > patch. @cl4es Thanks for passing the constant node through, the code looks much cleaner now. The attached patch should handle the signed bytes/shorts as well. Please take a look. [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch) - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov wrote: >>> Also, I'd like to note that C2 auto-vectorization support is not too far >>> away from being able to optimize hash code computations. At some point, I >>> was able to achieve some promising results with modest tweaking of >>> SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt >>> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ >> >> Intriguing. How far off is this - and do you think it'll be able to match >> the efficiency we see here with a memoized coefficient table etc? >> >> If we turn this intrinsic into a stub we might also be able to reuse the >> optimization in other places, including from within the VM (calculating >> String hashCodes happen in a couple of places, including String >> deduplication). So I think there are still a few compelling reasons to go >> the manual route and continue on this path. > >> How far off is this ...? > > Back then it looked way too constrained (tight constraints on code shapes). > But I considered it as a generally applicable optimization. > >> ... do you think it'll be able to match the efficiency we see here with a >> memoized coefficient table etc? > > Yes, it is able to build the constant table at runtime when folding > multiplications of constant coefficients produced during loop unrolling and > then packing scalars into a constant vector. > > Moreover, briefly looking at the code shape, the vectorizer would produce a > more optimal loop shape (pre-loop would align vector accesses and would use > 512-bit vectors when available; vector post-loop could help as well). Passing the constant node through as an input as suggested by @iwanowww and @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a significant chunk of code and a little bit of complexity from the proposed patch. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 16 Dec 2022 23:00:53 GMT, Sandhya Viswanathan wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/stubRoutines_x86.cpp line 230: > >> 228: #endif // _LP64 >> 229: >> 230: jint StubRoutines::x86::_arrays_hashcode_powers_of_31[] = > > This should be declared only for LP64. Hmm, I guess same goes for all the new `arrays_hashcode` methods in `c2_MacroAssembler_x86`, since we only wire this up properly on 64-bit. I'll make a pass to put `_LP64` guards around all new methods. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 16 Dec 2022 22:58:23 GMT, Sandhya Viswanathan wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1671: > >> 1669: } >> 1670: if (UseAVX >= 2) { >> 1671: FLAG_SET_ERGO_IF_DEFAULT(UseVectorizedHashCodeIntrinsic, true); > > This could be just FLAG_SET_DEFAULT instead of FLAG_SET_ERGO_IF_DEFAULT. Right, it seems HW-dependent intrinsics generally don't mark that they've been enabled ergonomically; rather they're just turned on by default when support is available. > src/java.base/share/classes/java/lang/StringUTF16.java line 418: > >> 416: return 0; >> 417: } else { >> 418: return ArraysSupport.vectorizedHashCode(value, >> ArraysSupport.UTF16); > > Special case for 1 missing here. Intentionally left out. Array length is always even for `UTF16` arrays. We could add a case for `2` that'd return `getChar(bytes, 0)`, but I didn't see much of a win when I tested this. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sun, 13 Nov 2022 20:57:44 GMT, Claes Redestad wrote: >> src/hotspot/cpu/x86/x86_64.ad line 12073: >> >>> 12071: legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, >>> rRegI tmp3, rFlagsReg cr) >>> 12072: %{ >>> 12073: predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == >>> VectorizedHashCodeNode::LATIN1); >> >> If you represent `VectorizedHashCodeNode::mode()` as an input, it would >> allow to abstract over supported modes and come up with a single AD >> instruction. Take a look at `VectorMaskCmp` for an example (not a perfect >> one though since it has both _predicate member and constant input which is >> redundant). > > Thanks for the pointer, I'll check it out! I agree with Vladimir, adding mode as another input will help. Please take a look at the RoundDoubleModeNode. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Wed, 16 Nov 2022 18:18:55 GMT, Claes Redestad wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > I'm getting pulled into other tasks and would request for this to be either > accepted as-is, rejected or picked up by someone else to rewrite it to > something that can be accepted. > > Obviously I'm biased towards acceptance: While imperfect, it provides > improved testing - both functional and performance-wise - and establishes a > significantly improved benchmark for more future-proof solutions to beat. > There are many ways to iteratively improve upon this solution, some of which > would even simplify the implementation. But in the face of upcoming changes > that might allow C2 to optimize these kinds of loops without intrinsic > support I am not sure spending more time on perfecting the current patch is > worth our while. > > Rejecting it might be the reasonable thing to do, too, especially if the C2 > loop optimizations @iwanowww points out might be coming around sooner rather > than later. Even if that's not coming soon, the PR at hand adds a chunk of > complexity for the compiler team to maintain. @cl4es @iwanowww is that change still good to go forward? What else would you like to see for it to be merged? Thanks! - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad wrote: > How far off is this ...? Back then it looked way too constrained (tight constraints on code shapes). But I considered it as a generally applicable optimization. > ... do you think it'll be able to match the efficiency we see here with a > memoized coefficient table etc? Yes, it is able to build the constant table at runtime when folding multiplications of constant coefficients produced during loop unrolling and then packing scalars into a constant vector. Moreover, briefly looking at the code shape, the vectorizer would produce a more optimal loop shape (pre-loop would align vector accesses and would use 512-bit vectors when available; vector post-loop could help as well). - PR: https://git.openjdk.org/jdk/pull/10847
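To illustrate what that folding amounts to (a hypothetical 4-way unroll for the standard 31-based hash; not code from the patch), once the loop body is replicated, each element's multiplier becomes a compile-time constant, and it is those constants that can then be packed into a coefficient vector:

```java
// One unrolled step (illustration only): four scalar iterations of
// h = 31 * h + a[i] collapse into a single expression whose coefficients
// 31^3, 31^2, 31, 1 are constants, plus h * 31^4 for the carried hash.
static int step4(int h, int[] a, int i) {
    return h * 923521          // 31^4
         + a[i]     * 29791    // 31^3
         + a[i + 1] * 961      // 31^2
         + a[i + 2] * 31
         + a[i + 3];
}
```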
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sun, 13 Nov 2022 19:50:46 GMT, Claes Redestad wrote: > ... several challenges were brought up to the table, including how to deal > with all the different contingencies that might be the result of a safepoint, > including deoptimization. FTR if the intrinsic is represented as a stand-alone stub, there's no need to care about deoptimization. (In such cases, deopts happen on return from the stub.) It wouldn't be allowed to be a leaf call anymore, but a safepoint check and an OOP map would do the job. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sun, 13 Nov 2022 21:01:21 GMT, Claes Redestad wrote: >> src/hotspot/share/opto/intrinsicnode.hpp line 175: >> >>> 173: // as well as adjusting for special treatment of various encoding of >>> String >>> 174: // arrays. Must correspond to declared constants in >>> jdk.internal.util.ArraysSupport >>> 175: typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, >>> SHORT = 4, INT = 5 } HashMode; >> >> I question the need for `LATIN1` and `UTF16` modes. If you lift some of >> input adjustments (initial value and input size) into JDK, it becomes >> indistinguishable from `BYTE`/`CHAR`. Then you can reuse existing constants >> for basic types. > > UTF16 can easily be replaced with CHAR by lifting up the shift as you say, > but LATIN1 needs to be distinguished from BYTE since the former needs > unsigned semantics. Modeling in a signed/unsigned input is possible, but I > figured we might as well call it UNSIGNED_BYTE and decouple it logically from > String::LATIN1. FTR `T_BOOLEAN` effectively represents unsigned byte. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad wrote: > Also, I'd like to note that C2 auto-vectorization support is not too far away > from being able to optimize hash code computations. At some point, I was able > to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ That would be extremely helpful not just for this case but for many other cases that today require the Vector API or handrolled intrinsics. For cases that would be great to support, a good guide is the [gcc autovectorization support](https://gcc.gnu.org/projects/tree-ssa/vectorization.html) given they use SLP as well. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify >> polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` >> method. To make this work I've harmonized how they are invoked so that >> there's less special handling and checks in the intrinsic. Mainly do the >> null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the >> factor and start values which are now hard-coded (always 31, and a start >> value of either one for `Arrays` or zero for `String`). It seems somewhat >> premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all >> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a >> few trivial method calls which increase the call stack depth, so surprises >> cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look >> like this: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.199 ± 0.017 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 6.933 ± 0.049 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt529.935 ± 0.221 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 1596.982 ± 7.020 >> ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError >> Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 2.200 ± 0.013 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt5 9.424 ± 0.122 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt590.541 ± 0.512 >> ns/op >> StringHashCode.Algorithm.defaultLatin1 1 avgt5 9425.321 ± 67.630 >> ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as >> ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt587.218 ± 0.595 ns/op >> ArraysHashCode.bytes1 avgt5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars1 avgt5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt530.216 ± 0.134 ns/op >> ArraysHashCode.chars1 avgt5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints10 avgt5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt529.412 ± 0.268 ns/op >> ArraysHashCode.ints 1 avgt5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt587.095 ± 0.417 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt ScoreError Units >> ArraysHashCode.bytes1 avgt5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt590.315 ± 0.655 ns/op >> ArraysHashCode.bytes1 avgt5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars1 avgt5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt590.074 ± 0.387 ns/op >> ArraysHashCode.chars1 avgt5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints10 avgt5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt589.405 ± 0.593 ns/op >> ArraysHashCode.ints 1 avgt5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt590.399 ± 0.292 ns/op >> ArraysHashCode.shorts 1 avgt5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and >> faster on large inputs for `char` and `int` (the ones currently vectorized). >> I aim to fix `byte` and `short` cases before integrating, though it might be >> acceptable to hand that off as
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 02:08:19 GMT, Vladimir Ivanov wrote: > Also, I'd like to note that C2 auto-vectorization support is not too far away > from being able to optimize hash code computations. At some point, I was able > to achieve some promising results with modest tweaking of SuperWord pass: > https://github.com/iwanowww/jdk/blob/superword/notes.txt > http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ Intriguing. How far off is this - and do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? If we turn this intrinsic into a stub we might also be able to reuse the optimization in other places, including from within the VM (calculating String hashCodes happens in a couple of places, including String deduplication). So I think there are still a few compelling reasons to go the manual route and continue on this path. - PR: https://git.openjdk.org/jdk/pull/10847
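Extending the unrolled step illustrated earlier, here is a rough scalar model of what a memoized coefficient table buys (an illustration under the assumption of the standard 31-based polynomial hash, not the actual stub code): with precomputed powers of 31, eight elements are folded per iteration, and the eight independent multiplications map naturally onto one vector multiply plus a reduction.

```java
// Scalar model of the coefficient-table approach (illustrative only).
// Java int arithmetic wraps mod 2^32, so this matches the plain
// h = 31 * h + a[i] loop exactly.
static final int[] POW31 = new int[9];
static {
    POW31[0] = 1;
    for (int i = 1; i < POW31.length; i++) POW31[i] = 31 * POW31[i - 1];
}

static int polyHash(int initialValue, int[] a) {
    int h = initialValue, i = 0;
    for (; i + 8 <= a.length; i += 8) {          // main loop: 8 elements per iteration
        h = h * POW31[8]
          + a[i]     * POW31[7] + a[i + 1] * POW31[6]
          + a[i + 2] * POW31[5] + a[i + 3] * POW31[4]
          + a[i + 4] * POW31[3] + a[i + 5] * POW31[2]
          + a[i + 6] * POW31[1] + a[i + 7];
    }
    for (; i < a.length; i++) {                  // tail loop for the remainder
        h = 31 * h + a[i];
    }
    return h;
}
```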
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 01:35:39 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/java.base/share/classes/jdk/internal/util/ArraysSupport.java line 185: > >> 183: */ >> 184: @IntrinsicCandidate >> 185: public static int vectorizedHashCode(Object array, byte mode) { > > The intrinsic can be generalized by: > 1. expanding `array` input into `base`, `offset`, and `length`. It will make > it applicable to any type of data source (on-heap/off-heap `ByteBuffer`s, > `MemorySegment`s). > 2. passing initial value as a parameter. > > Basically, hash code computation can be represented as a reduction: > `reduce(initial_val, (acc, v) -> 31 * acc + v, data)`. You hardcode the > operation, but can make the rest variable. > > (Even the operation can be slightly generalized if you make 31 variable and > then precompute the table at runtime. But right now I don't see much value in > investing into that.) I've been thinking of such a generalization as a possible follow-up: get the base operation on entire arrays in, then generalize carefully while ensuring that it doesn't add too much complexity, introduce unforeseen overheads, etc. - PR: https://git.openjdk.org/jdk/pull/10847
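A minimal plain-Java sketch of the reduction view mentioned above (illustrative; `IntBinaryOperator` is only used here to make the combining step a parameter). Note the step is not associative, so it has to be applied as a sequential left fold unless it is rewritten with precomputed powers of 31, as the intrinsic does:

```java
import java.util.function.IntBinaryOperator;

// Left fold over the data with a pluggable combining step.
static int reduce(int initialValue, IntBinaryOperator step, int[] data) {
    int acc = initialValue;
    for (int v : data) {
        acc = step.applyAsInt(acc, v);
    }
    return acc;
}

// Usage: reduce(1, (acc, v) -> 31 * acc + v, data)
// reproduces Arrays.hashCode(int[]) for non-null arrays.
```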
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 01:28:51 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/share/opto/intrinsicnode.hpp line 175: > >> 173: // as well as adjusting for special treatment of various encoding of >> String >> 174: // arrays. Must correspond to declared constants in >> jdk.internal.util.ArraysSupport >> 175: typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, >> SHORT = 4, INT = 5 } HashMode; > > I question the need for `LATIN1` and `UTF16` modes. If you lift some of input > adjustments (initial value and input size) into JDK, it becomes > indistinguishable from `BYTE`/`CHAR`. Then you can reuse existing constants > for basic types. UTF16 can easily be replaced with CHAR by lifting up the shift as you say, but LATIN1 needs to be distinguished from BYTE since the former needs unsigned semantics. Modeling in a signed/unsigned input is possible, but I figured we might as well call it UNSIGNED_BYTE and decouple it logically from String::LATIN1. - PR: https://git.openjdk.org/jdk/pull/10847
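To spell out why the unsigned-byte distinction matters (a tiny illustration, not code from the patch): a Latin-1 code point at or above 0x80 is stored as a negative Java `byte`, so a signed fold and an unsigned fold diverge on the very first such element.

```java
// Signed vs. unsigned byte folding (illustrative).
static void demo() {
    byte b = (byte) 0xE5;                    // 'å' as stored in a LATIN1 byte[]
    int h = 17;                              // some running hash value
    int signedStep   = 31 * h + b;           // folds in -27 for the element
    int unsignedStep = 31 * h + (b & 0xff);  // folds in 229, the Latin-1 code point
    System.out.println(signedStep + " vs " + unsignedStep);
}
```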
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 01:10:50 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/x86_64.ad line 12081: >> >>> 12079: format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result // KILL >>> all" %} >>> 12080: ins_encode %{ >>> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, >>> $result$$Register, >> >> What's the motivation to keep the stub code inlined instead of calling into >> a stand-alone pre-generated version of the stub? > > Also, switching to stand-alone stubs would enable us to compose a generic > stub version (as we do in `StubGenerator::generate_generic_copy()` for > arraycopy stubs). But it would be even better to do the dispatching on JDK > side and always pass a constant into the intrinsic. There is no single reason this code evolved the way it did. @luhenry worked on it initially and was guided towards intrinsifying what was originally a JDK-level unrolling. Then I took over and have tried to find a path of least resistance from there. @luhenry and I have discussed rewriting part or all of this as a stub, for various reasons. I've been scoping that out, but with no experience writing stub versions I figured perhaps this could be done in a follow-up. If you think there's a compelling enough reason to rewrite this as a stub up front I can try and find the time to do so. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 01:06:27 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/x86_64.ad line 12073: > >> 12071: legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, >> rRegI tmp3, rFlagsReg cr) >> 12072: %{ >> 12073: predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == >> VectorizedHashCodeNode::LATIN1); > > If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow > to abstract over supported modes and come up with a single AD instruction. > Take a look at `VectorMaskCmp` for an example (not a perfect one though since > it has both _predicate member and constant input which is redundant). Thanks for the pointer, I'll check it out! - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 15:27:09 GMT, Piotr Tarsa wrote: > Out of curiosity: how does this intrinsic affect time-to-safepoint? Does it > matter? I don't see any safepoint poll, but then I don't precisely know how > safepoints work, so I could be missing something. Theoretically, with 2^31 > elements count limit in Java, the whole computation is always a fraction of a > second, but maybe it would matter with e.g. ZGC, which could ask for a > safepoint while the thread is hashing an array with 2 billion ints. This intrinsic - like several others before it - does not add safepoint checks. There's at least one RFE filed to address this deficiency, and hopefully we can come up with a shared strategy to interleave safepoint checks in the various intrinsics that operate over Strings and arrays: https://bugs.openjdk.org/browse/JDK-8233300 When I brought this up in an internal discussion with @TobiHartmann and @fisk last week, several challenges were brought to the table, including how to deal with all the different contingencies that might result from a safepoint, including deoptimization. I think enhancing these intrinsics to poll for safepoints is important to tackle tail-end latencies for extremely latency-sensitive applications. In the meantime those applications could (should?) turn off such intrinsics, avoid huge arrays altogether, or both. - PR: https://git.openjdk.org/jdk/pull/10847
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
On Sat, 12 Nov 2022 00:55:56 GMT, Vladimir Ivanov wrote:

>> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Missing & 0xff in StringLatin1::hashCode
>
> src/hotspot/cpu/x86/x86_64.ad line 12081:
>
>> 12079: format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result   // KILL all" %}
>> 12080: ins_encode %{
>> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register,
>
> What's the motivation to keep the stub code inlined instead of calling into a stand-alone pre-generated version of the stub?

Also, switching to stand-alone stubs would enable us to compose a generic stub version (as we do in `StubGenerator::generate_generic_copy()` for arraycopy stubs). But it would be even better to do the dispatching on JDK side and always pass a constant into the intrinsic.

- PR: https://git.openjdk.org/jdk/pull/10847
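To make the last point concrete, here is a rough, hypothetical sketch of what "doing the dispatching on the JDK side and passing a constant into the intrinsic" could look like. The method and constant names below (`polyHashCode`, the `T_*` tags) are illustrative stand-ins, not the actual API in this PR; the point is only that each typed caller supplies a compile-time-constant element-type tag, so the intrinsic (or the stub it selects) never has to branch on the element type at runtime.

```java
// Hypothetical sketch only -- names and signatures are illustrative, not the real API.
final class PolyHash {
    // Constant element-type tags; each call site passes one of these as a literal.
    static final int T_BYTE = 0, T_CHAR = 1, T_SHORT = 2, T_INT = 3;

    // Candidate entry point for an intrinsic: when basicType is a constant,
    // a JIT can fold the switch away and specialize (or pick a per-type stub).
    // This Java fallback also serves as the interpreter/non-intrinsic path.
    static int polyHashCode(Object array, int fromIndex, int length,
                            int initialValue, int basicType) {
        int h = initialValue;
        switch (basicType) {
            case T_BYTE: {
                byte[] a = (byte[]) array;
                for (int i = fromIndex; i < fromIndex + length; i++) h = 31 * h + a[i];
                return h;
            }
            case T_CHAR: {
                char[] a = (char[]) array;
                for (int i = fromIndex; i < fromIndex + length; i++) h = 31 * h + a[i];
                return h;
            }
            case T_SHORT: {
                short[] a = (short[]) array;
                for (int i = fromIndex; i < fromIndex + length; i++) h = 31 * h + a[i];
                return h;
            }
            case T_INT: {
                int[] a = (int[]) array;
                for (int i = fromIndex; i < fromIndex + length; i++) h = 31 * h + a[i];
                return h;
            }
            default:
                throw new IllegalArgumentException("unexpected type tag: " + basicType);
        }
    }

    // Typed callers do the dispatch: the type tag is a constant at each call site,
    // and the null check stays outside the candidate method.
    static int hashCode(int[] a)  { return a == null ? 0 : polyHashCode(a, 0, a.length, 1, T_INT); }
    static int hashCode(char[] a) { return a == null ? 0 : polyHashCode(a, 0, a.length, 1, T_CHAR); }
}
```

A stand-alone stub could then be generated once per element type, or as a single generic stub, and a constant tag like this is what allows the per-type variant to be chosen at compile time rather than inside the stub.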
Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>
> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases.
>
> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>
> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>
> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5  1596.982 ±  7.020  ns/op
>
> Baseline:
>
> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5  9425.321 ± 67.630  ns/op
>
> I.e. no measurable overhead compared to baseline even for `size == 1`.
>
> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>
> Benchmark for `Arrays.hashCode`:
>
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
> ArraysHashCode.bytes        1  avgt    5  9419.591 ± 38.308  ns/op
> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
> ArraysHashCode.chars        1  avgt    5  1601.629 ±  6.418  ns/op
> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
> ArraysHashCode.ints         1  avgt    5  1610.578 ±  7.785  ns/op
> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
> ArraysHashCode.shorts       1  avgt    5  9420.617 ± 50.089  ns/op
>
> Baseline:
>
> Benchmark              (size)  Mode  Cnt     Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
> ArraysHashCode.bytes        1  avgt    5  9422.094 ± 62.402  ns/op
> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
> ArraysHashCode.chars        1  avgt    5  9420.474 ± 41.619  ns/op
> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
> ArraysHashCode.ints         1  avgt    5  9426.539 ± 51.308  ns/op
> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
> ArraysHashCode.shorts       1  avgt    5  9420.171 ± 44.474  ns/op
>
> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.

Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:

  Missing & 0xff in StringLatin1::hashCode

- PR: https://git.openjdk.org/jdk/pull/10847
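For readers unfamiliar with the transformation the title refers to, the following is a minimal sketch (not the exact loop shape produced by this PR) of hand-unrolling the polynomial hash `h = 31*h + a[i]` by a factor of four using constant powers of 31. The `byte` and `short` cases additionally need the appropriate widening (e.g. the `& 0xff` mask for unsigned Latin-1 bytes referenced in the commit above).

```java
// Minimal illustration of hand-unrolling the polynomial hash loop by four.
// Start value 1 matches Arrays.hashCode; String.hashCode would start at 0.
static int unrolledHashCode(int[] a) {
    int h = 1;
    int i = 0;
    // 31*(31*(31*(31*h + a[i]) + a[i+1]) + a[i+2]) + a[i+3]
    //   == 31^4*h + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3],
    // so each unrolled step multiplies by fixed powers of 31. The independent
    // multiply-adds are what a vectorizer can map onto a constant coefficient
    // vector.
    for (; i + 3 < a.length; i += 4) {
        h = 923521 * h          // 31^4
          +  29791 * a[i]       // 31^3
          +    961 * a[i + 1]   // 31^2
          +     31 * a[i + 2]
          +          a[i + 3];
    }
    // Scalar tail for the remaining 0-3 elements.
    for (; i < a.length; i++) {
        h = 31 * h + a[i];
    }
    return h;
}
```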