On Mon, 31 Oct 2022 21:48:37 GMT, Claes Redestad <redes...@openjdk.org> wrote:
>> Continuing the work initiated by @luhenry to unroll and then intrinsify
>> polynomial hash loops.
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate`
>> method. To make this work I've harmonized how they are invoked so that
>> there's less special handling and checks in the intrinsic. Mainly do the
>> null-check outside of the intrinsic for `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the
>> factor and start values, which are now hard-coded (always 31, and a start
>> value of either one for `Arrays` or zero for `String`). It seems somewhat
>> premature to parameterize this up front.
>>
>> The current implementation is performance neutral on microbenchmarks on all
>> tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a
>> few trivial method calls which increase the call stack depth, so surprises
>> cannot be ruled out on complex workloads.
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look
>> like this:
>>
>> Benchmark                               (size)  Mode  Cnt     Score   Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ± 0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ± 0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ± 0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ± 7.020  ns/op
>>
>> Baseline:
>>
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>>
>> I.e. no measurable overhead compared to baseline, even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as
>> ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>>
>> Baseline:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>>
>> As we can see, the `Arrays` intrinsics are faster for small inputs, and
>> faster on large inputs for `char` and `int` (the ones currently vectorized).
>> I aim to fix the `byte` and `short` cases before integrating, though it
>> might be acceptable to hand those off as follow-up enhancements to not
>> further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Change scalar unroll to 2 element stride, minding dependency chain

A stride of 2 allows the small element cases to perform a bit better, while
also performing better than before on longer arrays for the `byte` and `short`
cases that don't get any benefit from vectorization:

Benchmark              (size)  Mode  Cnt     Score     Error  Units
ArraysHashCode.bytes        1  avgt    5     1.414 ±   0.005  ns/op
ArraysHashCode.bytes       10  avgt    5     6.908 ±   0.020  ns/op
ArraysHashCode.bytes      100  avgt    5    73.666 ±   0.390  ns/op
ArraysHashCode.bytes    10000  avgt    5  7846.994 ±  53.628  ns/op
ArraysHashCode.chars        1  avgt    5     1.414 ±   0.007  ns/op
ArraysHashCode.chars       10  avgt    5     7.229 ±   0.044  ns/op
ArraysHashCode.chars      100  avgt    5    30.718 ±   0.229  ns/op
ArraysHashCode.chars    10000  avgt    5  1621.463 ± 116.286  ns/op
ArraysHashCode.ints         1  avgt    5     1.414 ±   0.008  ns/op
ArraysHashCode.ints        10  avgt    5     7.540 ±   0.042  ns/op
ArraysHashCode.ints       100  avgt    5    29.429 ±   0.121  ns/op
ArraysHashCode.ints     10000  avgt    5  1600.855 ±   9.274  ns/op
ArraysHashCode.shorts       1  avgt    5     1.414 ±   0.010  ns/op
ArraysHashCode.shorts      10  avgt    5     6.914 ±   0.045  ns/op
ArraysHashCode.shorts     100  avgt    5    73.684 ±   0.501  ns/op
ArraysHashCode.shorts   10000  avgt    5  7846.829 ±  49.984  ns/op

I've also made some changes to improve the String cases, which can avoid the
first `coeff * h` multiplication on the first pass. This gets the size 1
latin1 case down to 1.1 ns/op without penalizing the empty case.
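For readers following along, the two ideas above, the stride-2 scalar unroll and skipping the initial multiplication when the start value is zero, can be sketched in plain Java roughly as follows. The class and method names are illustrative only; this is a sketch of the technique, not the actual patch:

```java
// Illustrative sketch of the scalar optimizations discussed above; the
// names PolyHashSketch, hashBytesUnrolled2 and latin1Hash are made up.
public class PolyHashSketch {

    // Scalar stride-2 unroll of h = 31*h + a[i]. Folding two steps gives
    // h = 31*31*h + 31*a[i] + a[i+1]; the 31*a[i] term does not depend on
    // h, so the loop-carried dependency chain shrinks to one multiply-add
    // per pair of elements instead of two.
    static int hashBytesUnrolled2(byte[] a) {
        int h = 1;                       // Arrays.hashCode starts at 1
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            h = 31 * 31 * h + 31 * a[i] + a[i + 1];
        }
        if (i < a.length) {              // odd-length tail element
            h = 31 * h + a[i];
        }
        return h;
    }

    // First-pass trick for String (start value 0): since 31 * 0 contributes
    // nothing, a non-empty input can start with h = first element, skipping
    // one multiplication without changing the result.
    static int latin1Hash(byte[] value) {
        if (value.length == 0) {
            return 0;                    // String.hashCode starts at 0
        }
        int h = value[0] & 0xff;         // Latin-1 bytes hash as unsigned
        for (int i = 1; i < value.length; i++) {
            h = 31 * h + (value[i] & 0xff);
        }
        return h;
    }
}
```

Both variants compute the same values as the straightforward loops: `hashBytesUnrolled2` agrees with `java.util.Arrays.hashCode(byte[])`, and `latin1Hash` agrees with `String.hashCode()` for Latin-1 content, since the unroll and the skipped first multiply are pure algebraic rewrites.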
We're now improving over the baseline on almost all* tested sizes:

Benchmark                               (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.defaultLatin1       0  avgt    5   0.946 ± 0.005  ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt    5   1.108 ± 0.003  ns/op
StringHashCode.Algorithm.defaultLatin1       2  avgt    5   2.042 ± 0.005  ns/op
StringHashCode.Algorithm.defaultLatin1      31  avgt    5  18.636 ± 0.286  ns/op
StringHashCode.Algorithm.defaultLatin1      32  avgt    5  15.938 ± 1.086  ns/op
StringHashCode.Algorithm.defaultUTF16        0  avgt    5   1.257 ± 0.004  ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt    5   2.198 ± 0.005  ns/op
StringHashCode.Algorithm.defaultUTF16        2  avgt    5   2.559 ± 0.011  ns/op
StringHashCode.Algorithm.defaultUTF16       31  avgt    5  15.754 ± 0.036  ns/op
StringHashCode.Algorithm.defaultUTF16       32  avgt    5  16.616 ± 0.042  ns/op

Baseline:

Benchmark                               (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.defaultLatin1       0  avgt    5   0.942 ± 0.005  ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt    5   1.991 ± 0.013  ns/op
StringHashCode.Algorithm.defaultLatin1       2  avgt    5   2.831 ± 0.021  ns/op
StringHashCode.Algorithm.defaultLatin1      31  avgt    5  25.042 ± 0.112  ns/op
StringHashCode.Algorithm.defaultLatin1      32  avgt    5  25.857 ± 0.133  ns/op
StringHashCode.Algorithm.defaultUTF16        0  avgt    5   0.789 ± 0.006  ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt    5   3.459 ± 0.007  ns/op
StringHashCode.Algorithm.defaultUTF16        2  avgt    5   4.400 ± 0.010  ns/op
StringHashCode.Algorithm.defaultUTF16       31  avgt    5  25.721 ± 0.067  ns/op
StringHashCode.Algorithm.defaultUTF16       32  avgt    5  27.162 ± 0.093  ns/op

*) There's a negligible regression on `defaultUTF16` for size = 0, due to
moving the length shift up earlier. This can only happen when running with
CompactStrings disabled, and even then the change significantly helps sizes
1-31, which should more than make up for a small cost increase when hashing
empty strings.

-------------

PR: https://git.openjdk.org/jdk/pull/10847