RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops

Claes Redestad Fri, 28 Oct 2022 13:49:06 -0700

Continuing the work initiated by @luhenry to unroll and then intrinsify 
polynomial hash loops.


I've rewired the library changes to route via a single `@IntrinsicCandidate` 
method. To make this work I've harmonized how they are invoked so that there's 
less special handling and checks in the intrinsic. Mainly do the null-check 
outside of the intrinsic for `Arrays.hashCode` cases.

Having a centralized entry point means it'll be easier to parameterize the 
factor and start values which are now hard-coded (always 31, and a start value 
of either one for `Arrays` or zero for `String`). It seems somewhat premature 
to parameterize this up front.

The current implementation is performance neutral on microbenchmarks on all 
tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a 
few trivial method calls which increase the call stack depth, so surprises 
cannot be ruled out on complex workloads.

With the most recent fixes the x64 intrinsic results on my workstation look 
like this:

Benchmark                               (size)  Mode  Cnt     Score    Error  
Units
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  
ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  
ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  
ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  
ns/op

Baseline:

Benchmark                               (size)  Mode  Cnt     Score    Error  
Units
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  
ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  
ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  
ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  
ns/op

I.e. no measurable overhead compared to baseline even for `size == 1`.

The vectorized code now nominally works for all unsigned cases as well as ints, 
though more testing would be good.

Benchmark for `Arrays.hashCode`:

Benchmark              (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op

Baseline:

Benchmark              (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op


As we can see the `Arrays` intrinsics are faster for small inputs, and faster 
on large inputs for `char` and `int` (the ones currently vectorized). I aim to 
fix `byte` and `short` cases before integrating, though it might be acceptable 
to hand that off as follow-up enhancements to not further delay integration of 
this enhancement.

-------------

Commit messages:
 - ws
 - Add ArraysHashCode microbenchmarks
 - Fixed vector loops for int and char arrays
 - Split up Arrays/HashCode tests
 - Fixes, optimized short inputs, temporarily disabled vector loop for 
Arrays.hashCode cases, added and improved tests
 - typo
 - Add Arrays.hashCode tests, enable intrinsic by default on x86
 - Correct start values for array hashCode methods
 - Merge branch 'master' into 8282664-polyhash
 - Fold identical ops; only add coef expansion for Arrays cases
 - ... and 28 more: https://git.openjdk.org/jdk/compare/303548ba...22fec5f0

Changes: https://git.openjdk.org/jdk/pull/10847/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8282664
  Stats: 1129 lines in 32 files changed: 1071 ins; 32 del; 26 mod
  Patch: https://git.openjdk.org/jdk/pull/10847.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847

PR: https://git.openjdk.org/jdk/pull/10847

RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops

Reply via email to