On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhe...@openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still
>> represents a significant CPU usage for applications handling lots of text.
>>
>> Even though it would be generally better to do it through an enhancement to
>> the autovectorizer, the complexity of doing it by hand is trivial and the
>> gain is sizable (2x speedup) even without the Vector API. The algorithm has
>> been proposed by Richard Startin and Paul Sandoz [1].
>>
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`:
>>
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>>
>> At Datadog, we handle a great amount of text (through logs management for
>> example), and hashing String represents a large part of our CPU usage. It's
>> very unlikely that we are the only ones, as String.hashCode is such a core
>> feature of the JVM-based languages, with its use in HashMap for example.
>> Having even only a 2x speedup would allow us to save thousands of CPU cores
>> per month and improve correspondingly the energy/carbon impact.
>>
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Add UTF-16 benchmarks

An awkward effect of this implementation is that it perturbs results on small
Strings a bit: it adds a branch in the trivial case, but also regresses on
certain lengths (try size=7). The added complexity seems to be no issue for
JITs in these microbenchmarks, but I worry that the increased code size might
play tricks with inlining heuristics in real applications.

After chatting a bit with @richardstartin about the observation that
preventing a strength reduction on the constant 31 is part of the
improvement, I devised an experiment which simply makes the 31 non-constant
so as to disable the strength reduction:

    private static int base = 31;

    @Benchmark
    public int scalarLatin1_NoStrengthReduction() {
        int h = 0;
        int i = 0, len = latin1.length;
        for (; i < len; i++) {
            h = base * h + (latin1[i] & 0xff);
        }
        return h;
    }

Interestingly, the results of that land between the baseline and the manually
unrolled versions on large inputs, while avoiding most of the irregularities
the unrolled versions show on small inputs:

Benchmark                                                  (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.scalarLatin1                           1  avgt    3   2.910 ± 0.068  ns/op
StringHashCode.Algorithm.scalarLatin1                           7  avgt    3   6.530 ± 0.065  ns/op
StringHashCode.Algorithm.scalarLatin1                           8  avgt    3   7.472 ± 0.034  ns/op
StringHashCode.Algorithm.scalarLatin1                          15  avgt    3  12.750 ± 0.028  ns/op
StringHashCode.Algorithm.scalarLatin1                         100  avgt    3  99.190 ± 0.618  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 1  avgt    3   3.119 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 7  avgt    3  11.556 ± 4.690  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 8  avgt    3   7.740 ± 0.005  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                15  avgt    3  13.030 ± 0.124  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16               100  avgt    3  46.470 ± 0.496  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  1  avgt    3   3.123 ± 0.057  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  7  avgt    3  11.380 ± 0.085  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  8  avgt    3   5.849 ± 0.583  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                 15  avgt    3  12.312 ± 0.025  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                100  avgt    3  45.751 ± 0.146  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       1  avgt    3   3.173 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       7  avgt    3   5.229 ± 0.455  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       8  avgt    3   5.679 ± 0.012  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction      15  avgt    3   8.731 ± 0.103  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction     100  avgt    3  70.954 ± 3.386  ns/op

I wonder if this might be a safer play while we investigate intrinsification
and other possible enhancements?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700
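For readers following along: the strength reduction in question is the JIT
rewriting a multiply by the constant 31 into a shift and a subtract, since
31 * h == (h << 5) - h even under int overflow. A minimal sketch of the two
equivalent loop bodies (class and method names are mine, not from the PR):

    // Illustration only: both loops compute the same polynomial hash.
    public class StrengthReduction {

        // Hash over Latin-1 bytes written with the multiply, matching
        // the shape of String.hashCode.
        static int hashMul(byte[] bytes) {
            int h = 0;
            for (byte b : bytes) {
                h = 31 * h + (b & 0xff);
            }
            return h;
        }

        // The strength-reduced form the JIT emits for the constant 31:
        // 31 * h == (h << 5) - h, also modulo 2^32.
        static int hashShift(byte[] bytes) {
            int h = 0;
            for (byte b : bytes) {
                h = (h << 5) - h + (b & 0xff);
            }
            return h;
        }

        public static void main(String[] args) {
            byte[] data = "hello".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
            // For pure-ASCII input both forms agree with String.hashCode.
            if (hashMul(data) != hashShift(data)) throw new AssertionError();
            if (hashMul(data) != "hello".hashCode()) throw new AssertionError();
            System.out.println("ok");
        }
    }

Making the 31 non-constant, as in the benchmark above, keeps C2 from applying
that rewrite, which is what the `_NoStrengthReduction` variant measures.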