On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhe...@openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still
>> represents a significant CPU usage for applications handling lots of text.
>>
>> Even though it would be generally better to do it through an enhancement to
>> the autovectorizer, the complexity of doing it by hand is trivial and the
>> gain is sizable (2x speedup) even without the Vector API. The algorithm has
>> been proposed by Richard Startin and Paul Sandoz [1].
>>
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`:
>>
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>>
>> At Datadog, we handle a great amount of text (through logs management for
>> example), and hashing String represents a large part of our CPU usage. It's
>> very unlikely that we are the only ones, as String.hashCode is such a core
>> feature of the JVM-based languages, with its use in HashMap for example.
>> Having even only a 2x speedup would allow us to save thousands of CPU cores
>> per month and improve correspondingly the energy/carbon impact.
>>
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Add UTF-16 benchmarks

An awkward effect of this implementation is that it perturbs results on small
Strings a bit: it adds a branch in the trivial case, but also regresses on
certain lengths (try size=7). The added complexity seems to be no issue for
JITs in these microbenchmarks, but I worry that the increased code size might
play tricks with inlining heuristics in real applications.

After chatting a bit with @richardstartin about the observation that
preventing a strength reduction on the constant 31 is part of the
improvement, I devised an experiment which simply makes the 31 non-constant
so as to disable the strength reduction:

    private static int base = 31;

    @Benchmark
    public int scalarLatin1_NoStrengthReduction() {
        int h = 0;
        int i = 0, len = latin1.length;
        for (; i < len; i++) {
            h = base * h + (latin1[i] & 0xff);
        }
        return h;
    }

Interestingly, the results of that land between the baseline and the manually
unrolled versions on large inputs, while avoiding most of the irregularities
the unrolled versions show on small inputs:

Benchmark                                                  (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.scalarLatin1                           1  avgt    3   2.910 ± 0.068  ns/op
StringHashCode.Algorithm.scalarLatin1                           7  avgt    3   6.530 ± 0.065  ns/op
StringHashCode.Algorithm.scalarLatin1                           8  avgt    3   7.472 ± 0.034  ns/op
StringHashCode.Algorithm.scalarLatin1                          15  avgt    3  12.750 ± 0.028  ns/op
StringHashCode.Algorithm.scalarLatin1                         100  avgt    3  99.190 ± 0.618  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 1  avgt    3   3.119 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 7  avgt    3  11.556 ± 4.690  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 8  avgt    3   7.740 ± 0.005  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                15  avgt    3  13.030 ± 0.124  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16               100  avgt    3  46.470 ± 0.496  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  1  avgt    3   3.123 ± 0.057  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  7  avgt    3  11.380 ± 0.085  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  8  avgt    3   5.849 ± 0.583  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                 15  avgt    3  12.312 ± 0.025  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                100  avgt    3  45.751 ± 0.146  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       1  avgt    3   3.173 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       7  avgt    3   5.229 ± 0.455  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       8  avgt    3   5.679 ± 0.012  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction      15  avgt    3   8.731 ± 0.103  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction     100  avgt    3  70.954 ± 3.386  ns/op

I wonder if this might be a safer play while we investigate intrinsification
and other possible enhancements?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700
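For readers following along: the strength reduction in question is the JIT
rewriting a multiply by the constant 31 into a shift and a subtract, since
31 * h == (h << 5) - h even under int overflow. A minimal sketch of the two
equivalent loop bodies (class and method names are mine, not from the PR):

    // Illustration only: both loops compute the same polynomial hash.
    public class StrengthReduction {

        // Hash over Latin-1 bytes written with the multiply, matching
        // the shape of String.hashCode.
        static int hashMul(byte[] bytes) {
            int h = 0;
            for (byte b : bytes) {
                h = 31 * h + (b & 0xff);
            }
            return h;
        }

        // The strength-reduced form the JIT emits for the constant 31:
        // 31 * h == (h << 5) - h, also modulo 2^32.
        static int hashShift(byte[] bytes) {
            int h = 0;
            for (byte b : bytes) {
                h = (h << 5) - h + (b & 0xff);
            }
            return h;
        }

        public static void main(String[] args) {
            byte[] data = "hello".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
            // For pure-ASCII input both forms agree with String.hashCode.
            if (hashMul(data) != hashShift(data)) throw new AssertionError();
            if (hashMul(data) != "hello".hashCode()) throw new AssertionError();
            System.out.println("ok");
        }
    }

Making the 31 non-constant, as in the benchmark above, keeps C2 from applying
that rewrite, which is what the `_NoStrengthReduction` variant measures.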