Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9]

2022-04-13 Thread Claes Redestad
On Thu, 7 Apr 2022 07:01:23 GMT, Ludovic Henry  wrote:

>> Despite the hash value being cached for Strings, computing the hash still 
>> represents a significant CPU usage for applications handling lots of text.
>> 
>> Even though it would be generally better to do it through an enhancement to 
>> the autovectorizer, the complexity of doing it by hand is trivial and the 
>> gain is sizable (2x speedup) even without the Vector API. The algorithm has 
>> been proposed by Richard Startin and Paul Sandoz [1].
>> 
>> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
>> 
>> 
>> Benchmark(size)  Mode  Cnt Score 
>>Error  Units
>> StringHashCode.Algorithm.scalarLatin1 0  avgt   25 2.111 
>> ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1 1  avgt   25 3.500 
>> ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin110  avgt   25 7.001 
>> ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1   100  avgt   2561.285 
>> ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1  1000  avgt   25   628.995 
>> ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1 1  avgt   25  6307.990 
>> ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   0  avgt   25 2.358 
>> ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   1  avgt   25 3.631 
>> ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16  10  avgt   25 7.049 
>> ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16 100  avgt   2533.626 
>> ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled161000  avgt   25   317.811 
>> ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   1  avgt   25  3212.333 
>> ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled80  avgt   25 2.356 
>> ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled81  avgt   25 3.630 
>> ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8   10  avgt   25 8.724 
>> ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8  100  avgt   2532.402 
>> ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8 1000  avgt   25   321.949 
>> ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled81  avgt   25  3202.083 
>> ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16  0  avgt   25 2.135 
>> ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16  1  avgt   25 5.202 
>> ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16 10  avgt   2511.105 
>> ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16100  avgt   2575.974 
>> ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16   1000  avgt   25   716.429 
>> ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16  1  avgt   25  7095.459 
>> ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled160  avgt   25 2.381 
>> ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled161  avgt   25 5.268 
>> ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16   10  avgt   2511.248 
>> ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16  100  avgt   2552.966 
>> ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16 1000  avgt   25   450.912 
>> ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled161  avgt   25  4403.988 
>> ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8 0  avgt   25 2.401 
>> ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8 1  avgt   25 5.091 
>> ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled810  avgt   2512.801 
>> ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8   100  avgt   2552.068 
>> ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8  1000  avgt   25   453.270 
>> ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8 1  avgt   25  4433.112 
>> ±  2.699  ns/op
>> 
>> 
>> At Datadog, we handle a great amount of text (through logs management for 
>> example), and hashing String represents a large part of our CPU usage. It's 
>> very unlikely that we are the only one as String.hashCode is such a core 
>> feature of the JVM-based languages with its use in HashMap for example. 
>> Having even only a 2x speedup would allow us to save thousands of CPU cores 
>> per month and improve correspondingly the energy/carbon impact.
>> 
>> [1] 
>> https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional 
> commit since 

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9]

2022-04-07 Thread Ludovic Henry
> Despite the hash value being cached for Strings, computing the hash still 
> represents a significant CPU usage for applications handling lots of text.
> 
> Even though it would be generally better to do it through an enhancement to 
> the autovectorizer, the complexity of doing it by hand is trivial and the 
> gain is sizable (2x speedup) even without the Vector API. The algorithm has 
> been proposed by Richard Startin and Paul Sandoz [1].
> 
> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
> 
> 
> Benchmark(size)  Mode  Cnt Score  
>   Error  Units
> StringHashCode.Algorithm.scalarLatin1 0  avgt   25 2.111 
> ±  0.210  ns/op
> StringHashCode.Algorithm.scalarLatin1 1  avgt   25 3.500 
> ±  0.127  ns/op
> StringHashCode.Algorithm.scalarLatin110  avgt   25 7.001 
> ±  0.099  ns/op
> StringHashCode.Algorithm.scalarLatin1   100  avgt   2561.285 
> ±  0.444  ns/op
> StringHashCode.Algorithm.scalarLatin1  1000  avgt   25   628.995 
> ±  0.846  ns/op
> StringHashCode.Algorithm.scalarLatin1 1  avgt   25  6307.990 
> ±  4.071  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16   0  avgt   25 2.358 
> ±  0.092  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16   1  avgt   25 3.631 
> ±  0.159  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16  10  avgt   25 7.049 
> ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16 100  avgt   2533.626 
> ±  1.218  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled161000  avgt   25   317.811 
> ±  1.225  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16   1  avgt   25  3212.333 
> ± 14.621  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled80  avgt   25 2.356 
> ±  0.097  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled81  avgt   25 3.630 
> ±  0.158  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8   10  avgt   25 8.724 
> ±  0.065  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8  100  avgt   2532.402 
> ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8 1000  avgt   25   321.949 
> ±  0.251  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled81  avgt   25  3202.083 
> ±  1.667  ns/op
> StringHashCode.Algorithm.scalarUTF16  0  avgt   25 2.135 
> ±  0.191  ns/op
> StringHashCode.Algorithm.scalarUTF16  1  avgt   25 5.202 
> ±  0.362  ns/op
> StringHashCode.Algorithm.scalarUTF16 10  avgt   2511.105 
> ±  0.112  ns/op
> StringHashCode.Algorithm.scalarUTF16100  avgt   2575.974 
> ±  0.702  ns/op
> StringHashCode.Algorithm.scalarUTF16   1000  avgt   25   716.429 
> ±  3.290  ns/op
> StringHashCode.Algorithm.scalarUTF16  1  avgt   25  7095.459 
> ± 43.847  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled160  avgt   25 2.381 
> ±  0.038  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled161  avgt   25 5.268 
> ±  0.422  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16   10  avgt   2511.248 
> ±  0.178  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16  100  avgt   2552.966 
> ±  0.089  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16 1000  avgt   25   450.912 
> ±  1.834  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled161  avgt   25  4403.988 
> ±  2.927  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8 0  avgt   25 2.401 
> ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8 1  avgt   25 5.091 
> ±  0.396  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled810  avgt   2512.801 
> ±  0.189  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8   100  avgt   2552.068 
> ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8  1000  avgt   25   453.270 
> ±  0.340  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8 1  avgt   25  4433.112 
> ±  2.699  ns/op
> 
> 
> At Datadog, we handle a great amount of text (through logs management for 
> example), and hashing String represents a large part of our CPU usage. It's 
> very unlikely that we are the only one as String.hashCode is such a core 
> feature of the JVM-based languages with its use in HashMap for example. 
> Having even only a 2x speedup would allow us to save thousands of CPU cores 
> per month and improve correspondingly the energy/carbon impact.
> 
> [1] 
> https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf

Ludovic Henry has updated the pull request incrementally with one additional 
commit since the last revision:

  Fix some merge conflicts

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7700/files
  - new: