On Fri, 2 Sep 2022 09:32:56 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This PR delivers ChaCha20 intrinsics that accelerate the core block function 
>> that generates key stream from the key, counter and nonce.  Intrinsics have 
>> been written for the following platforms and instruction sets:
>> 
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>> 
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the 
>> pending Poly1305 intrinsics to be delivered in #10582)
>> 
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>> 
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt       Score      Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  772956.829 ± 4434.965  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40  230478.075 ±  660.617  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   61504.367 ±  187.485  ops/s
>> ChaCha20.decrypt                16384    thrpt   40   15671.893 ±   59.860  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  793708.698 ± 3587.562  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40  232413.842 ±  808.766  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   61586.483 ±   94.821  ops/s
>> ChaCha20.encrypt                16384    thrpt   40   15749.637 ±   34.497  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40  219991.514 ± 2117.364  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40  101672.568 ± 1921.214  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40   32582.073 ±  946.061  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    8485.793 ±   26.348  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40  291605.327 ± 2893.898  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40  121034.948 ± 2545.312  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40   32657.343 ±  114.322  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    8527.834 ±   33.711  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    12113.919 ±    45.792  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    14205.830 ±    59.411  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    95181.548 ±  1638.579  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    12643.929 ±    55.748  ops/s
>> 
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3, part : 0xd0c, revision : 1
>> 
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    26001.888 ±    71.907  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    11150.783 ±    27.911  ops/s
>> 
>> Intrinsics enabled
>> ------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    44191.925 ±    72.996  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    13606.192 ±    52.134  ops/s
>> 
>> Special thanks to the folks who have made many helpful comments while this 
>> PR was in draft form.
>
> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2521:
> 
>> 2519: #undef INSN3
>> 2520: #undef INSN4
>> 2521: 
> 
> This code to handle the AdvSIMD load/store single structure and AdvSIMD 
> load/store single structure (post-indexed) is excessive.
> 
> Every one of these instructions has the format,
> 
> `0|Q|0011010|L|R|00000|opcode|S|size|Rn|Rt`
> 
> or
> 
> `0|Q|0011011|L|R|   Rm|opcode|S|size|Rn|Rt`
> 
> Perhaps consider using a `RegSet regs` for the registers. Then the 
> instruction encoding to use (1,2,3,or 4 consecutive registers) can be picked 
> up from `regs.size()`. There only needs to be a single routine for all of the 
> `ld_st` variants.

Thanks for the suggestion.  I will look into this.  I can see how `regs.size()` 
could simplify these macros.
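
To make the idea concrete for myself, here is a minimal standalone sketch (not 
HotSpot code) of how one routine could pick the encoding from the register 
count.  The bit positions follow the first (non-post-indexed) format quoted 
above.  The helper name `ld_st_single_bits` is hypothetical, and the rule that 
the register count minus one is split across `R` (low bit) and `opcode<0>` 
(high bit) is my reading of the architecture manual, so it would need to be 
checked there before being relied on.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of assembler_aarch64.hpp.
// Builds the fixed bits plus Q, L, R and opcode<0> of
//   0|Q|0011010|L|R|00000|opcode|S|size|Rn|Rt
// for an AdvSIMD load/store single structure using nregs registers (1..4).
// Assumption: (nregs - 1) is encoded as the two-bit value opcode<0>:R.
inline uint32_t ld_st_single_bits(int nregs, bool is_load, bool q) {
  assert(nregs >= 1 && nregs <= 4);
  uint32_t selem = static_cast<uint32_t>(nregs - 1);
  uint32_t insn = 0x1Au << 23;           // fixed pattern 0011010 in bits 29..23
  insn |= (q ? 1u : 0u) << 30;           // Q
  insn |= (is_load ? 1u : 0u) << 22;     // L
  insn |= (selem & 1u) << 21;            // R
  insn |= ((selem >> 1) & 1u) << 13;     // opcode<0>
  return insn;
}
```

In the real assembler the count would come from `regs.size()`, and the 
remaining fields (element size, `S`, `Rn`, `Rt`) would be filled in by the 
caller as they are today.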

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4068:
> 
>> 4066:         __ ext(c, __ T16B, c, c, cCnt);             \
>> 4067:         __ ext(d, __ T16B, d, d, dCnt);             \
>> 4068: 
> 
> There's a fairly extensive use of macros here for the rounds, but I don't 
> think there's any need for them to be macros. `SHIFT_LANES` and all the other 
> macros here should be functions. This would reduce the size of the libjvm.so 
> binary.

Thanks for the feedback.  I've been wondering if I might need something like a 
macroAssembler_<arch>_chapoly.cpp file to handle these kinds of things, along 
with future functions for Poly1305 when I start in on that.  I wasn't aware 
that the macro approach affects the size of libjvm.so compared with functions.  
I'll pull these out into functions.
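
As a rough illustration of the macro-versus-function point, here is a minimal 
standalone sketch.  `ToyAssembler`, `SHIFT_LANES_MACRO` and `shift_lanes` are 
hypothetical names and the parameter list is assumed; the real `SHIFT_LANES` in 
the stub generator works on vector registers through the assembler.  The macro 
pastes its body into every use site, while the function body exists once, and 
both forms emit the same instruction sequence.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the assembler: records mnemonics as strings.
struct ToyAssembler {
  std::vector<std::string> out;
  void ext(const std::string& reg, int byteCnt) {
    out.push_back("ext " + reg + ", #" + std::to_string(byteCnt));
  }
};

// Macro form: the three ext() calls are expanded at every use site.
#define SHIFT_LANES_MACRO(masm, b, c, d, bCnt, cCnt, dCnt) \
  do {                                                     \
    (masm).ext((b), (bCnt));                               \
    (masm).ext((c), (cCnt));                               \
    (masm).ext((d), (dCnt));                               \
  } while (0)

// Function form: one compiled body, called wherever the rounds need it.
inline void shift_lanes(ToyAssembler& masm,
                        const std::string& b, const std::string& c,
                        const std::string& d,
                        int bCnt, int cCnt, int dCnt) {
  masm.ext(b, bCnt);
  masm.ext(c, cCnt);
  masm.ext(d, dCnt);
}
```

Calling either form with the same registers and byte counts leaves identical 
contents in `out`, so the generated stub would not change.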

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4141:
> 
>> 4139:     // rotation tbl instruction.
>> 4140:     __ lea(tmpAddr, ExternalAddress(
>> 4141:                 StubRoutines::aarch64::chacha20_constdata()));
> 
> Better to move `cc20_gen_constdata()` to the start of `cc20_gen_constdata()`, 
> mark it with a `Label`, and use `adr(tmpAddr, LABEL);` .

I think I see what you're saying from looking at `generate_sha1_implCompress()` 
and how it uses `adr`.  I also see what looks like a similar approach in some 
functions in the same file, which define the constant data via a `static const 
uint64_t foo[] = { ... };`, load its address via `lea(reg, 
ExternalAddress((address) foo))`, and proceed from there (see 
`generate_sha3_implCompress()`).  To my eye that looks a bit more 
straightforward, and it seems to be used more often in the file than the `adr` 
approach for defining constants.  What I don't know is whether one approach is 
better than the other for other reasons, such as performance or memory 
consumption.  Do you have a preference one way or the other?

-------------

PR: https://git.openjdk.org/jdk/pull/7702
