On Wed, 16 Mar 2022 00:48:17 GMT, Sandhya Viswanathan <sviswanat...@openjdk.org> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function
>> that generates key stream from the key, counter and nonce. Intrinsics have
>> been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
>> pending Poly1305 intrinsics to be delivered in #10582)
>>
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
>> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
>> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
>> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>>
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3, part : 0xd0c, revision : 1
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>>
>> Intrinsics enabled
>> ------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>>
>> Special thanks to the folks who have made many helpful comments while this
>> PR was in draft form.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 5027:
>
>> 5025: (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");
>> 5026: NOT_LP64(assert(VM_Version::supports_sse2(), ""));
>> 5027: InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
>
> legacy_mode here should be _legacy_mode_bw.

Good catch, fixed, along with all the other similar findings below.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5682:
>
>> 5680: /* Add mask for 4-block ChaCha20 Block calculations */
>> 5681: address chacha20_ctradd_avx512() {
>> 5682: __ align(CodeEntryAlignment);
>
> This could be __ align64();

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5698:
>
>> 5696: /* Scatter mask for key stream output on AVX-512 */
>> 5697: address chacha20_scmask_avx512() {
>> 5698: __ align(CodeEntryAlignment);
>
> This could be __ align64();

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5728:
>
>> 5726: const XMMRegister zmm_cVec = xmm2;
>> 5727: const XMMRegister zmm_dVec = xmm3;
>> 5728: const XMMRegister zmm_scratch = xmm4;
>
> We could have 5 additional scratch registers zmm_s1 .. zmm_s5 (mapping to
> xmm5 ... xmm9) to keep values read from memory in registers.

For AVX-512 I was fortunately able to get it to work with 4 scratch
registers. For AVX and AVX2 I think the same approach can work, but since
there are no lane-wise bit rotation instructions (just L/R shifts) that I can
find, I need a 5th scratch register. For the 32-bit version it is a little
more complicated as there are only 8 SIMD registers to work with. I think
even there I could simply read the state from memory for one
memory-to-register add instead of doing 4, and then hold the other 128-bit
state lines in 3 scratch registers. I'm going to experiment with that a bit
to see how much I can limit memory fetches to get some improvements on both
64-bit and 32-bit.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5738:
>
>> 5736: __ evbroadcasti32x4(zmm_bVec, Address(state, 16), Assembler::AVX_512bit);
>> 5737: __ evbroadcasti32x4(zmm_cVec, Address(state, 32), Assembler::AVX_512bit);
>> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
>
> zmm_aVec to zmm_dVec could be copied into zmm_s1 to zmm_s4 respectively,
> thereby eliminating the broadcast needed later. For example:
> __ evmovdquq(zmm_s1, zmm_aVec, Assembler::AVX_512bit);

A good suggestion, this has been changed.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5740:
>
>> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
>> 5739:
>> 5740: __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
>
> The chacha20_counter_addmask_avx512() could be preloaded into zmm_s5 before
> line 5735 as follows:
> __ evmovdquq(zmm_s5, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
> vpaddd can then use zmm_s5, and the later usage could use zmm_s5 directly.

Another good improvement, done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5827:
>
>> 5825: __ evbroadcasti32x4(zmm_scratch, Address(state, 48), Assembler::AVX_512bit);
>> 5826: __ vpaddd(zmm_dVec, zmm_dVec, zmm_scratch, Assembler::AVX_512bit);
>> 5827: __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
>
> These could directly use the values in zmm_s1 to zmm_s5 registers:
> __ vpaddd(zmm_aVec, zmm_aVec, zmm_s1, Assembler::AVX_512bit);
> ...
> __ vpaddd(zmm_dVec, zmm_dVec, zmm_s5, Assembler::AVX_512bit);

Keeping the original broadcasted state data in registers was a good idea, as
it saved me the extra memory reads at the end of the loop. Fixed as
recommended.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:
>
>> 5840: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32), writeMask, zmm_cVec, Assembler::AVX_512bit);
>> 5841: __ knotwl(writeMask, writeMask);
>> 5842: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48), writeMask, zmm_dVec, Assembler::AVX_512bit);
>
> Using vextracti32x4 instead of evpscatterdd would give better performance:
> __ vextracti32x4(Address(result, 0), zmm_aVec, 0);
> __ vextracti32x4(Address(result, 64), zmm_aVec, 1);
> __ vextracti32x4(Address(result, 128), zmm_aVec, 2);
> __ vextracti32x4(Address(result, 192), zmm_aVec, 3);
> __ vextracti32x4(Address(result, 16), zmm_bVec, 0);
> __ vextracti32x4(Address(result, 80), zmm_bVec, 1);
> __ vextracti32x4(Address(result, 144), zmm_bVec, 2);
> __ vextracti32x4(Address(result, 208), zmm_bVec, 3);
> __ vextracti32x4(Address(result, 32), zmm_cVec, 0);
> __ vextracti32x4(Address(result, 96), zmm_cVec, 1);
> __ vextracti32x4(Address(result, 160), zmm_cVec, 2);
> __ vextracti32x4(Address(result, 224), zmm_cVec, 3);
> __ vextracti32x4(Address(result, 48), zmm_dVec, 0);
> __ vextracti32x4(Address(result, 112), zmm_dVec, 1);
> __ vextracti32x4(Address(result, 176), zmm_dVec, 2);
> __ vextracti32x4(Address(result, 240), zmm_dVec, 3);

I have been wondering about this approach for a while now, since I did
something similar for the AVX2 version. I had assumed that evpscatterdd would
use fewer instructions and therefore be more efficient, but I'm more than
happy to move to the vextracti32x4 approach. I'll be eager to see how it
impacts performance, along with the increased storage of intermediate data in
additional XMMRegister objects.

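For anyone following along, here is how I read the store pattern in the
vextracti32x4 suggestion above. This is a plain C++ model, not the stub code:
the Zmm type and the names lane_offset and store_keystream are made up for
illustration, and it assumes (as the offsets in the suggestion imply) that
each register holds one 16-byte state row for four blocks, one block per
128-bit lane. Under that assumption, row r of block k lands at byte offset
64*k + 16*r, so every block's 64 bytes of key stream end up contiguous.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of one 512-bit register: four 128-bit lanes,
// each lane holding four 32-bit words.
using Zmm = std::array<std::array<std::uint32_t, 4>, 4>;

// Byte offset in the key stream buffer for state row `row`
// (a = 0, b = 1, c = 2, d = 3) of the block held in lane `lane`.
constexpr std::size_t lane_offset(std::size_t row, std::size_t lane) {
    return 64 * lane + 16 * row;
}

// Write four interleaved blocks as sixteen contiguous 16-byte stores,
// one per (row, lane) pair, in the same order as the suggested sequence.
inline void store_keystream(const std::array<Zmm, 4>& rows,
                            std::uint8_t* result) {
    assert(result != nullptr);
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            std::memcpy(result + lane_offset(row, lane),
                        rows[row][lane].data(), 16);
        }
    }
}
```

Calling store_keystream with the four row registers and a 256-byte buffer
touches the same sixteen offsets listed in the suggestion (0, 64, 128, 192
for the a row, then 16, 80, ... for b, and so on). Each store is a contiguous
16 bytes at a fixed offset, which is the property the suggestion relies on in
place of the index vector and write mask used by the scatter.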
------------- PR: https://git.openjdk.org/jdk/pull/7702