On Mon, 6 Jun 2022 20:57:49 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:

>> This PR delivers ChaCha20 intrinsics that accelerate the core block function
>> that generates key stream from the key, counter and nonce. Intrinsics have
>> been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the
>> pending Poly1305 intrinsics to be delivered in #10582)
>>
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40   772956.829 ±  4434.965  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   230478.075 ±   660.617  ops/s
>> ChaCha20.decrypt                4096  thrpt   40    61504.367 ±   187.485  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    15671.893 ±    59.860  ops/s
>> ChaCha20.encrypt                 256  thrpt   40   793708.698 ±  3587.562  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   232413.842 ±   808.766  ops/s
>> ChaCha20.encrypt                4096  thrpt   40    61586.483 ±    94.821  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    15749.637 ±    34.497  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   219991.514 ±  2117.364  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   101672.568 ±  1921.214  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    32582.073 ±   946.061  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40     8485.793 ±    26.348  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   291605.327 ±  2893.898  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   121034.948 ±  2545.312  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    32657.343 ±   114.322  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40     8527.834 ±    33.711  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    31788.969 ±    90.770  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12113.919 ±    45.792  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    58112.296 ±   498.048  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    14205.830 ±    59.411  ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                1024  thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                1024  thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    95181.548 ±  1638.579  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    12643.929 ±    55.748  ops/s
>>
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>> part : 0xd0c, revision : 1
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    26001.888 ±    71.907  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    11150.783 ±    27.911  ops/s
>>
>> Intrinsics enabled
>> ------------------
>> Benchmark                 (dataSize)   Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                 256  thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                1024  thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                4096  thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt               16384  thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                 256  thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                1024  thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                4096  thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt               16384  thrpt   40    44191.925 ±    72.996  ops/s
>>
>> ChaCha20Poly1305.decrypt         256  thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt        1024  thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt        4096  thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt       16384  thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt         256  thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt        1024  thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt        4096  thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt       16384  thrpt   40    13606.192 ±    52.134  ops/s
>>
>> Special thanks to the folks who have made many helpful comments while this
>> PR was in draft form.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5364:
>
>> 5362:
>> 5363:   /* The 2-block AVX/AVX2-enabled ChaCha20 block function implementation */
>> 5364:   address generate_chacha20Block_avx() {
>
> Considering you already introduce a dedicated CPP file, it makes sense to
> move the guts of this function into `macroAssembler_x86_chapoly.cpp`.

I've updated the code to follow your example with AES and moved the intrinsics
into their own stubGenerator_x86_64_chapoly.cpp. I hope that will
compartmentalize things. I'm not sure if I should combine the
macroAssembler_x86_64_chapoly.cpp with the new stubGenerator file. Certainly
willing to do that, but I wanted to get the dedicated stubGenerator file
working first since the last merge was ugly. Going forward merges should be
much easier now that my code is compartmentalized.

-------------

PR: https://git.openjdk.org/jdk/pull/7702