Re: RFR: 8247645: ChaCha20 intrinsics

Jamil Nimeh Sun, 06 Nov 2022 23:43:34 -0800

On Wed, 16 Mar 2022 00:48:17 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:


>> This PR delivers ChaCha20 intrinsics that accelerate the core block function 
>> that generates the key stream from the key, counter and nonce.  Intrinsics 
>> have been written for the following platforms and instruction sets:
>> 
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>> 
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the 
>> pending Poly1305 intrinsics to be delivered in #10582)
>> 
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>> 
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt       Score      Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  772956.829 ± 4434.965  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40  230478.075 ±  660.617  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   61504.367 ±  187.485  ops/s
>> ChaCha20.decrypt                16384    thrpt   40   15671.893 ±   59.860  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  793708.698 ± 3587.562  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40  232413.842 ±  808.766  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   61586.483 ±   94.821  ops/s
>> ChaCha20.encrypt                16384    thrpt   40   15749.637 ±   34.497  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40  219991.514 ± 2117.364  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40  101672.568 ± 1921.214  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40   32582.073 ±  946.061  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    8485.793 ±   26.348  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40  291605.327 ± 2893.898  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40  121034.948 ± 2545.312  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40   32657.343 ±  114.322  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    8527.834 ±   33.711  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1293211.662 ±  9833.892  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   450135.559 ±  1614.303  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   123675.797 ±   576.160  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    31707.566 ±    93.988  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   250683.639 ±  3990.340  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   131000.144 ±  2895.410  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    45215.542 ±  1368.148  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    11879.307 ±    55.006  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   355255.774 ±  5397.267  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   156057.380 ±  4294.091  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    47016.845 ±  1618.779  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    12113.919 ±    45.792  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1824729.604 ± 12130.198  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   746024.477 ±  3921.472  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   219662.823 ±  2128.901  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    57198.868 ±   221.973  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   260529.149 ±  4298.662  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   144967.984 ±  4558.697  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    50047.575 ±   171.204  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    13976.999 ±    72.299  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   378971.408 ±  9324.721  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   179361.248 ±  7968.109  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    55727.145 ±  2860.765  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    14205.830 ±    59.411  ops/s
>> 
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1182958.956 ±  7782.532  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40  1003530.400 ± 10315.996  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   339428.341 ±  2376.804  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    92903.498 ±  1112.425  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1266584.736 ±  5101.597  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40  1059717.173 ±  9435.649  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   350520.581 ±  2787.593  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    95181.548 ±  1638.579  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   200722.479 ±  2045.896  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   124660.386 ±  3869.517  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    44059.327 ±   143.765  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    12412.936 ±    54.845  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   274528.005 ±  2945.416  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   145146.188 ±   857.254  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    47045.637 ±   128.049  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    12643.929 ±    55.748  ops/s
>> 
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>>   part : 0xd0c, revision : 1
>> 
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1301037.920 ±  1734.836  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   387115.013 ±  1122.264  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   102591.108 ±   229.456  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    25878.583 ±    89.351  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1332737.880 ±  2478.508  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   390288.663 ±  2361.851  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   101882.728 ±   744.907  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    26001.888 ±    71.907  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   351189.393 ±  2209.148  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   142960.999 ±   361.619  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    42437.822 ±    85.557  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    11173.152 ±    24.969  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   444870.664 ± 12571.799  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   158481.143 ±  2149.208  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    43610.721 ±   282.795  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    11150.783 ±    27.911  ops/s
>> 
>> Intrinsics enabled
>> ------------------
>> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
>> ChaCha20.decrypt                  256    thrpt   40  1907215.648 ±  3163.767  ops/s
>> ChaCha20.decrypt                 1024    thrpt   40   631804.007 ±   736.430  ops/s
>> ChaCha20.decrypt                 4096    thrpt   40   172280.991 ±   362.190  ops/s
>> ChaCha20.decrypt                16384    thrpt   40    44150.254 ±    98.927  ops/s
>> ChaCha20.encrypt                  256    thrpt   40  1990050.859 ±  6380.625  ops/s
>> ChaCha20.encrypt                 1024    thrpt   40   636574.405 ±  3332.471  ops/s
>> ChaCha20.encrypt                 4096    thrpt   40   173258.615 ±   327.199  ops/s
>> ChaCha20.encrypt                16384    thrpt   40    44191.925 ±    72.996  ops/s
>> 
>> ChaCha20Poly1305.decrypt          256    thrpt   40   360555.774 ±  1988.467  ops/s
>> ChaCha20Poly1305.decrypt         1024    thrpt   40   162093.489 ±   413.684  ops/s
>> ChaCha20Poly1305.decrypt         4096    thrpt   40    50799.888 ±   110.955  ops/s
>> ChaCha20Poly1305.decrypt        16384    thrpt   40    13560.165 ±    32.208  ops/s
>> ChaCha20Poly1305.encrypt          256    thrpt   40   458079.724 ± 13746.235  ops/s
>> ChaCha20Poly1305.encrypt         1024    thrpt   40   188228.966 ±  3498.480  ops/s
>> ChaCha20Poly1305.encrypt         4096    thrpt   40    52665.733 ±   151.740  ops/s
>> ChaCha20Poly1305.encrypt        16384    thrpt   40    13606.192 ±    52.134  ops/s
>> 
>> Special thanks to the folks who have made many helpful comments while this 
>> PR was in draft form.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 5027:
> 
>> 5025:             (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");
>> 5026:     NOT_LP64(assert(VM_Version::supports_sse2(), ""));
>> 5027:     InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
> 
> legacy_mode here should be _legacy_mode_bw.

Good catch, fixed, along with all the other similar findings below.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5682:
> 
>> 5680:   /* Add mask for 4-block ChaCha20 Block calculations */
>> 5681:   address chacha20_ctradd_avx512() {
>> 5682:     __ align(CodeEntryAlignment);
> 
> This could be __ align64();

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5698:
> 
>> 5696:   /* Scatter mask for key stream output on AVX-512 */
>> 5697:   address chacha20_scmask_avx512() {
>> 5698:     __ align(CodeEntryAlignment);
> 
> This could be __ align64();

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5728:
> 
>> 5726:     const XMMRegister zmm_cVec = xmm2;
>> 5727:     const XMMRegister zmm_dVec = xmm3;
>> 5728:     const XMMRegister zmm_scratch = xmm4;
> 
> We could have 5 additional scratch registers zmm_s1 .. zmm_s5 (mapping to 
> xmm5 ... xmm9) to keep values read from memory in registers.

For AVX-512 I was fortunately able to get it to work with 4 scratch registers.  
For AVX and AVX2 I think the same approach can work, but since I can't find any 
lanewise bit rotation instructions there (just left/right shifts), each 
rotation has to be built from two shifts and I need a 5th scratch register.
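
To illustrate the rotation point, here is a minimal C++ sketch, assuming 
scalar code as a stand-in for a single 32-bit lane.  It is not the stub code, 
and the name rotl32_shifts is made up for this example.  Without a rotate 
instruction, the rotation is two shifts plus an OR, and one of the shift 
results needs a temporary to live in:

```cpp
#include <cassert>
#include <cstdint>

// Rotate-left built from two shifts and an OR, which is what the AVX/AVX2
// path has to do per 32-bit lane. The local `wrapped` plays the role of the
// extra scratch register. (Hypothetical helper, for illustration only.)
inline uint32_t rotl32_shifts(uint32_t x, unsigned n) {
    // n must be in 1..31; ChaCha20 only uses 16, 12, 8 and 7.
    uint32_t wrapped = x >> (32 - n);  // bits that wrap around to the bottom
    uint32_t shifted = x << n;         // bits that stay, moved up by n
    return shifted | wrapped;
}
```

AVX-512 has a per-lane rotate (VPROLD) that does this in one instruction with 
no temporary, which is consistent with needing one fewer scratch register 
there.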

The 32-bit version is a little more complicated, as there are only 8 SIMD 
registers to work with.  Even there, I think I could read the state from 
memory for just one memory-to-register add instead of 4, and hold the other 
128-bit state lines in 3 scratch registers.  I'm going to experiment with that 
a bit to see how far I can limit memory fetches and get some improvement on 
both 64-bit and 32-bit.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5738:
> 
>> 5736:     __ evbroadcasti32x4(zmm_bVec, Address(state, 16), Assembler::AVX_512bit);
>> 5737:     __ evbroadcasti32x4(zmm_cVec, Address(state, 32), Assembler::AVX_512bit);
>> 5738:     __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
> 
> zmm_aVec to zmm_dVec could be copied into zmm_s1 to zmm_s4 respectively, 
> thereby eliminating the broadcasts needed later. For example:
>  __ evmovdquq(zmm_s1, zmm_aVec, Assembler::AVX_512bit);

A good suggestion, this has been changed.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5740:
> 
>> 5738:     __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
>> 5739: 
>> 5740:     __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
> 
> The chacha20_counter_addmask_avx512() could be preloaded into zmm_s5 before 
> line 5735 as follows:
>  __ evmovdquq(zmm_s5, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
> vpaddd can then use zmm_s5, and the later usage could use zmm_s5 directly.

Another good improvement, done.
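
For anyone following along, this is roughly what the counter add mask does, 
as a minimal C++ sketch that models a 512-bit register as four 128-bit lanes.  
The type and function names are made up for the example, and the mask layout 
(lane i adds i to the first word of the counter row) is my shorthand for the 
4-block case, not a copy of the stub data:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Row     = std::array<uint32_t, 4>;  // one 128-bit state row
using WideRow = std::array<Row, 4>;       // four lanes of a 512-bit register

// Copy one 128-bit row into all four lanes (the role of evbroadcasti32x4).
WideRow broadcast_row(const Row& r) {
    WideRow w{};
    for (auto& lane : w) lane = r;
    return w;
}

// Assumed mask layout: lane i carries {i, 0, 0, 0}, so each of the four
// blocks computed in parallel gets its own block counter.
WideRow counter_add_mask() {
    WideRow m{};
    for (uint32_t lane = 0; lane < 4; ++lane) m[lane][0] = lane;
    return m;
}

// Lanewise 32-bit add (the role of vpaddd); wraps modulo 2^32.
WideRow add_lanes(const WideRow& a, const WideRow& b) {
    WideRow out{};
    for (int l = 0; l < 4; ++l)
        for (int w = 0; w < 4; ++w)
            out[l][w] = a[l][w] + b[l][w];
    return out;
}
```

Holding the result of counter_add_mask() in a register once means both the 
setup add and the add at the end of the loop can reuse it without going back 
to memory.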

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5827:
> 
>> 5825:     __ evbroadcasti32x4(zmm_scratch, Address(state, 48), Assembler::AVX_512bit);
>> 5826:     __ vpaddd(zmm_dVec, zmm_dVec, zmm_scratch, Assembler::AVX_512bit);
>> 5827:     __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
> 
> These could directly use the values in zmm_s1 to zmm_s5 registers  :
>     __ vpaddd(zmm_aVec, zmm_aVec, zmm_s1, Assembler::AVX_512bit);
>     ...
>     __ vpaddd(zmm_dVec, zmm_dVec, zmm_s5, Assembler::AVX_512bit);

Keeping the original broadcast state data in registers was a good idea, as it 
saves the extra trips out to memory at the end of the loop.  Fixed as 
recommended.
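
The reason the original state is needed again at the end is the final add in 
the block function.  Here is a minimal scalar C++ sketch of that function as 
described in RFC 8439, assuming a single block and plain 32-bit words rather 
than vector registers (the names are mine, not the stub's):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using State = std::array<uint32_t, 16>;  // 4x4 matrix of 32-bit words

inline uint32_t rotl(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

// One ChaCha quarter round on the words at indices a, b, c, d.
inline void quarter_round(State& s, int a, int b, int c, int d) {
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl(s[b], 7);
}

// 20 rounds (10 column/diagonal double rounds), then add the untouched
// initial state back in. `initial` is the data worth keeping in registers.
State chacha20_block(const State& initial) {
    State x = initial;  // working copy that the rounds modify
    for (int i = 0; i < 10; ++i) {
        quarter_round(x, 0, 4, 8, 12);   // column rounds
        quarter_round(x, 1, 5, 9, 13);
        quarter_round(x, 2, 6, 10, 14);
        quarter_round(x, 3, 7, 11, 15);
        quarter_round(x, 0, 5, 10, 15);  // diagonal rounds
        quarter_round(x, 1, 6, 11, 12);
        quarter_round(x, 2, 7, 8, 13);
        quarter_round(x, 3, 4, 9, 14);
    }
    for (int i = 0; i < 16; ++i) x[i] += initial[i];  // final add of saved state
    return x;
}
```

A sketch like this can be checked against the block function test vector in 
RFC 8439.  In the vectorized stub the same final add applies to each 512-bit 
row, which is where the saved copies pay off.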

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:
> 
>> 5840:     __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32), writeMask, zmm_cVec, Assembler::AVX_512bit);
>> 5841:     __ knotwl(writeMask, writeMask);
>> 5842:     __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48), writeMask, zmm_dVec, Assembler::AVX_512bit);
> 
> Using the vextracti32x4 instead of evpscatterdd would give better performance:
>     __ vextracti32x4(Address(result, 0), zmm_aVec, 0);
>     __ vextracti32x4(Address(result, 64), zmm_aVec, 1);
>     __ vextracti32x4(Address(result, 128), zmm_aVec, 2);
>     __ vextracti32x4(Address(result, 192), zmm_aVec, 3);
>     __ vextracti32x4(Address(result, 16), zmm_bVec, 0);
>     __ vextracti32x4(Address(result, 80), zmm_bVec, 1);
>     __ vextracti32x4(Address(result, 144), zmm_bVec, 2);
>     __ vextracti32x4(Address(result, 208), zmm_bVec, 3);
>     __ vextracti32x4(Address(result, 32), zmm_cVec, 0);
>     __ vextracti32x4(Address(result, 96), zmm_cVec, 1);
>     __ vextracti32x4(Address(result, 160), zmm_cVec, 2);
>     __ vextracti32x4(Address(result, 224), zmm_cVec, 3);
>     __ vextracti32x4(Address(result, 48), zmm_dVec, 0);
>     __ vextracti32x4(Address(result, 112), zmm_dVec, 1);
>     __ vextracti32x4(Address(result, 176), zmm_dVec, 2);
>     __ vextracti32x4(Address(result, 240), zmm_dVec, 3);

I have been wondering about this approach for a while now, since I did 
something similar for the AVX2 version.  I had assumed that evpscatterdd used 
fewer instructions and would therefore be more efficient, but I'm more than 
happy to move to the vextracti32x4 approach.  I'll be eager to see how it 
impacts performance, along with keeping more intermediate data in additional 
XMMRegister objects.
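
The offsets in the suggested sequence follow a simple pattern: each 128-bit 
lane holds one row of one 64-byte block, so lane i of row r lands at 
64 * i + 16 * r.  A minimal C++ sketch of that layout, with a 512-bit register 
modeled as four lanes (type and function names are made up for illustration, 
and memcpy stands in for the 128-bit stores):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Lane = std::array<uint32_t, 4>;  // one 128-bit lane
using Zmm  = std::array<Lane, 4>;      // a 512-bit register as four lanes

// Byte offset in the key stream buffer for `lane` (block index 0..3) of
// state row `row` (0 = a, 1 = b, 2 = c, 3 = d).
constexpr std::size_t keystream_offset(std::size_t row, std::size_t lane) {
    return 64 * lane + 16 * row;
}

// Write all 16 lanes (4 rows x 4 blocks) as 16 independent 128-bit stores.
// `result` must point to at least 256 writable bytes.
void store_keystream(const std::array<Zmm, 4>& rows, unsigned char* result) {
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t lane = 0; lane < 4; ++lane)
            std::memcpy(result + keystream_offset(row, lane),
                        rows[row][lane].data(), sizeof(Lane));
}
```

Each store targets a contiguous 16-byte region, whereas the scatter writes 
individual 32-bit elements through an index vector.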

-------------

PR: https://git.openjdk.org/jdk/pull/7702
