On Fri, 28 Nov 2025 09:42:22 GMT, Aleksey Shipilev <[email protected]> wrote:

> Oh man, the `pos`/`len` modifications in the current code are confusing.
> I scratched my head for quite a while trying to comprehend why
> `__ bind(MESG_BELOW_32_BLKS)` splits the `pos += 16` and `len -= 16`.
> On the surface, that just looks like a bug.

The combination of handling the fall-through from `ENCRYPT_16_BLKS` and the
conditional entry into `MESG_BELOW_32_BLKS` is subtle.

I had also missed the fall-through case in my initial proposed fix (with
cmp/jcc) until @sviswa7 pointed it out and suggested the current fix. The fix
for `StubGenerator::aesgcm_avx512`, which moves `__ addl(pos, 16 * 16)` to
before `__ bind(MESG_BELOW_32_BLKS)`, now works correctly for both the
fall-through and conditional-jump cases.
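
To make the invariant concrete: every path that reaches `MESG_BELOW_32_BLKS`
must arrive with `pos` and `len` describing the same split of the message.
Below is a minimal, self-contained C++ model of that control-flow shape. It is
a sketch only: `process_tail` and `tail_state_model` are made-up names, the
block handling is reduced to plain arithmetic, and it uses the tightly coupled
`pos`/`len` update style from the diff quoted further down.

```c++
#include <cassert>

// Hypothetical stand-in for the tail handler (gcm_enc_dec_last_avx512 in the
// real stub): it assumes `pos` bytes are done and `len` bytes remain.
static void process_tail(int pos, int len, int total) {
  assert(pos + len == total && "both entries to the label must agree");
}

// Reduced model of the control flow around MESG_BELOW_32_BLKS.
static void tail_state_model(int total_blocks) {
  int pos = 0;
  int len = total_blocks * 16;  // 16 bytes per block, as in the stub

  // First 16-block chunk (initial_blocks_16_avx512), with the pos/len
  // updates placed *before* the conditional branch to the tail label.
  pos += 16 * 16;
  len -= 16 * 16;
  if (len < 32 * 16) {
    // Conditional-jump entry (jcc(below, MESG_BELOW_32_BLKS)).
    process_tail(pos, len, total_blocks * 16);
    return;
  }

  // Second 16-block chunk, then fall through to the same tail handler.
  pos += 16 * 16;
  len -= 16 * 16;
  // Fall-through entry: the same pos/len invariant holds here.
  process_tail(pos, len, total_blocks * 16);
}

int main() {
  tail_state_model(20);  // short message: exercises the conditional-jump path
  tail_state_model(48);  // long message: exercises the fall-through path
  return 0;
}
```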

> 
> But it looks that way because we run `initial_blocks_16_avx512` twice and do
> `pos += 16` twice, but only do the `len -= 32` after the second call. That
> does not help if we shortcut after the first call. In fact, I am not at all
> sure that checking `len < 32` _without_ modifying `len` beforehand does not
> break the assumptions downstream:
> 
> ```
>   initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx,  ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
>   __ addl(pos, 16 * 16);
>   __ cmpl(len, 32 * 16);
>   __ jcc(Assembler::below, MESG_BELOW_32_BLKS);
> ```
> 
> Really, in these kinds of intrinsics, _you want_ to make sure the `pos` and
> `len` updates are tightly bound together; otherwise these kinds of mistakes
> will keep happening. You will lose a bit of code density, but you will have
> more readable and robust code.
> 
> Shouldn't it be like this?
> 
> ```
> diff --git a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> index 1e728ffa279..a16e25b075d 100644
> --- a/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> +++ b/src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp
> @@ -3475,12 +3475,14 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
>  
>    initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx,  ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset);
>    __ addl(pos, 16 * 16);
> +  __ subl(len, 16 * 16);
> +
>    __ cmpl(len, 32 * 16);
>    __ jcc(Assembler::below, MESG_BELOW_32_BLKS);
>  
>    initial_blocks_16_avx512(in, out, ct, pos, key, avx512_subkeyHtbl, CTR_CHECK, rounds, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK, stack_offset + 16);
>    __ addl(pos, 16 * 16);
> -  __ subl(len, 32 * 16);
> +  __ subl(len, 16 * 16);
>  
>    __ cmpl(len, 32 * 16);
>    __ jcc(Assembler::below, NO_BIG_BLKS);
> @@ -3491,24 +3493,27 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
>    ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
>                                      true, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
>    __ addl(pos, 16 * 16);
> +  __ subl(len, 16 * 16);
>  
>    ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
>                                      true, false, true, false, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
>    __ evmovdquq(AAD_HASHx, ZTMP4, Assembler::AVX_512bit);
>    __ addl(pos, 16 * 16);
> -  __ subl(len, 32 * 16);
> +  __ subl(len, 16 * 16);
>    __ jmp(ENCRYPT_BIG_BLKS_NO_HXOR);
>  
>    __ bind(ENCRYPT_BIG_NBLKS);
>    ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
>                                      false, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32);
>    __ addl(pos, 16 * 16);
> +  __ subl(len, 16 * 16);
> +
>    ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, CTR_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, ADD_1234, SHUF_MASK,
>                                      false, false, true, true, true, ghashin_offset + 16, aesout_offset + 16, HashKey_16);
>  
>    __ movdqu(AAD_HASHx, ZTMP4);
>    __ addl(pos, 16 * 16);
> -  __ subl(len, 32 * 16);
> +  __ subl(len, 16 * 16);
>  
>    __ bind(NO_BIG_BLKS);
>    __ cmpl(len, 16 * 16);
> @@ -3525,9 +3530,9 @@ void StubGenerator::aesgcm_avx512(Register in, Register len, Register ct, Regist
>  
>    ghash16_avx512(false, true, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, SHUF_MASK, stack_offset, 16 * 16, 0, HashKey_16);
>    __ addl(pos, 16 * 16);
> +  __ subl(len, 16 * 16);
>  
>    __ bind(MESG_BELOW_32_BLKS);
> -  __ subl(len, 16 * 16);
>    gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, SHUF_MASK, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true);
>  
>    __ bind(GHASH_DONE);
> ```

Improving readability is a good idea; hand-rolled assembly, however, is mostly
motivated by performance. Even though a `sub` with an immediate value is fast
and takes one CPU cycle, I would side with the original author of
`aesgcm_avx512` here and combine two `sub` instructions into one whenever
possible.
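
For concreteness, here is the shape of the trade-off, using the same
macro-assembler fragments as the diff above (a sketch of the two styles, not a
drop-in patch):

```c++
// Tightly bound updates (Aleksey's proposal): pos and len move together
// after every 16-block chunk, so any early exit sees consistent state, at
// the cost of an extra subl per 32-block loop iteration.
__ addl(pos, 16 * 16);
__ subl(len, 16 * 16);
// ... second 16-block chunk ...
__ addl(pos, 16 * 16);
__ subl(len, 16 * 16);

// Combined update (original style): one subl per 32-block iteration, denser
// code, but every branch taken between the two chunks must account for len
// not having been adjusted yet.
__ addl(pos, 16 * 16);
// ... second 16-block chunk ...
__ addl(pos, 16 * 16);
__ subl(len, 32 * 16);
```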

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28363#issuecomment-3590992730
