> On Dec 11, 2023, at 10:32 AM, Niels Möller <ni...@lysator.liu.se> wrote:
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> Here is version 2 of the AES/GCM stitched patch. The stitched code is
>> all in assembly, and m4 macros are used. The overall performance
>> improved by around 110% and 120% for encrypt and decrypt, respectively.
>> Please see the attached patch and aes benchmark.
>
> Thanks, comments below.
>
>> --- a/fat-ppc.c
>> +++ b/fat-ppc.c
>> @@ -226,6 +231,8 @@ fat_init (void)
>>           _nettle_ghash_update_arm64() */
>>        _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
>>        _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
>> +      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
>> +      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
>>      }
>>    else
>>      {
>
> Fat setup is a bit tricky; here it looks like
> _nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I
> would suspect that breaks when the extensions aren't available. You can
> test that with NETTLE_FAT_OVERRIDE=none.

Sure. I'll test and fix it.
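A minimal sketch of what the else branch might look like, assuming
generic C fallbacks with the _c suffix that fat builds use elsewhere
(the gcm_aes _c names below are hypothetical, not existing functions):

  else
    {
      _nettle_ghash_set_key_vec = _nettle_ghash_set_key_c;
      _nettle_ghash_update_vec = _nettle_ghash_update_c;
      /* Assign the new pointers too, so they are never left undefined
         when the crypto extensions are unavailable.  These _c fallback
         names are hypothetical. */
      _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_c;
      _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_c;
    }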
>> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>                    size_t length, uint8_t *dst, const uint8_t *src)
>> {
>> +#if defined(HAVE_NATIVE_AES_GCM_STITCH)
>> +  if (length >= 128) {
>> +    PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
>> +    if (length == 0) {
>> +      return;
>> +    }
>> +  }
>> +#endif /* HAVE_NATIVE_AES_GCM_STITCH */
>> +
>>    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> }
>
> In a non-fat build, it's fine to use a compile-time #if to select
> whether the optimized code should be called. But in a fat build, we'd
> need a valid function in all cases, doing different things depending on
> the runtime fat initialization. One could do that with two versions of
> gcm_aes128_encrypt (which is likely preferable if we do something
> similar for other archs that have separate assembly for aes128, aes192,
> etc). Or we would need to call some function unconditionally, which
> would be a nop if the extensions are not available. E.g., do something
> like
>
> #if HAVE_NATIVE_fat_aes_gcm_encrypt
> void
> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>                    size_t length, uint8_t *dst, const uint8_t *src)
> {
>   size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
>                                   _AES128_ROUNDS, &ctx->cipher.keys,
>                                   length, dst, src);
>   ctx->gcm.data_size += done;
>   length -= done;
>   if (length > 0)
>     {
>       src += done;
>       dst += done;
>       GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>     }
> }
> #endif
>
> where the C implementation of _gcm_aes_encrypt could just return 0.
>
> And it's preferable that the same interface could be used on other
> archs, even if they don't do 8 blocks at a time like your ppc code.
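For reference, a minimal sketch of that no-op C fallback; the signature
is an assumption here, based on the call in the example above:

  size_t
  _gcm_aes_encrypt (const struct gcm_key *key, union nettle_block16 *x,
                    union nettle_block16 *ctr, unsigned rounds,
                    const uint32_t *keys, size_t length,
                    uint8_t *dst, const uint8_t *src)
  {
    /* No stitched implementation: report zero bytes processed, so the
       caller falls through to the generic GCM_ENCRYPT path. */
    return 0;
  }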
>> --- a/gcm.h
>> +++ b/gcm.h
>> @@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key *key,
>>         (nettle_cipher_func *) (encrypt), \
>>         (length), (digest)))
>>
>> +#if defined(HAVE_NATIVE_AES_GCM_STITCH)
>> +#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
>> +#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
>> +void
>> +_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
>> +                      size_t len, uint8_t *dst, const uint8_t *src);
>> +void
>> +_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
>> +                      size_t len, uint8_t *dst, const uint8_t *src);
>> +struct ppc_gcm_aes_context {
>> +  uint8_t *x;
>> +  uint8_t *htable;
>> +  struct aes128_ctx *rkeys;
>> +};
>> +#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
>> +  { \
>> +    (gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x; \
>> +    (gcm_aes_ctx)->htable = (uint8_t *) (key); \
>> +    (gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys; \
>> +  }
>> +
>> +#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
>> +  { \
>> +    size_t rem_len = 0; \
>> +    struct ppc_gcm_aes_context c; \
>> +    struct gcm_ctx *gctx = &(ctx)->gcm; \
>> +    GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
>> +    if ((encrypt)) { \
>> +      _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
>> +    } else { \
>> +      _ppc_gcm_aes_decrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
>> +    } \
>> +    rem_len = (length) % (GCM_BLOCK_SIZE * 8); \
>> +    (length) -= rem_len; \
>> +    gctx->data_size += (length); \
>> +    (dst) += (length); \
>> +    (src) += (length); \
>> +    (length) = rem_len; \
>> +  }
>> +#endif /* HAVE_NATIVE_AES_GCM_STITCH */
>
> This looks a little awkward. I think it would be better to pass the
> various pointers needed by the assembly implementation as separate
> (register) arguments. Or pass the pointer to the struct gcm_aesxxx_ctx
> directly (with the disadvantage that the assembly code needs to know
> the corresponding offsets).

I'll see which is the better option.

>> --- a/powerpc64/machine.m4
>> +++ b/powerpc64/machine.m4
>> @@ -63,3 +63,40 @@ C INC_VR(VR, INC)
>>  define(`INC_VR',`ifelse(substr($1,0,1),`v',
>>    ``v'eval($2+substr($1,1,len($1)))',
>>    `eval($2+$1)')')
>> +
>> +C Adding state and round key 0
>> +C XOR_4RK0(state, state, rkey0)
>> +define(`XOR_4RK0',
>> +  `vxor $1, $1, $5
>> +   vxor $2, $2, $5
>> +   vxor $3, $3, $5
>> +   vxor $4, $4, $5')
>> +
>> +C Do 4 vcipher/vcipherlast
>> +C VCIPHER4(vcipher/vcipherlast, state, state, rkey)
>> +define(`VCIPHER4',
>> +  `$1 $2, $2, $6
>> +   $1 $3, $3, $6
>> +   $1 $4, $4, $6
>> +   $1 $5, $5, $6')
>
> I think this could be generalized to an OP_4WAY macro, used as
>
>   OP_4WAY(vxor/vcipher/vcipherlast, a, b, c, d, k)
>
> One could also consider generalizing it to an arbitrary number of
> registers with an m4 loop, to have
>
>   OP_NWAY(op, k, a, b, ..., x)
>
> expand to
>
>   op a, a, k
>   op b, b, k
>   ...
>   op x, x, k
>
> But that may be overkill if only 4-way and 8-way are used.

This is only for ppc, not for other architectures, and it's defined in
machine.m4. So I don't see the point.

>> +C Adding multiplication product
>> +C ADD_PROD(c1, c2, a, b)
>> +define(`ADD_PROD',
>> +  `vxor $1,$1,$3
>> +   vxor $2,$2,$4')
>
> Maybe rename GF_ADD; ADD_PROD is not so specific.
>
>> +C GF multiplication of L/M and data
>> +C GF_MUL(F, R, HL, HM, S)
>> +define(`GF_MUL',
>> +  `vpmsumd $1,$3,$5
>> +   vpmsumd $2,$4,$5')
>
> Looks like GF_MUL is only used in the pattern GF_MUL; GF_MUL;
> ADD_PROD? So one could perhaps combine them into one macro, with a
> comment saying which operation it performs.

This is to keep each macro more flexible to use, so I'd rather not
combine them.

>> --- /dev/null
>> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> [...]
>
>> +define(`SAVE_GPR', `std $1, $2(SP)')
>> +define(`RESTORE_GPR', `ld $1, $2(SP)')
>
> I don't think these macros add much readability. One could possibly
> have macros that take a range of registers, but I'm not sure that's
> worth the effort.

Probably not.

>> +.align 5
>> +L8x_round_loop1:
>> +  lxvd2x VSR(K),r11,RK
>> +  vperm K,K,K,LE_MASK
>> +  VCIPHER4(vcipher, S0, S1, S2, S3, K)
>> +  VCIPHER4(vcipher, S4, S5, S6, S7, K)
>> +  addi r11,r11,0x10
>> +  bdnz L8x_round_loop1
>> +
>> +  lxvd2x VSR(K),r11,RK
>> +  vperm K,K,K,LE_MASK
>> +  VCIPHER4(vcipherlast, S0, S1, S2, S3, K)
>> +  VCIPHER4(vcipherlast, S4, S5, S6, S7, K)
>> +
>> +  cmpdi LOOP, 0
>> +  beq do_ghash
>> +
>> +.align 5
>> +Loop8x_de:
>
> Is there a good reason why you have another copy of the decrypt round
> loop before the main loop?

I tested both. Yes, this improves performance by around 5-6%.

Thanks.
-Danny

> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.