Hi Niels,

Here is the v5 patch, updated based on your comments. Please review.
Thanks.
-Danny

> On Feb 14, 2024, at 8:46 AM, Niels Möller <ni...@lysator.liu.se> wrote:
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> Here is the new patch v4 for AES/GCM stitched implementation and
>> benchmark based on the current repo.
>
> Thanks. I'm not able to read it all carefully at the moment, but I have
> a few comments, see below.
>
> In the meantime, I've also tried to implement something similar for
> x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on
> the contrary, my stitched implementation seems considerably slower...
> But at least that helped me understand the higher-level issues better.
>
>> --- a/gcm-aes128.c
>> +++ b/gcm-aes128.c
>> @@ -63,14 +63,30 @@ void
>>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>  		    size_t length, uint8_t *dst, const uint8_t *src)
>>  {
>> -  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> +  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
>> +                                  _AES128_ROUNDS, &ctx->cipher.keys, length,
>> +                                  dst, src);
>
> I know I asked you to explode the context into many separate arguments
> to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
> not so nice to have so many arguments that they can't be passed in
> registers. Second, when running a fat build on a machine where the
> needed instructions are unavailable, it's a bit of a waste to have to
> spend lots of instructions on preparing those arguments for calling a
> nop function. So to reduce overhead, I'm now leaning towards an
> interface like
>
>   /* To reduce the number of arguments (e.g., maximum of 6 register
>      arguments on x86_64), pass a pointer to gcm_key, which really is a
>      pointer to the first member of the appropriate gcm_aes*_ctx
>      struct. */
>   size_t
>   _gcm_aes_encrypt (struct gcm_key *CTX,
>                     unsigned rounds,
>                     size_t size, uint8_t *dst, const uint8_t *src);
>
> That's not so pretty, but I think that is workable and efficient, and
> since it is an internal function, the interface can be changed if this
> is implemented on other architectures and we find out that it needs some
> tweaks. See
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
> for the code I wrote to accept that ctx argument.
>
> It would also be nice to have a #define around the code calling
> _gcm_aes_encrypt, so that it is compiled only if (i) we have a
> non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
> build, which may select a non-trivial implementation of _gcm_aes_encrypt
> at run time.
>
>> +  ctx->gcm.data_size += done;
>> +  length -= done;
>> +  if (length > 0) {
>
> Not sure if the check for length > 0 is needed. It is fine to call
> gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
> call with length 0, though, which may be a more common case when
> _gcm_aes_encrypt is used?
>
>> +define(`SAVE_GPR', `std $1, $2(SP)')
>> +define(`RESTORE_GPR', `ld $1, $2(SP)')
>
> I think the above two macros are unneeded; it's easier to read to use
> std and ld directly.
>
>> +define(`SAVE_VR',
>> +  `li r11, $2
>> +   stvx $1, r11, $3')
>> +define(`RESTORE_VR',
>> +  `li r11, $2
>> +   lvx $1, r11, $3')
>
> It would be nice if we could trim the use of vector registers so we
> don't need to save and restore lots of them. And if we need two
> instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
> macros that also adjust the stack pointer and don't need an additional
> register for indexing?
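For reference, the gcm-aes128.c caller with that single-pointer interface and a
compile-time guard could look roughly like the sketch below. The
HAVE_NATIVE_gcm_aes_encrypt / HAVE_NATIVE_fat_gcm_aes_encrypt names are
placeholders for whatever configure ends up defining, and the tail handling is
just one way to arrange it, not necessarily what the v5 patch does:

  void
  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                     size_t length, uint8_t *dst, const uint8_t *src)
  {
  /* Placeholder guard names; the real HAVE_NATIVE_* symbols would come
     from configure.  */
  #if HAVE_NATIVE_gcm_aes_encrypt || HAVE_NATIVE_fat_gcm_aes_encrypt
    /* &ctx->key is the first member of the context, so the assembly
       routine can reach x, ctr and the AES subkeys at fixed offsets
       from it.  Returns the number of bytes it handled (0 in a fat
       build on a machine without the needed instructions).  */
    size_t done = _gcm_aes_encrypt (&ctx->key, _AES128_ROUNDS,
                                    length, dst, src);
    ctx->gcm.data_size += done;
    length -= done;
    dst += done;
    src += done;
  #endif
    /* Generic path handles whatever is left; calling it with length 0
       is fine.  */
    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  }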
>
>> +  C load table elements
>> +  li             r9,1*16
>> +  li             r10,2*16
>> +  li             r11,3*16
>> +  lxvd2x         VSR(H1M),0,HT
>> +  lxvd2x         VSR(H1L),r9,HT
>> +  lxvd2x         VSR(H2M),r10,HT
>> +  lxvd2x         VSR(H2L),r11,HT
>> +  li             r9,4*16
>> +  li             r10,5*16
>> +  li             r11,6*16
>> +  li             r12,7*16
>> +  lxvd2x         VSR(H3M),r9,HT
>> +  lxvd2x         VSR(H3L),r10,HT
>> +  lxvd2x         VSR(H4M),r11,HT
>> +  lxvd2x         VSR(H4L),r12,HT
>
> I think it would be nicer to follow the style I tried to implement in my
> recent updates, using some registers (e.g., r9-r11) as offsets,
> initializing them only once, and using them everywhere. E.g., in this
> case, the loading could be
>
>   lxvd2x VSR(H1M),0,HT
>   lxvd2x VSR(H1L),r9,HT
>   lxvd2x VSR(H2M),r10,HT
>   lxvd2x VSR(H2L),r11,HT
>   addi HT, HT, 64
>   lxvd2x VSR(H3M),0,HT
>   lxvd2x VSR(H3L),r9,HT
>   lxvd2x VSR(H4M),r10,HT
>   lxvd2x VSR(H4L),r11,HT
>
>> +  C do two 4x ghash
>> +
>> +  C previous digest combining
>> +  xxlor          vs0, VSR(S0), VSR(S0)
>> +  vxor           S0,S0,D
>> +
>> +  GF_MUL(F2, R2, H3L, H3M, S1)
>> +  GF_MUL(F, R, H4L, H4M, S0)
>> +  vxor           F,F,F2
>> +  vxor           R,R,R2
>> +
>> +  xxlor          VSR(S0), vs0, vs0
>> +
>> +  GF_MUL(F3, R3, H2L, H2M, S2)
>> +  GF_MUL(F4, R4, H1L, H1M, S3)
>> +  vxor           F3,F3,F4
>> +  vxor           R3,R3,R4
>> +
>> +  vxor           F,F,F3
>> +  vxor           D,R,R3
>> +  GHASH_REDUCE(D, F, POLY_L, R2, F2)   C R2, F2 used as temporaries
>
> It may be possible to reduce the number of registers, without making the
> code slower, by accumulating differently. You ultimately accumulate the
> values into F, D before the final reduction. Maybe you don't need
> separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
>
> If you have any short-lived registers used for the AES part, they could
> also overlap with short-lived registers used for ghash.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se