Hi Niels,

Please let me know when you merge the code and we can work from there.
Thanks.
-Danny

________________________________
From: Niels Möller <ni...@lysator.liu.se>
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen <dt...@us.ibm.com>
Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>; George Wilson <gcwil...@us.ibm.com>
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen <dt...@us.ibm.com> writes:

> Here is the v5 patch from your comments. Please review. Thanks.

I think this looks pretty good. Maybe I should commit it on a branch and
we can iterate from there. I'll be on vacation and mostly offline next
week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>                     size_t length, uint8_t *dst, const uint8_t *src)
>  {
> +  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS,
> +                                  length, dst, src);
> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  src += done;
> +  dst += done;
>    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.

Ah, and I think you could write &ctx->key instead of the explicit cast.

> +    C load table elements
> +    li             r9,1*16
> +    li             r10,2*16
> +    li             r11,3*16
> +    lxvd2x         VSR(H1M),0,HT
> +    lxvd2x         VSR(H1L),r9,HT
> +    lxvd2x         VSR(H2M),r10,HT
> +    lxvd2x         VSR(H2L),r11,HT
> +    addi           HT, HT, 64
> +    lxvd2x         VSR(H3M),0,HT
> +    lxvd2x         VSR(H3L),r9,HT
> +    lxvd2x         VSR(H4M),r10,HT
> +    lxvd2x         VSR(H4L),r11,HT
> +
> +    li             r25,0x10
> +    li             r26,0x20
> +    li             r27,0x30
> +    li             r28,0x40
> +    li             r29,0x50
> +    li             r30,0x60
> +    li             r31,0x70

I still think there's an opportunity to reduce the number of registers
(and the corresponding load-store of callee-save registers). E.g., here
r9-r11 are used for the same thing as r25-r27.
> +.align 5
> +    C increase ctr value as input to aes_encrypt
> +    vaddudm        S1, S0, CNT1
> +    vaddudm        S2, S1, CNT1
> +    vaddudm        S3, S2, CNT1
> +    vaddudm        S4, S3, CNT1
> +    vaddudm        S5, S4, CNT1
> +    vaddudm        S6, S5, CNT1
> +    vaddudm        S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se