Hi Niels,

More comments, please see inline.
> On Nov 21, 2023, at 1:46 PM, Danny Tsen <dt...@us.ibm.com> wrote:
>
> Hi Niels,
>
> Thanks for the quick response.
>
> I'll think more through your comments here and it may take some more
> time to get an update. And just a quick answer to 4 of your questions.
>
> 1. Depends on some special registers from the caller. This is so that I
> don't need to change the registers used in the aes_internal_encrypt and
> gf_mul_4x functions. It is a way to minimize changes to the existing
> code, but I can change that for sure. An m4 macro could be helpful here.
> 2. The reason to use gcm_encrypt is to minimize duplicate code in
> gcm_aes128..., but I can change that.
> 3. Yes, 4x blocks won't provide the same performance as 8x.
> 4. Yes, the function calls did introduce quite a lot of overhead in a
> loop. We can call gf_mul_4x from _ghash_update, but the stack handling
> has to be changed, and I tried not to change anything in _ghash_update
> since my code doesn't call _ghash_update. But I guess I can use an m4
> macro instead.
>
> Thanks.
> -Danny
> ________________________________
> From: Niels Möller <ni...@lysator.liu.se>
> Sent: Tuesday, November 21, 2023 1:07 PM
> To: Danny Tsen <dt...@us.ibm.com>
> Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>;
> George Wilson <gcwil...@us.ibm.com>
> Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with
> stitched implementation
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> This patch provides a performance improvement over AES/GCM with a
>> stitched implementation for ppc64. The code is a wrapper in assembly to
>> handle multiple 8 blocks and handle big and little endian.
>>
>> The overall improvement is based on the nettle-benchmark with ~80%
>> improvement for AES/GCM encrypt and ~86% improvement for decrypt over
>> the current baseline. The benchmark was run on a P10 machine with a
>> 3.896GHz CPU.
>
> That's a pretty nice performance improvement.
> A first round of comments below, mainly structural.
>
> (And I think attachments didn't make it to the list, possibly because
> some of them had Content-type: application/octet-stream rather than
> text/plain.)
>
>> +#if defined(__powerpc64__) || defined(__powerpc__)
>> +#define HAVE_AES_GCM_STITCH 1
>> +#endif
>
> If the C code needs to know about optional assembly functions, the
> HAVE_NATIVE tests are intended for that.
>
>> void
>> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>> const void *cipher, nettle_cipher_func *f,
>> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>> {
>> assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>>
>> +#if defined(HAVE_AES_GCM_STITCH)
>> + size_t rem_len = 0;
>> +
>> + if (length >= 128) {
>> + int rounds = 0;
>> + if (f == (nettle_cipher_func *) aes128_encrypt) {
>> + rounds = _AES128_ROUNDS;
>> + } else if (f == (nettle_cipher_func *) aes192_encrypt) {
>> + rounds = _AES192_ROUNDS;
>> + } else if (f == (nettle_cipher_func *) aes256_encrypt) {
>> + rounds = _AES256_ROUNDS;
>> + }
>> + if (rounds) {
>> + struct gcm_aes_context c;
>> + get_ctx(&c, ctx, key, cipher);
>> + _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);
>
> I think this is the wrong place for this dispatch; I think it should go
> in gcm-aes128.c, gcm-aes192.c, etc.
>> --- a/powerpc64/p8/aes-encrypt-internal.asm
>> +++ b/powerpc64/p8/aes-encrypt-internal.asm
>> @@ -52,6 +52,16 @@ define(`S5', `v7')
>> define(`S6', `v8')
>> define(`S7', `v9')
>>
>> +C re-define SRC if from _gcm_aes
>> +define(`S10', `v10')
>> +define(`S11', `v11')
>> +define(`S12', `v12')
>> +define(`S13', `v13')
>> +define(`S14', `v14')
>> +define(`S15', `v15')
>> +define(`S16', `v16')
>> +define(`S17', `v17')
>> +
>> .file "aes-encrypt-internal.asm"
>>
>> .text
>> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>> DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>>
>> subi ROUNDS,ROUNDS,1
>> +
>> + cmpdi r23, 0x5f C call from _gcm_aes
>> + beq Lx8_loop
>> +
>> srdi LENGTH,LENGTH,4
>>
>> srdi r5,LENGTH,3 #8x loop count
>> @@ -93,6 +107,9 @@ Lx8_loop:
>> lxvd2x VSR(K),0,KEYS
>> vperm K,K,K,SWAP_MASK
>>
>> + cmpdi r23, 0x5f
>> + beq Skip_load
>
> It's a little messy to have branches depending on a special register set
> by some callers. I think it would be simpler to either move the round
> loop (i.e., the loop with the label L8x_round_loop:) into a subroutine
> with all-register arguments, and call that from both
> _nettle_aes_encrypt and _nettle_gcm_aes_encrypt. Or define an m4 macro
> expanding to the body of that loop, and use that macro in both places.
>
>> --- /dev/null
>> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
>> @@ -0,0 +1,425 @@
>> +C powerpc64/p8/gcm-aes-decrypt.asm
>
>> +.macro SAVE_REGS
>> + mflr 0
>> + std 0,16(1)
>> + stdu SP,-464(SP)
>
> If macros are needed, please use m4 macros, like other nettle assembly
> code.
>
>> +.align 5
>> +Loop8x_de:
> [...]
>> + bl _nettle_aes_encrypt_ppc64
>
> I suspect this reference will break in non-fat builds?
>
>> + nop
>> +
>> + C do two 4x ghash
> [...]
>> + bl _nettle_gf_mul_4x_ppc64
>> + nop
>> +
>> + bl _nettle_gf_mul_4x_ppc64
>> + nop
>
> So the body of the main loop is one subroutine call to do 8 aes blocks,
> and two subroutine calls to do the corresponding ghash.
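To make the m4-macro alternative concrete: the round-loop body could be defined once and expanded in both files, with a macro argument keeping labels unique per expansion, so no caller-set flag register (the r23/0x5f check above) is needed. This is an illustrative sketch only; the macro name, label, and register usage are placeholders, not actual nettle code.

```m4
C Hypothetical shared macro (sketch): expand the 8-block AES round loop
C in both _nettle_aes_encrypt and _nettle_gcm_aes_encrypt, instead of
C branching on a special register set by the caller.
C $1 is a suffix that keeps the loop label unique per expansion.
define(`AES_CIPHER_8X_LOOP', `
.align 5
L8x_round_loop$1:
	lxvd2x	VSR(K),r10,KEYS		C load next round key
	vperm	K,K,K,SWAP_MASK
	vcipher	S0,S0,K
	vcipher	S1,S1,K
	vcipher	S2,S2,K
	vcipher	S3,S3,K
	vcipher	S4,S4,K
	vcipher	S5,S5,K
	vcipher	S6,S6,K
	vcipher	S7,S7,K
	addi	r10,r10,0x10
	bdnz	L8x_round_loop$1
')
C aes-encrypt-internal.asm would then expand: AES_CIPHER_8X_LOOP(`_aes')
C gcm-aes-encrypt.asm would expand:           AES_CIPHER_8X_LOOP(`_gcm')
```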
> I had expected some more instruction-level interleaving of the two
> operations. Do you think that could be beneficial, or is the
> out-of-order machinery so powerful that instruction scheduling is not
> so important?

Interleaving at the instruction level may be a good option, but due to
the PPC instruction pipeline it would need sufficient registers/vectors.
Using the same vectors to change contents in successive instructions may
require more cycles, so more vector/scalar registers would get involved
and all the vector assignments might have to change. That's the reason I
avoided it in this case.

Thanks.
-Danny

> I think this could be simpler if you define subroutines (or maybe
> macros) tailored for use from this loop, which can be reused by the
> code to do aes and ghash separately.
>
> I would also be curious whether you get something noticeably slower if
> you do only 4 blocks per loop (but if the bottleneck is the
> dependencies in the aes loop, it may be that doing 8 blocks is
> important also in this setting).
>
> For the interface between C and assembly, one could consider an
> interface that can be passed an arbitrary number of blocks, similar to
> _ghash_update. If it's too much complexity to actually do an arbitrary
> number of blocks, it could return the number of blocks done, and leave
> it to the caller (the C code) to handle the left-over.
>
>> --- a/powerpc64/p8/ghash-update.asm
>> +++ b/powerpc64/p8/ghash-update.asm
>> @@ -281,6 +281,48 @@ IF_LE(`
>> blr
>> EPILOGUE(_nettle_ghash_update)
>>
>> +C
>> +C GCM multification and reduction
>> +C All inputs depends on definitions
>> +C
>> +C .align 5
>> +C .global _nettle_gf_mul_4x
>> +C _nettle_gf_mul_4x:
>> +define(`FUNC_ALIGN', `5')
>> +PROLOGUE(_nettle_gf_mul_4x)
>> + C mflr 0
>
> Could you call this subroutine also from _ghash_update? Does function
> call overhead matter?
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
> _______________________________________________
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se