Hi Niels,

More comments, please see inline.
> On Nov 21, 2023, at 1:46 PM, Danny Tsen <dt...@us.ibm.com> wrote:
>
> Hi Niels,
>
> Thanks for the quick response.
>
> I'll think more through your comments here and it may take some more
> time to get an update. And just a quick answer to 4 of your questions.
>
> 1. Depends on some special registers from the caller. This is so that I
> don't need to change the registers used in the aes_internal_encrypt and
> gf_mul_4x functions. It is a way to minimize changes to the existing
> code, but I can change that for sure. An m4 macro could be helpful here.
> 2. The reason to use gcm_encrypt is to minimize duplicate code in
> gcm_aes128..., but I can change that.
> 3. Yes, 4x blocks won't provide the same performance as 8x.
> 4. Yes, the function calls did introduce quite a lot of overhead in a
> loop. We can call gf_mul_4x from _ghash_update, but the stack handling
> has to be changed, and I tried not to change anything in _ghash_update
> since my code doesn't call _ghash_update. But I guess I can use an m4
> macro instead.
>
> Thanks.
> -Danny
> ________________________________
> From: Niels Möller <ni...@lysator.liu.se>
> Sent: Tuesday, November 21, 2023 1:07 PM
> To: Danny Tsen <dt...@us.ibm.com>
> Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>;
> George Wilson <gcwil...@us.ibm.com>
> Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with
> stitched implementation
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> This patch provides a performance improvement over AES/GCM with a
>> stitched implementation for ppc64. The code is a wrapper in assembly to
>> handle multiple 8 blocks and handle big and little endian.
>>
>> The overall improvement is based on the nettle-benchmark with ~80%
>> improvement for AES/GCM encrypt and ~86% improvement for decrypt over
>> the current baseline. The benchmark was run on a P10 machine with a
>> 3.896GHz CPU.
>
> That's a pretty nice performance improvement.
> A first round of comments below, mainly structural.
>
> (And I think attachments didn't make it to the list, possibly because
> some of them had Content-type: application/octet-stream rather than
> text/plain.)
>
>> +#if defined(__powerpc64__) || defined(__powerpc__)
>> +#define HAVE_AES_GCM_STITCH 1
>> +#endif
>
> If the C code needs to know about optional assembly functions, the
> HAVE_NATIVE tests are intended for that.
>
>> void
>> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>> const void *cipher, nettle_cipher_func *f,
>> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>> {
>> assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>>
>> +#if defined(HAVE_AES_GCM_STITCH)
>> + size_t rem_len = 0;
>> +
>> + if (length >= 128) {
>> + int rounds = 0;
>> + if (f == (nettle_cipher_func *) aes128_encrypt) {
>> + rounds = _AES128_ROUNDS;
>> + } else if (f == (nettle_cipher_func *) aes192_encrypt) {
>> + rounds = _AES192_ROUNDS;
>> + } else if (f == (nettle_cipher_func *) aes256_encrypt) {
>> + rounds = _AES256_ROUNDS;
>> + }
>> + if (rounds) {
>> + struct gcm_aes_context c;
>> + get_ctx(&c, ctx, key, cipher);
>> + _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);
>
> I think this is the wrong place for this dispatch; I think it should go
> in gcm-aes128.c, gcm-aes192.c, etc.
>> --- a/powerpc64/p8/aes-encrypt-internal.asm
>> +++ b/powerpc64/p8/aes-encrypt-internal.asm
>> @@ -52,6 +52,16 @@ define(`S5', `v7')
>> define(`S6', `v8')
>> define(`S7', `v9')
>>
>> +C re-define SRC if from _gcm_aes
>> +define(`S10', `v10')
>> +define(`S11', `v11')
>> +define(`S12', `v12')
>> +define(`S13', `v13')
>> +define(`S14', `v14')
>> +define(`S15', `v15')
>> +define(`S16', `v16')
>> +define(`S17', `v17')
>> +
>> .file "aes-encrypt-internal.asm"
>>
>> .text
>> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>> DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>>
>> subi ROUNDS,ROUNDS,1
>> +
>> + cmpdi r23, 0x5f C call from _gcm_aes
>> + beq Lx8_loop
>> +
>> srdi LENGTH,LENGTH,4
>>
>> srdi r5,LENGTH,3 #8x loop count
>> @@ -93,6 +107,9 @@ Lx8_loop:
>> lxvd2x VSR(K),0,KEYS
>> vperm K,K,K,SWAP_MASK
>>
>> + cmpdi r23, 0x5f
>> + beq Skip_load
>
> It's a little messy to have branches depending on a special register set
> by some callers. I think it would be simpler to either move the round
> loop (i.e., the loop with the label L8x_round_loop:) into a subroutine
> with all-register arguments, and call that from both
> _nettle_aes_encrypt and _nettle_gcm_aes_encrypt. Or define an m4 macro
> expanding to the body of that loop, and use that macro in both places.
>
>> --- /dev/null
>> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
>> @@ -0,0 +1,425 @@
>> +C powerpc64/p8/gcm-aes-decrypt.asm
>
>> +.macro SAVE_REGS
>> + mflr 0
>> + std 0,16(1)
>> + stdu SP,-464(SP)
>
> If macros are needed, please use m4 macros, like other nettle assembly
> code.
>
>> +.align 5
>> +Loop8x_de:
> [...]
>> + bl _nettle_aes_encrypt_ppc64
>
> I suspect this reference will break in non-fat builds?
>
>> + nop
>> +
>> + C do two 4x ghash
> [...]
>> + bl _nettle_gf_mul_4x_ppc64
>> + nop
>> +
>> + bl _nettle_gf_mul_4x_ppc64
>> + nop
>
> So the body of the main loop is one subroutine call to do 8 aes blocks,
> and two subroutine calls to do the corresponding ghash.
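To make the m4-macro alternative concrete: the round-loop body could be defined once and expanded in both files, with a macro argument keeping labels unique per expansion, so no caller-set flag register (the r23/0x5f check above) is needed. This is an illustrative sketch only; the macro name, label, and register usage are placeholders, not actual nettle code.

```m4
C Hypothetical shared macro (sketch): expand the 8-block AES round loop
C in both _nettle_aes_encrypt and _nettle_gcm_aes_encrypt, instead of
C branching on a special register set by the caller.
C $1 is a suffix that keeps the loop label unique per expansion.
define(`AES_CIPHER_8X_LOOP', `
.align 5
L8x_round_loop$1:
	lxvd2x	VSR(K),r10,KEYS		C load next round key
	vperm	K,K,K,SWAP_MASK
	vcipher	S0,S0,K
	vcipher	S1,S1,K
	vcipher	S2,S2,K
	vcipher	S3,S3,K
	vcipher	S4,S4,K
	vcipher	S5,S5,K
	vcipher	S6,S6,K
	vcipher	S7,S7,K
	addi	r10,r10,0x10
	bdnz	L8x_round_loop$1
')
C aes-encrypt-internal.asm would then expand: AES_CIPHER_8X_LOOP(`_aes')
C gcm-aes-encrypt.asm would expand:           AES_CIPHER_8X_LOOP(`_gcm')
```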
> I had expected some more instruction-level interleaving of the two
> operations. Do you think that could be beneficial, or is the
> out-of-order machinery so powerful that instruction scheduling is not
> so important?

Interleaving at the instruction level may be a good option, but due to
the PPC instruction pipeline it would need sufficient registers/vectors.
Using the same vectors to change contents in successive instructions may
require more cycles, so more vector/scalar registers would get involved
and all the vector assignments might have to change. That's the reason I
avoided it in this case.

Thanks.
-Danny

> I think this could be simpler if you define subroutines (or maybe
> macros) tailored for use from this loop, which can be reused by the
> code to do aes and ghash separately.
>
> I would also be curious whether you get something noticeably slower if
> you do only 4 blocks per loop (but if the bottleneck is the
> dependencies in the aes loop, it may be that doing 8 blocks is
> important also in this setting).
>
> For the interface between C and assembly, one could consider an
> interface that can be passed an arbitrary number of blocks, similar to
> _ghash_update. If it's too much complexity to actually do an arbitrary
> number of blocks, it could return the number of blocks done, and leave
> it to the caller (the C code) to handle the left-over.
>
>> --- a/powerpc64/p8/ghash-update.asm
>> +++ b/powerpc64/p8/ghash-update.asm
>> @@ -281,6 +281,48 @@ IF_LE(`
>> blr
>> EPILOGUE(_nettle_ghash_update)
>>
>> +C
>> +C GCM multification and reduction
>> +C All inputs depends on definitions
>> +C
>> +C .align 5
>> +C .global _nettle_gf_mul_4x
>> +C _nettle_gf_mul_4x:
>> +define(`FUNC_ALIGN', `5')
>> +PROLOGUE(_nettle_gf_mul_4x)
>> + C mflr 0
>
> Could you call this subroutine also from _ghash_update? Does function
> call overhead matter?
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
> _______________________________________________
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se