Hi Niels,

Here is the v5 patch, updated based on your comments. Please review.
Thanks.
-Danny

> On Feb 14, 2024, at 8:46 AM, Niels Möller <ni...@lysator.liu.se> wrote:
>
> Danny Tsen <dt...@us.ibm.com> writes:
>
>> Here is the new patch v4 for AES/GCM stitched implementation and
>> benchmark based on the current repo.
>
> Thanks. I'm not able to read it all carefully at the moment, but I have
> a few comments, see below.
>
> In the meantime, I've also tried to implement something similar for
> x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on
> the contrary, my stitched implementation seems considerably slower...
> But at least that helped me understand the higher-level issues better.
>
>> --- a/gcm-aes128.c
>> +++ b/gcm-aes128.c
>> @@ -63,14 +63,30 @@ void
>>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>  		    size_t length, uint8_t *dst, const uint8_t *src)
>>  {
>> -  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> +  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
>> +                                  _AES128_ROUNDS, &ctx->cipher.keys, length,
>> +                                  dst, src);
>
> I know I asked you to explode the context into many separate arguments
> to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
> not so nice to have so many arguments that they can't be passed in
> registers. Second, when running a fat build on a machine where the
> needed instructions are unavailable, it's a bit of a waste to have to
> spend lots of instructions on preparing those arguments for calling a
> nop function. So to reduce overhead, I'm now leaning towards an
> interface like
>
>   /* To reduce the number of arguments (e.g., maximum of 6 register
>      arguments on x86_64), pass a pointer to gcm_key, which really is a
>      pointer to the first member of the appropriate gcm_aes*_ctx
>      struct. */
>   size_t
>   _gcm_aes_encrypt (struct gcm_key *CTX,
>                     unsigned rounds,
>                     size_t size, uint8_t *dst, const uint8_t *src);
>
> That's not so pretty, but I think that is workable and efficient, and
> since it is an internal function, the interface can be changed if this
> is implemented on other architectures and we find out that it needs some
> tweaks. See
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
> for the code I wrote to accept that ctx argument.
>
> It would also be nice to have a #define around the code calling
> _gcm_aes_encrypt, so that it is compiled only if (i) we have a
> non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
> build, which may select a non-trivial implementation of _gcm_aes_encrypt
> at run time.
>
>> +  ctx->gcm.data_size += done;
>> +  length -= done;
>> +  if (length > 0) {
>
> Not sure if the check for length > 0 is needed. It is fine to call
> gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
> call with length 0, though, which may be a more common case when
> _gcm_aes_encrypt is used?
>
>> +define(`SAVE_GPR', `std $1, $2(SP)')
>> +define(`RESTORE_GPR', `ld $1, $2(SP)')
>
> I think the above two macros are unneeded; it's easier to read to use
> std and ld directly.
>
>> +define(`SAVE_VR',
>> +  `li r11, $2
>> +   stvx $1, r11, $3')
>> +define(`RESTORE_VR',
>> +  `li r11, $2
>> +   lvx $1, r11, $3')
>
> It would be nice if we could trim the use of vector registers so we
> don't need to save and restore lots of them. And if we need two
> instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
> macros that also adjust the stack pointer and don't need an additional
> register for indexing?
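For reference, the gcm-aes128.c caller with that single-pointer interface and a
compile-time guard could look roughly like the sketch below. The
HAVE_NATIVE_gcm_aes_encrypt / HAVE_NATIVE_fat_gcm_aes_encrypt names are
placeholders for whatever configure ends up defining, and the tail handling is
just one way to arrange it, not necessarily what the v5 patch does:

  void
  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
                     size_t length, uint8_t *dst, const uint8_t *src)
  {
  /* Placeholder guard names; the real HAVE_NATIVE_* symbols would come
     from configure.  */
  #if HAVE_NATIVE_gcm_aes_encrypt || HAVE_NATIVE_fat_gcm_aes_encrypt
    /* &ctx->key is the first member of the context, so the assembly
       routine can reach x, ctr and the AES subkeys at fixed offsets
       from it.  Returns the number of bytes it handled (0 in a fat
       build on a machine without the needed instructions).  */
    size_t done = _gcm_aes_encrypt (&ctx->key, _AES128_ROUNDS,
                                    length, dst, src);
    ctx->gcm.data_size += done;
    length -= done;
    dst += done;
    src += done;
  #endif
    /* Generic path handles whatever is left; calling it with length 0
       is fine.  */
    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
  }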
>
>> +  C load table elements
>> +  li             r9,1*16
>> +  li             r10,2*16
>> +  li             r11,3*16
>> +  lxvd2x         VSR(H1M),0,HT
>> +  lxvd2x         VSR(H1L),r9,HT
>> +  lxvd2x         VSR(H2M),r10,HT
>> +  lxvd2x         VSR(H2L),r11,HT
>> +  li             r9,4*16
>> +  li             r10,5*16
>> +  li             r11,6*16
>> +  li             r12,7*16
>> +  lxvd2x         VSR(H3M),r9,HT
>> +  lxvd2x         VSR(H3L),r10,HT
>> +  lxvd2x         VSR(H4M),r11,HT
>> +  lxvd2x         VSR(H4L),r12,HT
>
> I think it would be nicer to follow the style I tried to implement in my
> recent updates, using some registers (e.g., r9-r11) as offsets,
> initializing them only once, and using them everywhere. E.g., in this
> case, the loading could be
>
>   lxvd2x VSR(H1M),0,HT
>   lxvd2x VSR(H1L),r9,HT
>   lxvd2x VSR(H2M),r10,HT
>   lxvd2x VSR(H2L),r11,HT
>   addi HT, HT, 64
>   lxvd2x VSR(H3M),0,HT
>   lxvd2x VSR(H3L),r9,HT
>   lxvd2x VSR(H4M),r10,HT
>   lxvd2x VSR(H4L),r11,HT
>
>> +  C do two 4x ghash
>> +
>> +  C previous digest combining
>> +  xxlor          vs0, VSR(S0), VSR(S0)
>> +  vxor           S0,S0,D
>> +
>> +  GF_MUL(F2, R2, H3L, H3M, S1)
>> +  GF_MUL(F, R, H4L, H4M, S0)
>> +  vxor           F,F,F2
>> +  vxor           R,R,R2
>> +
>> +  xxlor          VSR(S0), vs0, vs0
>> +
>> +  GF_MUL(F3, R3, H2L, H2M, S2)
>> +  GF_MUL(F4, R4, H1L, H1M, S3)
>> +  vxor           F3,F3,F4
>> +  vxor           R3,R3,R4
>> +
>> +  vxor           F,F,F3
>> +  vxor           D,R,R3
>> +  GHASH_REDUCE(D, F, POLY_L, R2, F2)   C R2, F2 used as temporaries
>
> It may be possible to reduce the number of registers, without making the
> code slower, by accumulating differently. You ultimately accumulate the
> values into F, D before the final reduction. Maybe you don't need
> separate registers for all of F, F2, F3, F4, and all of D, R, R2, R3, R4?
>
> If you have any short-lived registers used for the AES part, they could
> also overlap with short-lived registers used for ghash.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.

_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se