Hi Niels,

Please let me know when you've merged the code, and we can work from there.

Thanks.
-Danny
________________________________
From: Niels Möller <ni...@lysator.liu.se>
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen <dt...@us.ibm.com>
Cc: nettle-bugs@lists.lysator.liu.se <nettle-bugs@lists.lysator.liu.se>; George Wilson <gcwil...@us.ibm.com>
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen <dt...@us.ibm.com> writes:

> Here is the v5 patch based on your comments. Please review.

Thanks. I think this looks pretty good. Maybe I should commit it on a
branch and we can iterate from there. I'll be on vacation and mostly
offline next week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>                size_t length, uint8_t *dst, const uint8_t *src)
>  {
> +  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  src += done;
> +  dst += done;
>    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.

Ah, and I think you could write &ctx->key instead of the explicit
cast.
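
Something along these lines, just as a sketch (the
HAVE_NATIVE_gcm_aes_encrypt symbol and the macro name are made up for
illustration, not anything we have today):

  /* Sketch only: HAVE_NATIVE_gcm_aes_encrypt and the macro name are
     hypothetical. Expands to nothing on architectures without
     _gcm_aes_encrypt, so the generic code below is unchanged there. */
  #if HAVE_NATIVE_gcm_aes_encrypt
  # define GCM_AES_ENCRYPT_DISPATCH(ctx, rounds, length, dst, src) do {  \
      size_t done = _gcm_aes_encrypt (&(ctx)->key, (rounds),             \
                                      (length), (dst), (src));           \
      (ctx)->gcm.data_size += done;                                      \
      (length) -= done;                                                  \
      (src) += done;                                                     \
      (dst) += done;                                                     \
    } while (0)
  #else
  # define GCM_AES_ENCRYPT_DISPATCH(ctx, rounds, length, dst, src)
  #endif

  void
  gcm_aes128_encrypt (struct gcm_aes128_ctx *ctx,
                      size_t length, uint8_t *dst, const uint8_t *src)
  {
    GCM_AES_ENCRYPT_DISPATCH (ctx, _AES128_ROUNDS, length, dst, src);
    GCM_ENCRYPT (ctx, aes128_encrypt, length, dst, src);
  }

That also uses &ctx->key rather than the cast, and the same macro
would serve gcm_aes192 and gcm_aes256 as well.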

> +    C load table elements
> +    li             r9,1*16
> +    li             r10,2*16
> +    li             r11,3*16
> +    lxvd2x         VSR(H1M),0,HT
> +    lxvd2x         VSR(H1L),r9,HT
> +    lxvd2x         VSR(H2M),r10,HT
> +    lxvd2x         VSR(H2L),r11,HT
> +    addi           HT,HT,64
> +    lxvd2x         VSR(H3M),0,HT
> +    lxvd2x         VSR(H3L),r9,HT
> +    lxvd2x         VSR(H4M),r10,HT
> +    lxvd2x         VSR(H4L),r11,HT
> +
> +    li             r25,0x10
> +    li             r26,0x20
> +    li             r27,0x30
> +    li             r28,0x40
> +    li             r29,0x50
> +    li             r30,0x60
> +    li             r31,0x70

I still think there's an opportunity to reduce the number of registers
(and the corresponding load/store of callee-saved registers). E.g.,
here r9-r11 are used for the same thing as r25-r27.
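
For instance (just a sketch, with SRC standing in for whatever the
source pointer register is in the bulk loop), the loop could keep
using r9-r11 for the 0x10-0x30 offsets and set up only the remaining
four:

    C r9-r11 already hold 1*16, 2*16, 3*16 = 0x10, 0x20, 0x30,
    C so only the higher offsets need callee-saved registers:
    li             r28,0x40
    li             r29,0x50
    li             r30,0x60
    li             r31,0x70
    C loads that used r25-r27 then become:
    lxvd2x         VSR(S1),r9,SRC    C was r25
    lxvd2x         VSR(S2),r10,SRC   C was r26
    lxvd2x         VSR(S3),r11,SRC   C was r27

That would free r25-r27 entirely, saving three save/restore pairs in
the prologue and epilogue.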

> +.align 5
> +    C increase ctr value as input to aes_encrypt
> +    vaddudm S1, S0, CNT1
> +    vaddudm S2, S1, CNT1
> +    vaddudm S3, S2, CNT1
> +    vaddudm S4, S3, CNT1
> +    vaddudm S5, S4, CNT1
> +    vaddudm S6, S5, CNT1
> +    vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).
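
For example (with hypothetical CNT2 and CNT4 registers holding twice
and four times CNT1, computed once outside the loop), the chain depth
drops from 7 to 3:

    C assumes CNT2 = CNT1 + CNT1 and CNT4 = CNT2 + CNT2,
    C set up once before the loop
    vaddudm S1, S0, CNT1    C depth 1
    vaddudm S2, S0, CNT2    C depth 1
    vaddudm S3, S1, CNT2    C depth 2
    vaddudm S4, S0, CNT4    C depth 1
    vaddudm S5, S1, CNT4    C depth 2
    vaddudm S6, S2, CNT4    C depth 2
    vaddudm S7, S3, CNT4    C depth 3

Whether that's a measurable win depends on how much of the vaddudm
latency is already hidden by the surrounding AES rounds.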

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
