Re: ppc64 micro optimization

2024-03-05 Thread Danny Tsen
Hi Niels,

My fault.  I did not include gcm-aes.c in the patch.  Here is the 
updated patch.  Please apply this one and we can work from there.

Thanks.
-Danny


> On Mar 5, 2024, at 1:08 PM, Niels Möller  wrote:
>
> Danny Tsen  writes:
>
>> Please let me know when you merge the code and we can work from there.
>
> Hi, I tried to apply and build with the v5 patch, and noticed some problems.
>
> Declaration of _gcm_aes_encrypt / _gcm_aes_decrypt is missing. It can go
> in gcm-internal.h, like on this branch,
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.h?ref_type=heads
> Corresponding name mangling defines should also be in gcm-internal.h,
> not in the installed gcm.h header.
>
> The file gcm-aes.c was missing in the patch. If the dummy C versions of
> _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the
> definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy
> function for both encrypt and decrypt).
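For illustration, such a shared dummy might look like the following sketch (the
function name is a placeholder, and it assumes the struct gcm_key * / rounds /
size / dst / src interface discussed in this thread, not the final code):

#include <stddef.h>
#include <stdint.h>

struct gcm_key;  /* only used through a pointer here */

/* Do-nothing fallback for fat builds without the POWER crypto extensions:
   reports zero bytes processed, so the caller falls through to the generic
   GCM code.  The same function can back both encrypt and decrypt. */
static size_t
_gcm_aes_crypt_none (struct gcm_key *key, unsigned rounds,
                     size_t size, uint8_t *dst, const uint8_t *src)
{
  (void) key; (void) rounds; (void) size; (void) dst; (void) src;
  return 0;
}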
>
> It would also be nice if you could check that the new code is used
> and working in a non-fat build, configured with --disable-fat
> --enable-power-crypto-ext.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



RE: ppc64 micro optimization

2024-02-26 Thread Danny Tsen
Hi Niels,

Please let me know when you merge the code and we can work from there.

Thanks.
-Danny

From: Niels Möller 
Sent: Friday, February 23, 2024 1:07 AM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen  writes:

> Here is the v5 patch from your comments.  Please review.

Thanks. I think this looks pretty good. Maybe I should commit it on a
branch and we can iterate from there. I'll be on vacation and mostly
offline next week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>size_t length, uint8_t *dst, const uint8_t *src)
>  {
> +  size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
> +  ctx->gcm.data_size += done;
> +  length -= done;
> +  src += done;
> +  dst += done;
>GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.

Ah, and I think you could write &ctx->key instead of the explicit
cast.
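To make the two suggestions above concrete, a dispatch macro along these lines
could sit next to the _gcm_aes_encrypt declaration (a sketch only; the
HAVE_NATIVE_gcm_aes_encrypt symbol and the macro name are illustrative, not the
final interface):

/* Hypothetical helper for gcm-aes128.c and friends; expands to nothing
   when no native _gcm_aes_encrypt exists, so other builds are unchanged. */
#if HAVE_NATIVE_gcm_aes_encrypt
# define TRY_GCM_AES_ENCRYPT(ctx, rounds, length, dst, src)         \
  do {                                                              \
    size_t done_ = _gcm_aes_encrypt (&(ctx)->key, (rounds),         \
                                     (length), (dst), (src));       \
    (ctx)->gcm.data_size += done_;                                  \
    (length) -= done_;                                              \
    (src) += done_;                                                 \
    (dst) += done_;                                                 \
  } while (0)
#else
# define TRY_GCM_AES_ENCRYPT(ctx, rounds, length, dst, src) do { } while (0)
#endif

gcm_aes128_encrypt() would then start with
TRY_GCM_AES_ENCRYPT (ctx, _AES128_ROUNDS, length, dst, src); before falling
through to GCM_ENCRYPT, and gcm_aes192/256 would do the same with their round
counts.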

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +addi HT, HT, 64
> +lxvd2x VSR(H3M),0,HT
> +lxvd2x VSR(H3L),r9,HT
> +lxvd2x VSR(H4M),r10,HT
> +lxvd2x VSR(H4L),r11,HT
> +
> +li r25,0x10
> +li r26,0x20
> +li r27,0x30
> +li r28,0x40
> +li r29,0x50
> +li r30,0x60
> +li r31,0x70

I still think there's an opportunity to reduce the number of registers (and
the corresponding load-store of callee-save registers). E.g., here r9-r11 are
used for the same thing as r25-r27.

> +.align 5
> +C increase ctr value as input to aes_encrypt
> +vaddudm S1, S0, CNT1
> +vaddudm S2, S1, CNT1
> +vaddudm S3, S2, CNT1
> +vaddudm S4, S3, CNT1
> +vaddudm S5, S4, CNT1
> +vaddudm S6, S5, CNT1
> +vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-02-20 Thread Danny Tsen
Hi Niels,

Here is the v5 patch from your comments.  Please review.

Thanks.
-Danny


> On Feb 14, 2024, at 8:46 AM, Niels Möller  wrote:
>
> Danny Tsen  writes:
>
>> Here is the new patch v4 for AES/GCM stitched implementation and
>> benchmark based on the current repo.
>
> Thanks. I'm not able to read it all carefully at the moment, but I have
> a few comments, see below.
>
> In the mean time, I've also tried to implement something similar for
> x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup, to
> the contrary, my stitched implementation seems considerably slower...
> But at least that helped me understand the higher-level issues better.
>
>> --- a/gcm-aes128.c
>> +++ b/gcm-aes128.c
>> @@ -63,14 +63,30 @@ void
>> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>   size_t length, uint8_t *dst, const uint8_t *src)
>> {
>> -  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> +  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
>> +  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
>
> I know I asked you to explode the context into many separate arguments
> to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
> not so nice to have so many arguments that they can't be passed in
> registers. Second, when running a fat build on a machine where the
> needed instructions are unavailable, it's a bit of a waste to have to
> spend lots of instructions on preparing those arguments for calling a
> nop function. So to reduce overhead, I'm now leaning towards an
> interface like
>
>  /* To reduce the number of arguments (e.g., maximum of 6 register
> arguments on x86_64), pass a pointer to gcm_key, which really is a
> pointer to the first member of the appropriate gcm_aes*_ctx
> struct. */
>  size_t
>  _gcm_aes_encrypt (struct gcm_key *CTX,
>unsigned rounds,
>size_t size, uint8_t *dst, const uint8_t *src);
>
> That's not so pretty, but I think that is workable and efficient, and
> since it is an internal function, the interface can be changed if this
> is implemented on other architectures and we find out that it needs some
> tweaks. See
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
> for the code I wrote to accept that ctx argument.
>
> It would also be nice to have a #define around the code calling
> _gcm_aes_encrypt, so that it is compiled only if (i) we have an
> non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
> build, which may select a non-trivial implementation of _gcm_aes_encrypt
> at run time.
>
>> +  ctx->gcm.data_size += done;
>> +  length -= done;
>> +  if (length > 0) {
>
> Not sure if the check for length > 0 is needed. It is fine to call
> gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
> call with length 0, though, which may be a more common case when
> _gcm_aes_encrypt is used?
>
>> +define(`SAVE_GPR', `std $1, $2(SP)')
>> +define(`RESTORE_GPR', `ld $1, $2(SP)')
>
> I think the above two macros are unneeded, it's easier to read to use
> std and ld directly.
>
>> +define(`SAVE_VR',
>> +  `li r11, $2
>> +   stvx $1, r11, $3')
>> +define(`RESTORE_VR',
>> +  `li r11, $2
>> +   lvx $1, r11, $3')
>
> It would be nice if we could trim the use of vector registers so we
> don't need to save and restore lots of them. And if we need two
> instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
> that also adjusts the stack pointer, and doesn't need to use an additional
> register for indexing?
>
>> +C load table elements
>> +li r9,1*16
>> +li r10,2*16
>> +li r11,3*16
>> +lxvd2x VSR(H1M),0,HT
>> +lxvd2x VSR(H1L),r9,HT
>> +lxvd2x VSR(H2M),r10,HT
>> +lxvd2x VSR(H2L),r11,HT
>> +li r9,4*16
>> +li r10,5*16
>> +li r11,6*16
>> +li r12,7*16
>> +lxvd2x VSR(H3M),r9,HT
>> +lxvd2x VSR(H3L),r10,HT
>> +lxvd2x VSR(H4M),r11,HT
>> +lxvd2x VSR(H4L),r12,HT
>
> I think it would be nicer to follow the style I tried to implement in my
> recent updates, using some registers (e.g., r9-r11) as offsets,
> initializing them only once, and using everywhere. E.g., in this case,
> the loading could be
>
>lxvd2x VSR(H1M),0,HT
>lxvd2x VSR(H1L),r9,HT
>lxvd2x VSR(H2

Re: ppc64 micro optimization

2024-02-03 Thread Danny Tsen
Hi Niels,

Here is the new patch v4 for AES/GCM stitched implementation and benchmark 
based on the current repo.

Thanks.
-Danny



> On Jan 31, 2024, at 4:35 AM, Niels Möller  wrote:
>
> Niels Möller  writes:
>
>> While the powerpc64 vncipher instruction really wants the original
>> subkeys, not transformed. So on power, it would be better to have a
>> _nettle_aes_invert that is essentially a memcpy, and then the aes
>> decrypt assembly code could be reworked without the xors, and run at exactly
>> the same speed as encryption.
>
> I've tried this out, see branch
> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert . It
> appears to give the desired improvement in aes decrypt speed, making it
> run at the same speed as aes encrypt. Which is a speedup of about 80%
> when benchmarked on power10 (the cfarm120 machine).
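As a rough sketch of what an essentially-memcpy invert looks like (assuming the
(rounds, dst, src) signature and the usual 4 words per subkey; the real change
is on the ppc64-aes-invert branch and may differ in detail):

#include <stdint.h>
#include <string.h>

/* On POWER, vncipher wants the original encryption subkeys, so "inverting"
   the key schedule reduces to copying rounds + 1 subkeys of 4 words each. */
void
_nettle_aes_invert (unsigned rounds, uint32_t *dst, const uint32_t *src)
{
  memcpy (dst, src, (rounds + 1) * 4 * sizeof (uint32_t));
}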
>
>> Current _nettle_aes_invert also changes the order of the subkeys, with
>> a FIXME comment suggesting that it would be better to update the order
>> keys are accessed in the aes decryption functions.
>
> I've merged the changes to keep subkey order the same for encrypt and
> decrypt (so that the decrypt round loop uses subkeys starting at the end
> of the array), which affects all aes implementations except s390x, which
> doesn't need any subkey expansion. But I've deleted the sparc32 assembly
> rather than updating it.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



RE: ppc64 micro optimization

2024-01-24 Thread Danny Tsen
Hi Niels,

This may take a while before I digest the changes and redo the work.  It 
could be another 2 weeks.  It may take longer because I will be taking time off 
and will be out of the country.  Will do my best.

Thanks.
-Danny

From: Niels Möller 
Sent: Thursday, January 25, 2024 3:58 AM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen  writes:

> Thanks for merging the stitched implementation for PPC64 with your
> detailed information and efforts.

We're not quite there yet, though. Do you think you could rebase your
work on top of recent changes? Sorry about conflicts, but I think new
macros should fit well with what you need (feel free to have additional
macros, where you find that useful).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ppc64 micro optimization

2024-01-22 Thread Danny Tsen
Hi Niels,

Thanks for merging the stitched implementation for PPC64 with your detailed 
information and efforts.

Thanks.
-Danny

> On Jan 21, 2024, at 11:27 PM, Niels Möller  wrote:
> 
> In preparing for merging the gcm-aes "stitched" implementation, I'm
> reviewing the existing ghash code. WIP branch "ppc-ghash-macros".
> 
> I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides
> that, I've been able to improve scheduling of the reduction instructions
> (adding in the result of vpmsumd last seems to improve parallelism, some
> 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also
> streamlined the way load offsets are used, and trimmed the number of
> needed vector registers slightly.
> 
> For the AES code, I've merged the new macros (I settled on the names
> OPN_XXY and OPN_XXXY), no change in speed expected from that change.
> 
> I've also tried to understand the difference between AES encrypt and
> decrypt, where decrypt is much slower, and uses an extra xor instruction
> in the round loop. I think the reason for that is that other AES
> implementations (including x86_64 and arm64 instructions, and Nettle's C
> implementation) expect the decryption subkeys to be transformed via the
> AES "MIX_COLUMN" operation, see
> https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_type=heads#L163
>  
> 
> While the powerpc64 vncipher instruction really wants the original
> subkeys, not transformed. So on power, it would be better to have a
> _nettle_aes_invert that is essentially a memcpy, and then the aes
> decrypt assembly code could be reworked without the xors, and run at exactly
> the same speed as encryption. Current _nettle_aes_invert also changes
> the order of the subkeys, with a FIXME comment suggesting that it would
> be better to update the order keys are accessed in the aes decryption
> functions.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
> 



Re: ppc64: v3: AES/GCM Performance improvement with stitched implementation

2023-12-18 Thread Danny Tsen
Hi Niels,

Here is another revised patch, tested with NETTLE_FAT_OVERRIDE.  Same 
performance as the last version.  I also added a new test vector with 917 bytes 
for the AES/GCM tests, to cover multiple blocks plus a partial block.  Attached 
are the patch and the AES benchmark.

Thanks.
-Danny


On Dec 12, 2023, at 9:01 AM, Danny Tsen  wrote:



On Dec 11, 2023, at 10:32 AM, Niels Möller  wrote:

Danny Tsen  writes:

Here is the version 2 for AES/GCM stitched patch. The stitched code is
in all assembly and m4 macros are used. The overall performance
improved around ~110% and 120% for encrypt and decrypt respectively.
Please see the attached patch and aes benchmark.

Thanks, comments below.

--- a/fat-ppc.c
+++ b/fat-ppc.c
@@ -226,6 +231,8 @@ fat_init (void)
_nettle_ghash_update_arm64() */
 _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
 _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
+  _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
+  _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
   }
 else
   {

Fat setup is a bit tricky, here it looks like
_nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I
would suspect that breaks when the extensions aren't available. You can
test that with NETTLE_FAT_OVERRIDE=none.
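Schematically, the shape fat_init() needs is something like the following
sketch (placeholder names throughout; the point is just that the else branch
must also leave both function pointers valid, which a NETTLE_FAT_OVERRIDE=none
run will exercise):

#include <stddef.h>
#include <stdint.h>

/* Placeholder typedef and names, for illustration only. */
typedef size_t gcm_aes_crypt_func (void *key, unsigned rounds, size_t size,
                                   uint8_t *dst, const uint8_t *src);

static gcm_aes_crypt_func *gcm_aes_encrypt_vec;
static gcm_aes_crypt_func *gcm_aes_decrypt_vec;

static void
fat_init_sketch (int have_crypto_ext,
                 gcm_aes_crypt_func *ppc64_encrypt,
                 gcm_aes_crypt_func *ppc64_decrypt,
                 gcm_aes_crypt_func *c_fallback)
{
  if (have_crypto_ext)
    {
      gcm_aes_encrypt_vec = ppc64_encrypt;
      gcm_aes_decrypt_vec = ppc64_decrypt;
    }
  else
    {
      /* Without these assignments the pointers stay undefined when the
         extensions are missing, which is exactly what
         NETTLE_FAT_OVERRIDE=none would expose. */
      gcm_aes_encrypt_vec = c_fallback;
      gcm_aes_decrypt_vec = c_fallback;
    }
}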

Sure.  I’ll test and see to fix it.


gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
size_t length, uint8_t *dst, const uint8_t *src)
{
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+  if (length >= 128) {
+PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
+if (length == 0) {
+  return;
+}
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
+
 GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}

In a non-fat build, it's fine to use a compile-time #if to select whether the
optimized code should be called. But in a fat build, we'd need a valid
function in all cases, but doing different things depending on the
runtime fat initialization. One could do that with two versions of
gcm_aes128_encrypt (which is likely preferable if we do something
similar for other archs that have separate assembly for aes128, aes192,
etc). Or we would need to call some function unconditionally, which
would be a nop if the extensions are not available. E.g., do something
like

#if HAVE_NATIVE_fat_aes_gcm_encrypt
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
 size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
  ctx->data_size += done;
  length -= done;
  if (length > 0)
{
  src += done;
  dst += done;
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}
}
#endif

where the C-implementation of _gcm_aes_encrypt could just return 0.

And it's preferable that the same interface could be used on other archs,
even if they don't do 8 blocks at a time like your ppc code.

--- a/gcm.h
+++ b/gcm.h
@@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key *key,
  (nettle_cipher_func *) (encrypt), \
  (length), (digest)))

+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
+#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
+void
+_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
+  size_t len, uint8_t *dst, const uint8_t *src);
+void
+_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
+  size_t len, uint8_t *dst, const uint8_t *src);
+struct ppc_gcm_aes_context {
+  uint8_t *x;
+  uint8_t *htable;
+  struct aes128_ctx *rkeys;
+};
+#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
+  { \
+(gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x; \
+(gcm_aes_ctx)->htable = (uint8_t *) (key); \
+(gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys; \
+  }
+
+#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
+  { \
+size_t rem_len = 0; \
+struct ppc_gcm_aes_context c; \
+struct gcm_ctx *gctx = &(ctx)->gcm; \
+GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
+if ((encrypt)) { \
+  _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+} else { \
+  _ppc_gcm_aes_decrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+} \
+rem_len = (length) % (GCM_BLOCK_SIZE * 8); \
+(length) -= rem_len; \
+gctx->data_size += (length); \
+(dst) += (length); \
+(src) += (length); \
+(length) = rem_len; \
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */

This looks a little awkward. I think it would be better to pass the
various pointers needed by the assembly implementation as separate
(register) arguments. Or pass the pointer to the struct gcm_aesxxx_ctx
directly (with the disadvantage that assembly code needs to know
corresponding offsets).
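If the ctx-pointer variant is chosen, the layout assumptions the assembly
relies on can at least be pinned down on the C side.  A sketch, assuming C11
_Static_assert is acceptable; only the first assertion is a known property of
the current headers (gcm_key is the first member), the trailing comment marks
where further checks would go:

#include <stddef.h>
#include <nettle/gcm.h>

/* The stitched assembly would read context fields at fixed offsets; make a
   layout change a build error instead of a silent miscomputation. */
_Static_assert (offsetof (struct gcm_aes128_ctx, key) == 0,
                "asm assumes the gcm_key hash table starts at offset 0");
/* Similar assertions for the offsets of gcm.x, gcm.ctr and cipher.keys
   (whatever values the assembly actually assumes) would follow here. */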


I’ll see which is the better option.

--- a/powerpc64/machi

Re: ppc64: v2, AES/GCM Performance improvement with stitched implementation

2023-12-18 Thread Danny Tsen
Correction: 719-byte test vector.


On Dec 12, 2023, at 9:01 AM, Danny Tsen  wrote:



On Dec 11, 2023, at 10:32 AM, Niels Möller  wrote:

Danny Tsen  writes:

Here is the version 2 for AES/GCM stitched patch. The stitched code is
in all assembly and m4 macros are used. The overall performance
improved around ~110% and 120% for encrypt and decrypt respectively.
Please see the attached patch and aes benchmark.

Thanks, comments below.

--- a/fat-ppc.c
+++ b/fat-ppc.c
@@ -226,6 +231,8 @@ fat_init (void)
_nettle_ghash_update_arm64() */
 _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
 _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
+  _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
+  _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
   }
 else
   {

Fat setup is a bit tricky, here it looks like
_nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I
would suspect that breaks when the extensions aren't available. You can
test that with NETTLE_FAT_OVERRIDE=none.

Sure.  I’ll test and see to fix it.


gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
size_t length, uint8_t *dst, const uint8_t *src)
{
+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+  if (length >= 128) {
+PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
+if (length == 0) {
+  return;
+}
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */
+
 GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}

In a non-fat build, it's fine to use a compile-time #if to select whether the
optimized code should be called. But in a fat build, we'd need a valid
function in all cases, but doing different things depending on the
runtime fat initialization. One could do that with two versions of
gcm_aes128_encrypt (which is likely preferable if we do something
similar for other archs that have separate assembly for aes128, aes192,
etc). Or we would need to call some function unconditionally, which
would be a nop if the extensions are not available. E.g., do something
like

#if HAVE_NATIVE_fat_aes_gcm_encrypt
void
gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
 size_t length, uint8_t *dst, const uint8_t *src)
{
  size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
  _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
  ctx->data_size += done;
  length -= done;
  if (length > 0)
{
  src += done;
  dst += done;
  GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
}
}
#endif

where the C-implementation of _gcm_aes_encrypt could just return 0.

And it's preferable that the same interface could be used on other archs,
even if they don't do 8 blocks at a time like your ppc code.

--- a/gcm.h
+++ b/gcm.h
@@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key *key,
  (nettle_cipher_func *) (encrypt), \
  (length), (digest)))

+#if defined(HAVE_NATIVE_AES_GCM_STITCH)
+#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
+#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
+void
+_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
+  size_t len, uint8_t *dst, const uint8_t *src);
+void
+_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
+  size_t len, uint8_t *dst, const uint8_t *src);
+struct ppc_gcm_aes_context {
+  uint8_t *x;
+  uint8_t *htable;
+  struct aes128_ctx *rkeys;
+};
+#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
+  { \
+(gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x; \
+(gcm_aes_ctx)->htable = (uint8_t *) (key); \
+(gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys; \
+  }
+
+#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
+  { \
+size_t rem_len = 0; \
+struct ppc_gcm_aes_context c; \
+struct gcm_ctx *gctx = &(ctx)->gcm; \
+GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
+if ((encrypt)) { \
+  _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+} else { \
+  _ppc_gcm_aes_decrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
+} \
+rem_len = (length) % (GCM_BLOCK_SIZE * 8); \
+(length) -= rem_len; \
+gctx->data_size += (length); \
+(dst) += (length); \
+(src) += (length); \
+(length) = rem_len; \
+  }
+#endif /* HAVE_NATIVE_AES_GCM_STITCH */

This looks a little awkward. I think it would be better to pass the
various pointers needed by the assembly implementation as separate
(register) arguments. Or pass the pointer to the struct gcm_aesxxx_ctx
directly (with the disadvantage that assembly code needs to know
corresponding offsets).


I’ll see which is the better option.

--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -63,3 +63,40 @@ C INC_VR(VR, INC)
define(`INC_VR',`ifelse(substr($1,0,1),`v',
``v'eval($2+substr($1,1,len($1)))',
`eval($2+$1)')')
+
+C Adding state and round key 0
+C XOR_4RK0(state, state, rkey0)
+define

Re: ppc64: v2, AES/GCM Performance improvement with stitched implementation

2023-12-12 Thread Danny Tsen


> On Dec 11, 2023, at 10:32 AM, Niels Möller  wrote:
> 
> Danny Tsen  writes:
> 
>> Here is the version 2 for AES/GCM stitched patch. The stitched code is
>> in all assembly and m4 macros are used. The overall performance
>> improved around ~110% and 120% for encrypt and decrypt respectively.
>> Please see the attached patch and aes benchmark.
> 
> Thanks, comments below.
> 
>> --- a/fat-ppc.c
>> +++ b/fat-ppc.c
>> @@ -226,6 +231,8 @@ fat_init (void)
>>  _nettle_ghash_update_arm64() */
>>   _nettle_ghash_set_key_vec = _nettle_ghash_set_key_ppc64;
>>   _nettle_ghash_update_vec = _nettle_ghash_update_ppc64;
>> +  _nettle_ppc_gcm_aes_encrypt_vec = _nettle_ppc_gcm_aes_encrypt_ppc64;
>> +  _nettle_ppc_gcm_aes_decrypt_vec = _nettle_ppc_gcm_aes_decrypt_ppc64;
>> }
>>   else
>> {
> 
> Fat setup is a bit tricky, here it looks like
> _nettle_ppc_gcm_aes_decrypt_vec is left undefined by the else clause. I
> would suspect that breaks when the extensions aren't available. You can
> test that with NETTLE_FAT_OVERRIDE=none.

Sure.  I’ll test and see to fix it.

> 
>> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>>  size_t length, uint8_t *dst, const uint8_t *src)
>> {
>> +#if defined(HAVE_NATIVE_AES_GCM_STITCH)
>> +  if (length >= 128) {
>> +PPC_GCM_CRYPT(1, _AES128_ROUNDS, ctx, length, dst, src);
>> +if (length == 0) {
>> +  return;
>> +}
>> +  }
>> +#endif /* HAVE_NATIVE_AES_GCM_STITCH */
>> +
>>   GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>> }
> 
> In a non-fat build, it's fine to use a compile-time #if to select whether the
> optimized code should be called. But in a fat build, we'd need a valid
> function in all cases, but doing different things depending on the
> runtime fat initialization. One could do that with two versions of
> gcm_aes128_encrypt (which is likely preferable if we do something
> similar for other archs that have separate assembly for aes128, aes192,
> etc). Or we would need to call some function unconditionally, which
> would be a nop if the extensions are not available. E.g., do something
> like
> 
>  #if HAVE_NATIVE_fat_aes_gcm_encrypt
>  void
>  gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>size_t length, uint8_t *dst, const uint8_t *src)
>  {
>size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x, &ctx->gcm.ctr,
> _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);
>ctx->data_size += done;
>length -= done;
>if (length > 0) 
>  {
>src += done;
>dst += done;
>GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
>  }
>  }
>  #endif
> 
> where the C-implementation of _gcm_aes_encrypt could just return 0.
> 
> And it's preferable that the same interface could be used on other archs,
> even if they don't do 8 blocks at a time like your ppc code.
> 
>> --- a/gcm.h
>> +++ b/gcm.h
>> @@ -195,6 +195,47 @@ gcm_digest(struct gcm_ctx *ctx, const struct gcm_key 
>> *key,
>>(nettle_cipher_func *) (encrypt), \
>>(length), (digest)))
>> 
>> +#if defined(HAVE_NATIVE_AES_GCM_STITCH)
>> +#define _ppc_gcm_aes_encrypt _nettle_ppc_gcm_aes_encrypt
>> +#define _ppc_gcm_aes_decrypt _nettle_ppc_gcm_aes_decrypt
>> +void
>> +_ppc_gcm_aes_encrypt (void *ctx, size_t rounds, uint8_t *ctr,
>> +  size_t len, uint8_t *dst, const uint8_t *src);
>> +void
>> +_ppc_gcm_aes_decrypt (void *ctx, size_t rounds, uint8_t *ctr,
>> +  size_t len, uint8_t *dst, const uint8_t *src);
>> +struct ppc_gcm_aes_context {
>> +  uint8_t *x;
>> +  uint8_t *htable;
>> +  struct aes128_ctx *rkeys;
>> +};
>> +#define GET_PPC_CTX(gcm_aes_ctx, ctx, key, cipher) \
>> +  { \
>> +(gcm_aes_ctx)->x = (uint8_t *) &(ctx)->x;   \
>> +(gcm_aes_ctx)->htable = (uint8_t *) (key);  \
>> +(gcm_aes_ctx)->rkeys = (struct aes128_ctx *) (cipher)->keys;\
>> +  }
>> +
>> +#define PPC_GCM_CRYPT(encrypt, rounds, ctx, length, dst, src) \
>> +  { \
>> +size_t rem_len = 0; \
>> +struct ppc_gcm_aes_context c;   \
>> +struct gcm_ctx *gctx = &(ctx)->gcm; \
>> +GET_PPC_CTX(&c, gctx, &(ctx)->key, &(ctx)->cipher); \
>> +if ((encrypt)) {\
>> +  _ppc_gcm_aes_encrypt(&c, (rounds), (&(ctx)->gcm)->ctr.b, (length), (dst), (src)); \
>> +} else {\
>> +

Re: ppc64: v2, AES/GCM Performance improvement with stitched implementation

2023-12-07 Thread Danny Tsen
Hi Niels,

Here is the version 2 for AES/GCM stitched patch.  The stitched code is in all 
assembly and m4 macros are used.  The overall performance improved around ~110% 
and 120% for encrypt and decrypt respectively.  Please see the attached patch 
and aes benchmark.

Thanks.
-Danny


> On Nov 22, 2023, at 2:27 AM, Niels Möller  wrote:
>
> Danny Tsen  writes:
>
>> Interleaving at the instructions level may be a good option but due to
>> PPC instruction pipeline this may need to have sufficient
>> registers/vectors. Using the same vectors to change contents in successive
>> instructions may require more cycles. In that case, more
>> vectors/scalar will get involved and all vectors assignment may have
>> to change. That’s the reason I avoided in this case.
>
> To investigate the potential, I would suggest some experiments with
> software pipelining.
>
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
>
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
>
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
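To make the proposed structure concrete, a schematic of the pipelined loop in C
(purely illustrative: block4, aes_ctr4 and ghash4 are stand-ins for the
unrolled vector code, not real Nettle functions):

#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the unrolled vector code; in assembly these would be
   register operations, not calls. */
typedef struct { uint8_t b[4][16]; } block4;   /* four 16-byte blocks */
static void aes_ctr4 (block4 *keystream) { (void) keystream; }
static void ghash4 (const block4 *ciphertext) { (void) ciphertext; }
static void
xor4 (block4 *out, const block4 *a, const block4 *b)
{
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 16; j++)
      out->b[i][j] = a->b[i][j] ^ b->b[i][j];
}

/* Each loop iteration encrypts a new group of four blocks while ghashing
   the ciphertext produced by the previous iteration; the first and last
   groups are handled outside the loop. */
static void
ctr_ghash_pipelined (size_t groups, const block4 *src, block4 *dst)
{
  block4 keystream, prev;
  if (groups == 0)
    return;

  aes_ctr4 (&keystream);            /* prologue: AES-CTR for group 0 */
  xor4 (&prev, &keystream, &src[0]);
  dst[0] = prev;

  for (size_t i = 1; i < groups; i++)
    {
      aes_ctr4 (&keystream);        /* AES for group i, interleaved with... */
      ghash4 (&prev);               /* ...GHASH of group i-1's ciphertext   */
      xor4 (&prev, &keystream, &src[i]);
      dst[i] = prev;
    }

  ghash4 (&prev);                   /* epilogue: GHASH of the last group */
}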
>
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
>
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-22 Thread Danny Tsen


> On Nov 22, 2023, at 2:27 AM, Niels Möller  wrote:
> 
> Danny Tsen  writes:
> 
>> Interleaving at the instructions level may be a good option but due to
>> PPC instruction pipeline this may need to have sufficient
>> registers/vectors. Using the same vectors to change contents in successive
>> instructions may require more cycles. In that case, more
>> vectors/scalar will get involved and all vectors assignment may have
>> to change. That’s the reason I avoided in this case.
> 
> To investigate the potential, I would suggest some experiments with
> software pipelining.
> 
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
> 
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
> 
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
> 
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
This is a rather idealized condition.  Too much interleaving may not produce the 
best results, and different architectures may behave differently.  I had tried 
various ways when I implemented the AES/GCM stitching functions for OpenSSL.  I’ll 
give it a try since your ghash function is different.

> 
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.
4x unrolling may not produce the best performance.  I did that when I 
implemented this stitching function in OpenSSL; it's in one assembly file 
with no function calls outside the function.  Once again, calling a function 
within a loop introduces a lot of overhead.  Here are my past results for your 
reference.  The first one is the original OpenSSL performance, the second one 
is the 4x unrolling, and the third one is the 8x.  But I can try again.

(This was run on a P10 machine at 3.5 GHz.)

AES-128-GCM  382128.50k  1023073.64k  2621489.41k  3604979.37k  4018642.94k  4032080.55k
AES-128-GCM  347370.13k  1236054.06k  2778748.59k  3900567.21k  4527158.61k  4579759.45k  (4x AES and 4x ghash)
AES-128-GCM  356520.19k   989983.06k  2902907.56k  4379016.19k  5180981.25k  5249717.59k  (8x AES and 2 4x ghash combined)

Thanks.
-Danny

> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.



Re: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen
Hi Niels,

More comments.  Please see inline.

> On Nov 21, 2023, at 1:46 PM, Danny Tsen  wrote:
> 
> Hi Niels,
> 
> Thanks for the quick response.
> 
> I'll think more thru your comments here and it may take some more time to get 
> an update.  And just a quick answer to 4 of your questions.
> 
> 
>  1.  Depends on some special registers from the caller.  This is so that I don't 
> need to change the registers used in the aes_internal_encrypt and gf_mul_4x 
> functions.  This is a way to minimize changes to the existing code.  
> But I can change that for sure.  An m4 macro could be helpful here.
>  2.  The reason to use gcm_encrypt is to minimize duplicate code in 
> gcm_aes128..., but I can change that.
>  3.  Yes, 4x blocks won't provide the same performance as 8x.
>  4.  Yes, function calls do introduce quite a lot of overhead in a loop.  We 
> can call gf_mul_4x from _ghash_update but the stack handling has to be 
> changed and I tried not to change anything in _ghash_update since my code 
> doesn't call _ghash_update.  But I guess I can use an m4 macro instead.
> 
> Thanks.
> -Danny
> ____
> From: Niels Möller 
> Sent: Tuesday, November 21, 2023 1:07 PM
> To: Danny Tsen 
> Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
> Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with 
> stitched implementation
> 
> Danny Tsen  writes:
> 
>> This patch provides a performance improvement over AES/GCM with a stitched
>> implementation for ppc64.  The code is a wrapper in assembly that handles
>> multiples of 8 blocks and both big and little endian.
>> 
>> The overall improvement is based on the nettle-benchmark with ~80% 
>> improvement for
>> AES/GCM encrypt and ~86% improvement for decrypt over the current baseline.  
>> The
>> benchmark was run on a P10 machine with 3.896GHz CPU.
> 
> That's a pretty nice performance improvement. A first round of comments
> below, mainly structural.
> 
> (And I think attachments didn't make it to the list, possibly because
> some of them had Content-type: application/octet-stream rather than
> text/plain).
> 
>> +#if defined(__powerpc64__) || defined(__powerpc__)
>> +#define HAVE_AES_GCM_STITCH 1
>> +#endif
> 
> If the C code needs to know about optional assembly functions, the
> HAVE_NATIVE tests are intended for that.
> 
>> void
>> gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
>>const void *cipher, nettle_cipher_func *f,
>> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key 
>> *key,
>> {
>>   assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>> 
>> +#if defined(HAVE_AES_GCM_STITCH)
>> +  size_t rem_len = 0;
>> +
>> +  if (length >= 128) {
>> +int rounds = 0;
>> +if (f == (nettle_cipher_func *) aes128_encrypt) {
>> +  rounds = _AES128_ROUNDS;
>> +} else if (f == (nettle_cipher_func *) aes192_encrypt) {
>> +  rounds = _AES192_ROUNDS;
>> +} else if (f == (nettle_cipher_func *) aes256_encrypt) {
>> +  rounds = _AES256_ROUNDS;
>> +}
>> +if (rounds) {
>> +  struct gcm_aes_context c;
>> +  get_ctx(&c, ctx, key, cipher);
>> +  _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);
> 
> I think this is the wrong place for this dispatch, I think it should go
> in gcm-aes128.c, gcm-aes192.c, etc.
> 
>> --- a/powerpc64/p8/aes-encrypt-internal.asm
>> +++ b/powerpc64/p8/aes-encrypt-internal.asm
>> @@ -52,6 +52,16 @@ define(`S5', `v7')
>> define(`S6', `v8')
>> define(`S7', `v9')
>> 
>> +C re-define SRC if from _gcm_aes
>> +define(`S10', `v10')
>> +define(`S11', `v11')
>> +define(`S12', `v12')
>> +define(`S13', `v13')
>> +define(`S14', `v14')
>> +define(`S15', `v15')
>> +define(`S16', `v16')
>> +define(`S17', `v17')
>> +
>> .file "aes-encrypt-internal.asm"
>> 
>> .text
>> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>>  DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>> 
>>  subi ROUNDS,ROUNDS,1
>> +
>> + cmpdi r23, 0x5f C call from _gcm_aes
>> + beq Lx8_loop
>> +
>>  srdi LENGTH,LENGTH,4
>> 
>>  srdi r5,LENGTH,3 #8x loop count
>> @@ -93,6 +107,9 @@ Lx8_loop:
>>  lxvd2x VSR(K),0,KEYS
>>  vperm   K,K,K,SWAP_MASK
>> 
>> + cmpdi r23, 0x5f
>> + beq Skip_load
> 
> It's a little messy to have branches depending on a special register set
> by some callers. I think it would be simpler to eithe

RE: Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen
Hi Niels,

Thanks for the quick response.

I'll think more thru your comments here and it may take some more time to get 
an update.  And just a quick answer to 4 of your questions.


  1.  Depends on some special registers from the caller.  This is so that I don't 
need to change the registers used in the aes_internal_encrypt and gf_mul_4x 
functions.  This is a way to minimize changes to the existing code.  
But I can change that for sure.  An m4 macro could be helpful here.
  2.  The reason to use gcm_encrypt is to minimize duplicate code in 
gcm_aes128..., but I can change that.
  3.  Yes, 4x blocks won't provide the same performance as 8x.
  4.  Yes, function calls do introduce quite a lot of overhead in a loop.  We 
can call gf_mul_4x from _ghash_update but the stack handling has to be changed 
and I tried not to change anything in _ghash_update since my code doesn't call 
_ghash_update.  But I guess I can use an m4 macro instead.

Thanks.
-Danny

From: Niels Möller 
Sent: Tuesday, November 21, 2023 1:07 PM
To: Danny Tsen 
Cc: nettle-bugs@lists.lysator.liu.se; George Wilson
Subject: [EXTERNAL] Re: Fw: ppc64: AES/GCM Performance improvement with 
stitched implementation

Danny Tsen  writes:

> This patch provides a performance improvement over AES/GCM with a stitched
> implementation for ppc64.  The code is a wrapper in assembly that handles
> multiples of 8 blocks and both big and little endian.
>
> The overall improvement is based on the nettle-benchmark with ~80% 
> improvement for
> AES/GCM encrypt and ~86% improvement for decrypt over the current baseline.  
> The
> benchmark was run on a P10 machine with 3.896GHz CPU.

That's a pretty nice performance improvement. A first round of comments
below, mainly structural.

(And I think attachments didn't make it to the list, possibly because
some of them had Content-type: application/octet-stream rather than
text/plain).

> +#if defined(__powerpc64__) || defined(__powerpc__)
> +#define HAVE_AES_GCM_STITCH 1
> +#endif

If the C code needs to know about optional assembly functions, the
HAVE_NATIVE tests are intended for that.

>  void
>  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
> const void *cipher, nettle_cipher_func *f,
> @@ -209,6 +228,35 @@ gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key 
> *key,
>  {
>assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
>
> +#if defined(HAVE_AES_GCM_STITCH)
> +  size_t rem_len = 0;
> +
> +  if (length >= 128) {
> +int rounds = 0;
> +if (f == (nettle_cipher_func *) aes128_encrypt) {
> +  rounds = _AES128_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes192_encrypt) {
> +  rounds = _AES192_ROUNDS;
> +} else if (f == (nettle_cipher_func *) aes256_encrypt) {
> +  rounds = _AES256_ROUNDS;
> +}
> +if (rounds) {
> +  struct gcm_aes_context c;
> +  get_ctx(&c, ctx, key, cipher);
> +  _nettle_ppc_gcm_aes_encrypt_ppc64(&c, rounds, ctx->ctr.b, length, dst, src);

I think this is the wrong place for this dispatch, I think it should go
in gcm-aes128.c, gcm-aes192.c, etc.

> --- a/powerpc64/p8/aes-encrypt-internal.asm
> +++ b/powerpc64/p8/aes-encrypt-internal.asm
> @@ -52,6 +52,16 @@ define(`S5', `v7')
>  define(`S6', `v8')
>  define(`S7', `v9')
>
> +C re-define SRC if from _gcm_aes
> +define(`S10', `v10')
> +define(`S11', `v11')
> +define(`S12', `v12')
> +define(`S13', `v13')
> +define(`S14', `v14')
> +define(`S15', `v15')
> +define(`S16', `v16')
> +define(`S17', `v17')
> +
>  .file "aes-encrypt-internal.asm"
>
>  .text
> @@ -66,6 +76,10 @@ PROLOGUE(_nettle_aes_encrypt)
>   DATA_LOAD_VEC(SWAP_MASK,.swap_mask,r5)
>
>   subi ROUNDS,ROUNDS,1
> +
> + cmpdi r23, 0x5f C call from _gcm_aes
> + beq Lx8_loop
> +
>   srdi LENGTH,LENGTH,4
>
>   srdi r5,LENGTH,3 #8x loop count
> @@ -93,6 +107,9 @@ Lx8_loop:
>   lxvd2x VSR(K),0,KEYS
>   vperm   K,K,K,SWAP_MASK
>
> + cmpdi r23, 0x5f
> + beq Skip_load

It's a little messy to have branches depending on a special register set
by some callers. I think it would be simpler to either move the round
loop (i.e., the loop with the label from L8x_round_loop:) into a
subroutine with all-register arguments, and call that from both
_nettle_aes_encrypt and _nettle_gcm_aes_encrypt. Or define an m4 macro
expanding to the body of that loop, and use that macro in both places.

> --- /dev/null
> +++ b/powerpc64/p8/gcm-aes-decrypt.asm
> @@ -0,0 +1,425 @@
> +C powerpc64/p8/gcm-aes-decrypt.asm

> +.macro SAVE_REGS
> + mflr 0
> + std 0,16(1)
> + stdu  SP,-464(SP)

If macros are needed, please use m4 macros, like other nettle assembly code.

> +.align 5
> +Loop8x_de:
[...]
> +bl _nettle_aes_encrypt_ppc64

I suspect thi

Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-21 Thread Danny Tsen



To Whom It May Concern,

This patch provides a performance improvement over AES/GCM with a stitched 
implementation for ppc64.  The code is a wrapper in assembly that handles 
multiples of 8 blocks and both big and little endian.

The overall improvement is based on the nettle-benchmark with ~80% improvement 
for AES/GCM encrypt and ~86% improvement for decrypt over the current baseline. 
 The benchmark was run on a P10 machine with 3.896GHz CPU.

Please find the attached patch and benchmarks.

Thanks.
-Danny


Fw: ppc64: AES/GCM Performance improvement with stitched implementation

2023-11-20 Thread Danny Tsen





To Whom It May Concern,

This patch provides a performance improvement over AES/GCM with a stitched 
implementation for ppc64.  The code is a wrapper in assembly that handles 
multiples of 8 blocks and both big and little endian.

The overall improvement is based on the nettle-benchmark with ~80% improvement 
for AES/GCM encrypt and ~86% improvement for decrypt over the current baseline. 
 The benchmark was run on a P10 machine with 3.896GHz CPU.

Please find the attached patch and benchmarks.

Thanks.
-Danny