Re: ppc64 micro optimization
Niels Möller writes:

> I've added tests that set the initial counter so that the four counter
> bytes wrap around 2^32, and I've verified that these instructions do
> need to be changed to vadduwm to get output that agrees with nettle's
> other gcm implementations.

I've committed those fixes, and a fix for big-endian support, on the
branch ppc64-gcm-aes-rebased. I think that's now ready for merging. I
see some opportunities for further improvement, but that can be done
after the merge, together with related fixes to the other ppc64 assembly
files.

> Another question on powerpc64 assembly: For the byte swapping, currently
> done using the vperm instruction and a mask word, is there any reason to
> not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword)
> instead? That applies to more functions than the new gcm-aes code.

A closer look at the spec indicates that xxbrd is only available from
power9 (i.e., if the processor supports VSX *and* ISA 3.0, if I've
understood it correctly). I think it would be a good idea to
consistently use pseudo-ops like .machine "power8" in the ppc assembly
files, if that would let the assembler catch accidental use of
unavailable instructions.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Niels Möller writes:

> One other question: In the counter updates,
>
>> C increase ctr value as input to aes_encrypt
>> vaddudm S1, S0, CNT1
>> vaddudm S2, S1, CNT1
>> vaddudm S3, S2, CNT1
>> vaddudm S4, S3, CNT1
>> vaddudm S5, S4, CNT1
>> vaddudm S6, S5, CNT1
>> vaddudm S7, S6, CNT1
>
> shouldn't that be vadduwm (32-bit word addition, rather than 64-bit
> dword addition)? As I understand it, gcm uses a 32-bit counter, which
> should wrap around without any carry to higher bits if the initial value
> is just below 2^32.

I've added tests that set the initial counter so that the four counter
bytes wrap around 2^32, and I've verified that these instructions do
need to be changed to vadduwm to get output that agrees with nettle's
other gcm implementations.

Another question on powerpc64 assembly: For the byte swapping, currently
done using the vperm instruction and a mask word, is there any reason to
not use the xxbrd instruction (VSX Vector Byte-Reverse Doubleword)
instead? That applies to more functions than the new gcm-aes code.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
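For reference, the counter semantics at issue here can be modelled in C
as follows. This is a minimal sketch, not code from the patch: GCM's
inc32 operation touches only the last 32 bits of the 16-byte counter
block, so a value just below 2^32 wraps without carrying into the upper
96 bits, which is what per-word vadduwm provides and per-doubleword
vaddudm does not.

#include <stdint.h>

/* Sketch of GCM's inc32: increment only the big-endian 32-bit counter
   held in bytes 12..15 of the counter block, modulo 2^32.  Bytes 0..11
   must never see a carry. */
static void
ctr32_increment (uint8_t ctr[16])
{
  uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
    | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];
  c++;                          /* wraps around at 2^32 */
  ctr[12] = c >> 24;
  ctr[13] = c >> 16;
  ctr[14] = c >> 8;
  ctr[15] = c;
}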
Re: ppc64 micro optimization
Niels Möller writes:

> Below is an updated version of gcm-aes-encrypt.asm, seems to work for
> me, and uses fewer of the regular registers. Some comments and
> questions:
>
> 1. What about the vsrX registers, 0 <= X < 32? They are used to copy
>    values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,
>
>      xxlor vs1, VSR(S0), VSR(S0)
>
>    Can those registers be used freely, and how?

I've asked in a different forum, and as far as I understand, registers
vs0-vs13 are free to use ("volatile"), because half of each corresponds
to a volatile floating point register (fpr0-fpr13), while registers
vs14-vs31 need to be saved and restored if used (the halves
corresponding to fpr14-fpr31 are non-volatile, so in principle, it would
be sufficient to save and restore only those halves).

> 2. From my reading of the ELF v2 ABI spec, there's a "protected zone"
>    below the stack pointer that can be used freely for storage. Is that
>    right? Or maybe that's only for the ELFv2 ABI?

That appears to be the same in the ELFv1 ABI, see
https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#STACK

One other question: In the counter updates,

> C increase ctr value as input to aes_encrypt
> vaddudm S1, S0, CNT1
> vaddudm S2, S1, CNT1
> vaddudm S3, S2, CNT1
> vaddudm S4, S3, CNT1
> vaddudm S5, S4, CNT1
> vaddudm S6, S5, CNT1
> vaddudm S7, S6, CNT1

shouldn't that be vadduwm (32-bit word addition, rather than 64-bit
dword addition)? As I understand it, gcm uses a 32-bit counter, which
should wrap around without any carry to higher bits if the initial value
is just below 2^32.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Niels Möller writes:

> Next, I'll have a look at register usage in the assembly code.

Below is an updated version of gcm-aes-encrypt.asm, seems to work for
me, and uses fewer of the regular registers. Some comments and
questions:

1. What about the vsrX registers, 0 <= X < 32? They are used to copy
   values from and to the v registers (aka vsrX, 32 <= X < 64), e.g.,

     xxlor vs1, VSR(S0), VSR(S0)

   Can those registers be used freely, and how? If we can use them, we
   shouldn't need to save and restore any vector registers. They're not
   mentioned in powerpc64/README. Looking in the ELF v2 ABI spec, that
   seems to say that the low halves (which are used as floating point
   registers) are "volatile", but that's not quite enough?

2. From my reading of the ELF v2 ABI spec, there's a "protected zone"
   below the stack pointer that can be used freely for storage. Is that
   right? Or maybe that's only for the ELFv2 ABI?

3. Nit: In the copyright line, I'd like to delete the "All rights
   reserved" phrase. That's not in the copyright header of any other
   Nettle file, including those previously contributed by IBM.

Regards,
/Niels

-8<---

C powerpc64/p8/gcm-aes-encrypt.asm

ifelse(`
   Copyright (C) 2023- IBM Inc. All rights reserved

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program. If
   not, see http://www.gnu.org/licenses/.
')

C Register usage:

define(`SP', `r1')
define(`TOCP', `r2')

define(`HT', `r3')
define(`SRND', `r4')
define(`SLEN', `r5')
define(`SDST', `r6')
define(`SSRC', `r7')
define(`RK', `r8')
C r9-r11 used as constant indices.
define(`LOOP', `r12')

C
C vectors used in aes encrypt output
C

define(`K', `v1')
define(`S0', `v2')
define(`S1', `v3')
define(`S2', `v4')
define(`S3', `v5')
define(`S4', `v6')
define(`S5', `v7')
define(`S6', `v8')
define(`S7', `v9')

C
C ghash assigned registers and vectors
C

define(`ZERO', `v21')
define(`POLY', `v22')
define(`POLY_L', `v0')

define(`D', `v10')
define(`H1M', `v11')
define(`H1L', `v12')
define(`H2M', `v13')
define(`H2L', `v14')
define(`H3M', `v15')
define(`H3L', `v16')
define(`H4M', `v17')
define(`H4L', `v18')
define(`R', `v19')
define(`F', `v20')
define(`R2', `v21')
define(`F2', `v22')

define(`LE_TEMP', `v30')
define(`LE_MASK', `v31')

define(`CNT1', `v28')
define(`LASTCNT', `v29')

.file "gcm-aes-encrypt.asm"

.text

C size_t
C _gcm_aes_encrypt(struct gcm_key *key, size_t rounds,
C                  size_t len, uint8_t *dst, const uint8_t *src)
C

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_gcm_aes_encrypt)
    srdi. LOOP, SLEN, 7     C loop n 8 blocks
    beq No_encrypt_out

    C 288 byte "protected zone" is sufficient for storage.
    stxv VSR(v20), -16(SP)
    stxv VSR(v21), -32(SP)
    stxv VSR(v22), -48(SP)
    stxv VSR(v28), -64(SP)
    stxv VSR(v29), -80(SP)
    stxv VSR(v30), -96(SP)
    stxv VSR(v31), -112(SP)

    vxor ZERO,ZERO,ZERO
    vspltisb CNT1, 1
    vsldoi CNT1, ZERO, CNT1, 1      C counter 1

    DATA_LOAD_VEC(POLY,.polynomial,r9)
IF_LE(`
    li r9,0
    lvsl LE_MASK,0,r9
    vspltisb LE_TEMP,0x07
    vxor LE_MASK,LE_MASK,LE_TEMP
')
    xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY)

    C load table elements
    li r9,1*16
    li r10,2*16
    li r11,3*16
    lxvd2x VSR(H1M),0,HT
    lxvd2x VSR(H1L),r9,HT
    lxvd2x VSR(H2M),r10,HT
    lxvd2x VSR(H2L),r11,HT
    addi HT, HT, 64
    lxvd2x VSR(H3M),0,HT
    lxvd2x VSR(H3L),r9,HT
    lxvd2x VSR(H4M),r10,HT
    lxvd2x VSR(H4L),r11,HT

    addi HT, HT, 4048   C Advance to point to the 'CTR' field in the context

    lxvd2x VSR(D),r9,HT     C load 'X' pointer
    C byte-reverse of each doubleword permuting on little-endian mode
IF_LE(`
    vperm D,D,D,LE_MASK
')

    lxvb16x VSR(S0), 0, HT      C Load 'CTR'

    sldi SLEN, LOOP, 7
    addi LOOP, LOOP, -1

    lxvd2x VSR(K),r11,HT        C First subkey
    vperm K,K,K,LE_MASK

.align 5
    C increase ctr value as input to aes_encrypt
    vaddudm S1, S0, CNT1
    vaddudm S2, S
Re: ppc64 micro optimization
Niels Möller writes: > Danny Tsen writes: > >> My fault. I did not include the gym-aes-crypt.c in the patch. Here is >> the updated patch. Please apply this one and we can work from there. > > Thanks, now pushed onto a new branch ppc64-gcm-aes. I've now pushed some more changes to that branch: added gcm-internal.h, fixed the non-fat case, and moved around the nop definitions of the new functions. Next, I'll have a look at register usage in the assembly code. Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
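The nop definitions mentioned above are just fallbacks that process zero
bytes, so the generic C code handles the whole input. A minimal sketch is
shown below; the function name and its placement are assumptions for
illustration, not the committed code, and the signature follows the
interface proposed earlier in the thread.

#include <stddef.h>
#include <stdint.h>

struct gcm_key;  /* defined in gcm.h; opaque here */

/* Fallback selected when no assembly implementation is available:
   report zero bytes processed and let the generic path do the work.
   The same body can back both the encrypt and decrypt entry points. */
static size_t
_gcm_aes_crypt_nop (struct gcm_key *key, unsigned rounds,
                    size_t length, uint8_t *dst, const uint8_t *src)
{
  (void) key; (void) rounds; (void) length; (void) dst; (void) src;
  return 0;
}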
Re: ppc64 micro optimization
Danny Tsen writes:

> My fault. I did not include gcm-aes-crypt.c in the patch. Here is
> the updated patch. Please apply this one and we can work from there.

Thanks, now pushed onto a new branch ppc64-gcm-aes.

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Hi Niels,

My fault. I did not include gcm-aes-crypt.c in the patch. Here is the
updated patch. Please apply this one and we can work from there.

Thanks.
-Danny

> On Mar 5, 2024, at 1:08 PM, Niels Möller wrote:
>
> Danny Tsen writes:
>
>> Please let me know when you merge the code and we can work from there.
>
> Hi, I tried to apply and build with the v5 patch, and noticed some problems.
>
> Declaration of _gcm_aes_encrypt / _gcm_aes_decrypt is missing. It can go
> in gcm-internal.h, like on this branch,
> https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.h?ref_type=heads
> Corresponding name mangling defines should also be in gcm-internal.h,
> not in the installed gcm.h header.
>
> The file gcm-aes.c was missing in the patch. If the dummy C versions of
> _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the
> definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy
> function for both encrypt and decrypt).
>
> It would also be nice if you could check that the new code is used
> and working in a non-fat build, configured with --disable-fat
> --enable-power-crypto-ext.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.

___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Danny Tsen writes: > Please let me know when you merge the code and we can work from there. Hi, I tried to apply and build with the v5 patch, and noticed some problems. Declaration of _gcm_aes_encrypt / _gcm_aes_decrypt is missing. It can go in gcm-internal.h, like on this branch, https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/gcm-internal.h?ref_type=heads Corresponding name mangling defines should also be in gcm-internal.h, not in the installed gcm.h header. The file gcm-aes.c was missing in the patch. If the dummy C versions of _gcm_aes_*crypt are needed only for fat builds, maybe simplest to put the definitions in fat-ppc.c (maybe one can even use the same "return 0" dummy function for both encrypt and decrypt). It would also be nice if you could check that the new code is used and working in a non-fat build, configured with --disable-fat --enable-power-crypto-ext. Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
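Concretely, the gcm-internal.h additions suggested above would amount to
something like the sketch below. The name-mangling defines follow the
convention used for other internal symbols, and the exact prototypes are
an assumption based on the interface discussed earlier in the thread.

/* gcm-internal.h additions (sketch) */
#include <stddef.h>
#include <stdint.h>

struct gcm_key;

/* Name mangling for internal symbols. */
#define _gcm_aes_encrypt _nettle_gcm_aes_encrypt
#define _gcm_aes_decrypt _nettle_gcm_aes_decrypt

/* Process some prefix of the data (whole blocks only) and return the
   number of bytes handled; 0 means "fall back to the generic code". */
size_t
_gcm_aes_encrypt (struct gcm_key *key, unsigned rounds,
                  size_t length, uint8_t *dst, const uint8_t *src);

size_t
_gcm_aes_decrypt (struct gcm_key *key, unsigned rounds,
                  size_t length, uint8_t *dst, const uint8_t *src);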
RE: ppc64 micro optimization
Hi Niels, Please let me know when you merge the code and we can work from there. Thanks. -Danny From: Niels Möller Sent: Friday, February 23, 2024 1:07 AM To: Danny Tsen Cc: nettle-bugs@lists.lysator.liu.se ; George Wilson Subject: [EXTERNAL] Re: ppc64 micro optimization Danny Tsen writes: > Here is the v5 patch from your comments. Please review. Thanks. I think this looks pretty good. Maybe I should commit it on a branch and we can iterate from there. I'll be on vacation and mostly offline next week, though. > --- a/gcm-aes128.c > +++ b/gcm-aes128.c > @@ -63,6 +63,11 @@ void > gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx, >size_t length, uint8_t *dst, const uint8_t *src) > { > + size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, > length, dst, src); > + ctx->gcm.data_size += done; > + length -= done; > + src += done; > + dst += done; >GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src); > } We should come up with some preprocessor things to completely omit the new code on architectures that don't have _gcm_aes_encrypt (possibly with some macro to reduce duplication). I think that's the main thing I'd like to have before merge. Otherwise, looks nice and clean. Ah, and I think you you could write &ctx->key instead of the explicit cast. > +C load table elements > +li r9,1*16 > +li r10,2*16 > +li r11,3*16 > +lxvd2x VSR(H1M),0,HT > +lxvd2x VSR(H1L),r9,HT > +lxvd2x VSR(H2M),r10,HT > +lxvd2x VSR(H2L),r11,HT > +addi HT, HT, 64 > +lxvd2x VSR(H3M),0,HT > +lxvd2x VSR(H3L),r9,HT > +lxvd2x VSR(H4M),r10,HT > +lxvd2x VSR(H4L),r11,HT > + > +li r25,0x10 > +li r26,0x20 > +li r27,0x30 > +li r28,0x40 > +li r29,0x50 > +li r30,0x60 > +li r31,0x70 I still think there's opportunity to reduce number of registers (and corresponding load-store of callee save registers. E.g, here r9-r11 are used for the same thing as r25-r27. > +.align 5 > +C increase ctr value as input to aes_encrypt > +vaddudm S1, S0, CNT1 > +vaddudm S2, S1, CNT1 > +vaddudm S3, S2, CNT1 > +vaddudm S4, S3, CNT1 > +vaddudm S5, S4, CNT1 > +vaddudm S6, S5, CNT1 > +vaddudm S7, S6, CNT1 This is a rather long dependency chain; I wonder if you could make a measurable saving of a cycle or two by using additional CNT2 or CNT4 registers (if not, it's preferable to keep the current simple chain). Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Danny Tsen writes:

> Here is the v5 patch from your comments. Please review. Thanks.

I think this looks pretty good. Maybe I should commit it on a branch and
we can iterate from there. I'll be on vacation and mostly offline next
week, though.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,6 +63,11 @@ void
> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>    size_t length, uint8_t *dst, const uint8_t *src)
> {
> + size_t done = _gcm_aes_encrypt ((struct gcm_key *)ctx, _AES128_ROUNDS, length, dst, src);
> + ctx->gcm.data_size += done;
> + length -= done;
> + src += done;
> + dst += done;
>    GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
> }

We should come up with some preprocessor things to completely omit the
new code on architectures that don't have _gcm_aes_encrypt (possibly
with some macro to reduce duplication). I think that's the main thing
I'd like to have before merge. Otherwise, looks nice and clean.

Ah, and I think you could write &ctx->key instead of the explicit cast.

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +addi HT, HT, 64
> +lxvd2x VSR(H3M),0,HT
> +lxvd2x VSR(H3L),r9,HT
> +lxvd2x VSR(H4M),r10,HT
> +lxvd2x VSR(H4L),r11,HT
> +
> +li r25,0x10
> +li r26,0x20
> +li r27,0x30
> +li r28,0x40
> +li r29,0x50
> +li r30,0x60
> +li r31,0x70

I still think there's opportunity to reduce the number of registers (and
the corresponding load-store of callee-save registers). E.g., here
r9-r11 are used for the same thing as r25-r27.

> +.align 5
> +C increase ctr value as input to aes_encrypt
> +vaddudm S1, S0, CNT1
> +vaddudm S2, S1, CNT1
> +vaddudm S3, S2, CNT1
> +vaddudm S4, S3, CNT1
> +vaddudm S5, S4, CNT1
> +vaddudm S6, S5, CNT1
> +vaddudm S7, S6, CNT1

This is a rather long dependency chain; I wonder if you could make a
measurable saving of a cycle or two by using additional CNT2 or CNT4
registers (if not, it's preferable to keep the current simple chain).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
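Putting the two suggestions above together, the &ctx->key spelling and a
preprocessor guard, the call site could end up looking roughly like the
sketch below. The guard macro name HAVE_NATIVE_gcm_aes_encrypt is an
assumption for illustration; the rest follows the quoted diff and the
existing contents of gcm-aes128.c.

void
gcm_aes128_encrypt (struct gcm_aes128_ctx *ctx,
                    size_t length, uint8_t *dst, const uint8_t *src)
{
#if HAVE_NATIVE_gcm_aes_encrypt   /* guard name assumed for illustration */
  /* Let the assembly routine consume as many whole blocks as it can,
     then account for them before falling through to the generic code. */
  size_t done = _gcm_aes_encrypt (&ctx->key, _AES128_ROUNDS, length, dst, src);
  ctx->gcm.data_size += done;
  length -= done;
  src += done;
  dst += done;
#endif
  GCM_ENCRYPT (ctx, aes128_encrypt, length, dst, src);
}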
Re: ppc64 micro optimization
Hi Niels, Here is the v5 patch from your comments. Please review. Thanks. -Danny > On Feb 14, 2024, at 8:46 AM, Niels Möller wrote: > > Danny Tsen writes: > >> Here is the new patch v4 for AES/GCM stitched implementation and >> benchmark based on the current repo. > > Thanks. I'm not able to read it all carefully at the moment, but I have > a few comments, see below. > > In the mean time, I've also tried to implement something similar for > x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup, to > the contrary, my stitched implementation seems considerably slower... > But at least that helped me understand the higher-level issues better. > >> --- a/gcm-aes128.c >> +++ b/gcm-aes128.c >> @@ -63,14 +63,30 @@ void >> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx, >> size_t length, uint8_t *dst, const uint8_t *src) >> { >> - GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src); >> + size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b, >> + _AES128_ROUNDS, &ctx->cipher.keys, length, >> dst, src); > > I know I asked you to explode the context into many separate arguments > to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's > not so nice to have so many arguments that they can't be passed in > registers. Second, when running a fat build on a machine where the > needed instructions are unavailable, it's a bit of a waste to have to > spend lots of instructions on preparing those arguments for calling a > nop function. So to reduce overhead, I'm now leaning towards an > interface like > > /* To reduce the number of arguments (e.g., maximum of 6 register > arguments on x86_64), pass a pointer to gcm_key, which really is a > pointer to the first member of the appropriate gcm_aes*_ctx > struct. */ > size_t > _gcm_aes_encrypt (struct gcm_key *CTX, >unsigned rounds, >size_t size, uint8_t *dst, const uint8_t *src); > > That's not so pretty, but I think that is workable and efficient, and > since it is an internal function, the interface can be changed if this > is implemented on other architectures and we find out that it needs some > tweaks. See > https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads > for the code I wrote to accept that ctx argument. > > It would also be nice to have a #define around the code calling > _gcm_aes_encrypt, so that it is compiled only if (i) we have an > non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat > build, which may select a non-trivial implementation of _gcm_aes_encrypt > at run time. > >> + ctx->gcm.data_size += done; >> + length -= done; >> + if (length > 0) { > > Not sure of the check for length > 0 is needed. It is fine to call > gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a > call with length 0, though, which may be a more common case when > _gcm_aes_encrypt is used? > >> +define(`SAVE_GPR', `std $1, $2(SP)') >> +define(`RESTORE_GPR', `ld $1, $2(SP)') > > I think the above two macros are unneeded, it's easier to read to use > std and ld directly. > >> +define(`SAVE_VR', >> + `li r11, $2 >> + stvx $1, r11, $3') >> +define(`RESTORE_VR', >> + `li r11, $2 >> + lvx $1, r11, $3') > > It would be nice if we could trim the use of vector registers so we > don't need to save and restore lots of them. And if we need two > instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR > that also adjusts the stack pointer, and doesn't need to use an additional > register for indexing? 
> >> +C load table elements >> +li r9,1*16 >> +li r10,2*16 >> +li r11,3*16 >> +lxvd2x VSR(H1M),0,HT >> +lxvd2x VSR(H1L),r9,HT >> +lxvd2x VSR(H2M),r10,HT >> +lxvd2x VSR(H2L),r11,HT >> +li r9,4*16 >> +li r10,5*16 >> +li r11,6*16 >> +li r12,7*16 >> +lxvd2x VSR(H3M),r9,HT >> +lxvd2x VSR(H3L),r10,HT >> +lxvd2x VSR(H4M),r11,HT >> +lxvd2x VSR(H4L),r12,HT > > I think it would be nicer to follow the style I tried to implement in my > recent updates, using some registers (e.g., r9-r11) as offsets, > initializing them only once, and using everywhere. E.g., in this case, > the loading could be > >lxvd2x VSR(H1M),0,HT >lxvd2x VSR(H1L),r9,HT >lxvd2x VSR(H2M),r10,HT >lxvd2x VSR(H2L),r11,HT >addi HT, HT, 64 >lxvd2x VSR(H3M),0,HT >lxvd2x VSR(H3L),r9,HT >lxvd2x VSR(H4M),r10,HT >lxvd2x VSR(H4L),r11,HT > >> +C do two 4x ghash >> + >> +C previous digest combining >> +xxlor vs0, VSR(S0), VSR(S0) >> +vxor S0,S0,D >> + >> +GF_MUL(F2, R2, H3L, H3M, S1) >> +GF_MUL(F, R, H4L, H4M, S0) >> +vxor F,F,F2 >> +
Re: ppc64 micro optimization
Danny Tsen writes:

> Here is the new patch v4 for AES/GCM stitched implementation and
> benchmark based on the current repo.

Thanks. I'm not able to read it all carefully at the moment, but I have
a few comments, see below.

In the meantime, I've also tried to implement something similar for
x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup, to
the contrary, my stitched implementation seems considerably slower...
But at least that helped me understand the higher-level issues better.

> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,14 +63,30 @@ void
> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
>    size_t length, uint8_t *dst, const uint8_t *src)
> {
> - GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
> + size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
> +                                 _AES128_ROUNDS, &ctx->cipher.keys, length, dst, src);

I know I asked you to explode the context into many separate arguments
to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
not so nice to have so many arguments that they can't be passed in
registers. Second, when running a fat build on a machine where the
needed instructions are unavailable, it's a bit of a waste to have to
spend lots of instructions on preparing those arguments for calling a
nop function. So to reduce overhead, I'm now leaning towards an
interface like

  /* To reduce the number of arguments (e.g., maximum of 6 register
     arguments on x86_64), pass a pointer to gcm_key, which really is a
     pointer to the first member of the appropriate gcm_aes*_ctx
     struct. */
  size_t
  _gcm_aes_encrypt (struct gcm_key *CTX,
                    unsigned rounds,
                    size_t size, uint8_t *dst, const uint8_t *src);

That's not so pretty, but I think that is workable and efficient, and
since it is an internal function, the interface can be changed if this
is implemented on other architectures and we find out that it needs some
tweaks. See
https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
for the code I wrote to accept that ctx argument.

It would also be nice to have a #define around the code calling
_gcm_aes_encrypt, so that it is compiled only if (i) we have a
non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
build, which may select a non-trivial implementation of _gcm_aes_encrypt
at run time.

> + ctx->gcm.data_size += done;
> + length -= done;
> + if (length > 0) {

I'm not sure the check for length > 0 is needed. It is fine to call
gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for a
call with length 0, though, which may be a more common case when
_gcm_aes_encrypt is used?

> +define(`SAVE_GPR', `std $1, $2(SP)')
> +define(`RESTORE_GPR', `ld $1, $2(SP)')

I think the above two macros are unneeded, it's easier to read to use
std and ld directly.

> +define(`SAVE_VR',
> +  `li r11, $2
> +   stvx $1, r11, $3')
> +define(`RESTORE_VR',
> +  `li r11, $2
> +   lvx $1, r11, $3')

It would be nice if we could trim the use of vector registers so we
don't need to save and restore lots of them. And if we need two
instructions anyway, then maybe it would be clearer with PUSH_VR/POP_VR
macros that also adjust the stack pointer, and don't need to use an
additional register for indexing?

> +C load table elements
> +li r9,1*16
> +li r10,2*16
> +li r11,3*16
> +lxvd2x VSR(H1M),0,HT
> +lxvd2x VSR(H1L),r9,HT
> +lxvd2x VSR(H2M),r10,HT
> +lxvd2x VSR(H2L),r11,HT
> +li r9,4*16
> +li r10,5*16
> +li r11,6*16
> +li r12,7*16
> +lxvd2x VSR(H3M),r9,HT
> +lxvd2x VSR(H3L),r10,HT
> +lxvd2x VSR(H4M),r11,HT
> +lxvd2x VSR(H4L),r12,HT

I think it would be nicer to follow the style I tried to implement in my
recent updates, using some registers (e.g., r9-r11) as offsets,
initializing them only once, and using them everywhere. E.g., in this
case, the loading could be

   lxvd2x VSR(H1M),0,HT
   lxvd2x VSR(H1L),r9,HT
   lxvd2x VSR(H2M),r10,HT
   lxvd2x VSR(H2L),r11,HT
   addi HT, HT, 64
   lxvd2x VSR(H3M),0,HT
   lxvd2x VSR(H3L),r9,HT
   lxvd2x VSR(H4M),r10,HT
   lxvd2x VSR(H4L),r11,HT

> +C do two 4x ghash
> +
> +C previous digest combining
> +xxlor vs0, VSR(S0), VSR(S0)
> +vxor S0,S0,D
> +
> +GF_MUL(F2, R2, H3L, H3M, S1)
> +GF_MUL(F, R, H4L, H4M, S0)
> +vxor F,F,F2
> +vxor R,R,R2
> +
> +xxlor VSR(S0), vs0, vs0
> +
> +GF_MUL(F3, R3, H2L, H2M, S2)
> +GF_MUL(F4, R4, H1L, H1M, S3)
> +vxor F3,F3,F4
> +vxor R3,R3,R4
> +
> +vxor F,F,F3
> +vxor D,R,R3
> +GHASH_REDUCE(D, F, POLY_L, R2, F2) C
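The single-pointer interface above works because of how the GCM context
structs are laid out: they start with the gcm_key table, so a struct
gcm_key * to the first member reaches the rest of the state at fixed
offsets. The member names below appear in the quoted diffs (ctx->key,
ctx->gcm, ctx->cipher); the struct shown is only an illustration of that
layout, not a new definition.

#include "gcm.h"   /* for struct gcm_key, struct gcm_ctx, struct aes128_ctx */

/* Illustration of the layout assumed by _gcm_aes_encrypt: the context
   begins with the hash subkey table, followed by the per-message GCM
   state and the expanded AES round keys. */
struct gcm_aes128_ctx_layout
{
  struct gcm_key key;        /* hash subkey table, the first member */
  struct gcm_ctx gcm;        /* iv, ctr, x, auth_size, data_size */
  struct aes128_ctx cipher;  /* expanded AES round keys */
};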
Re: ppc64 micro optimization
Hi Niels, Here is the new patch v4 for AES/GCM stitched implementation and benchmark based on the current repo. Thanks. -Danny > On Jan 31, 2024, at 4:35 AM, Niels Möller wrote: > > Niels Möller writes: > >> While the powerpc64 vncipher instruction really wants the original >> subkeys, not transformed. So on power, it would be better to have a >> _nettle_aes_invert that is essentially a memcpy, and then the aes >> decrypt assembly code could be reworked without the xors, and run at exactly >> the same speed as encryption. > > I've tried this out, see branch > https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert . It > appears to give the desired improvement in aes decrypt speed, making it > run at the same speed as aes encrypt. Which is a speedup of about 80% > when benchmarked on power10 (the cfarm120 machine). > >> Current _nettle_aes_invert also changes the order of the subkeys, with >> a FIXME comment suggesting that it would be better to update the order >> keys are accessed in the aes decryption functions. > > I've merged the changes to keep subkey order the same for encrypt and > decrypt (so that the decrypt round loop uses subkeys starting at the end > of the array), which affects all aes implementations except s390x, which > doesn't need any subkey expansion. But I've deleted the sparc32 assembly > rather than updating it. > > Regards, > /Niels > > -- > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. > Internet email is subject to wholesale government surveillance. > ___ > nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se > To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Niels Möller writes: > While the powerpc64 vncipher instruction really wants the original > subkeys, not transformed. So on power, it would be better to have a > _nettle_aes_invert that is essentially a memcpy, and then the aes > decrypt assembly code could be reworked without the xors, and run at exactly > the same speed as encryption. I've tried this out, see branch https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-aes-invert. It appears to give the desired improvement in aes decrypt speed, making it run at the same speed as aes encrypt. Which is a speedup of about 80% when benchmarked on power10 (the cfarm120 machine). > Current _nettle_aes_invert also changes the order of the subkeys, with > a FIXME comment suggesting that it would be better to update the order > keys are accessed in the aes decryption functions. I've merged the changes to keep subkey order the same for encrypt and decrypt (so that the decrypt round loop uses subkeys starting at the end of the array), which affects all aes implementations except s390x, which doesn't need any subkey expansion. But I've deleted the sparc32 assembly rather than updating it. Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
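Concretely, the "essentially a memcpy" version described above could look
like the sketch below. The signature is assumed from aes-invert-internal.c
as linked earlier in the thread; treat this as an illustration rather than
the committed code.

#include <stdint.h>
#include <string.h>

/* AES uses 4*(rounds + 1) 32-bit subkey words.  With the decrypt round
   loop walking the subkey array from the end, no reordering and no
   MIX_COLUMN transform is needed here; the "inversion" is a plain copy
   of the encryption subkeys. */
void
_nettle_aes_invert (unsigned rounds, uint32_t *dst, const uint32_t *src)
{
  if (dst != src)
    memcpy (dst, src, 4 * (rounds + 1) * sizeof (*dst));
}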
RE: ppc64 micro optimization
Hi Niels,

It may take a while before I digest the changes and redo the work; it
could be another two weeks, or longer, since I will be taking time off
and will be out of the country. Will do my best. Thanks.

-Danny

From: Niels Möller
Sent: Thursday, January 25, 2024 3:58 AM
To: Danny Tsen
Cc: nettle-bugs@lists.lysator.liu.se ; George Wilson
Subject: [EXTERNAL] Re: ppc64 micro optimization

Danny Tsen writes:

> Thanks for merging the stitched implementation for PPC64 with your
> detailed information and efforts

We're not quite there yet, though. Do you think you could rebase your
work on top of recent changes? Sorry about conflicts, but I think new
macros should fit well with what you need (feel free to have additional
macros, where you find that useful).

Regards,
/Niels

--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Danny Tsen writes: > Thanks for merging the stitched implementation for PPC64 with your > detailed information and efforts We're not quite there yet, though. Do you think you could rebase your work on top of recent changes? Sorry about conflicts, but I think new macros should fit well with what you need (feel free to have additional macros, where you find that useful). Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
Re: ppc64 micro optimization
Hi Niels,

Thanks for merging the stitched implementation for PPC64 with your
detailed information and efforts. Thanks.

-Danny

> On Jan 21, 2024, at 11:27 PM, Niels Möller wrote:
>
> In preparing for merging the gcm-aes "stitched" implementation, I'm
> reviewing the existing ghash code. WIP branch "ppc-ghash-macros".
>
> I've introduced a macro GHASH_REDUCE, for the reduction logic. Besides
> that, I've been able to improve scheduling of the reduction instructions
> (adding in the result of vpmsumd last seems to improve parallelism, some
> 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also
> streamlined the way load offsets are used, and trimmed the number of
> needed vector registers slightly.
>
> For the AES code, I've merged the new macros (I settled on the names
> OPN_XXY and OPN_XXXY), no change in speed expected from that change.
>
> I've also tried to understand the difference between AES encrypt and
> decrypt, where decrypt is much slower, and uses an extra xor instruction
> in the round loop. I think the reason for that is that other AES
> implementations (including x86_64 and arm64 instructions, and Nettle's C
> implementation) expect the decryption subkeys to be transformed via the
> AES "MIX_COLUMN" operation, see
> https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_type=heads
>
> While the powerpc64 vncipher instruction really wants the original
> subkeys, not transformed. So on power, it would be better to have a
> _nettle_aes_invert that is essentially a memcpy, and then the aes
> decrypt assembly code could be reworked without the xors, and run at
> exactly the same speed as encryption. Current _nettle_aes_invert also
> changes the order of the subkeys, with a FIXME comment suggesting that
> it would be better to update the order keys are accessed in the aes
> decryption functions.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>
> ___
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se