Amitay Isaacs <ami...@ozlabs.org> writes:

> --- /dev/null
> +++ b/powerpc64/ecc-secp256r1-redc.asm
> @@ -0,0 +1,144 @@
> +C powerpc64/ecc-secp256r1-redc.asm
> +ifelse(`
> +   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
> +
> +   Based on x86_64/ecc-secp256r1-redc.asm

Looks good, and it seems the method follows the x86_64 version closely. I
just checked in a correction and a clarification to the comments in the
x86_64 version. A few comments below.

> +C Register usage:
> +
> +define(`SP', `r1')
> +
> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`F0', `r3')
> +define(`F1', `r6')
> +define(`F2', `r7')
> +define(`F3', `r8')
> +
> +define(`U0', `r9')
> +define(`U1', `r10')
> +define(`U2', `r11')
> +define(`U3', `r12')
> +define(`U4', `r14')
> +define(`U5', `r15')
> +define(`U6', `r16')
> +define(`U7', `r17')

One could save one register by letting U7 and XP overlap, since XP isn't
used after loading U7.

> +    .file "ecc-secp256r1-redc.asm"
> +
> +C FOLD(x), sets (F3,F2,F1,F0)  <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
> +define(`FOLD', `
> +    sldi    F2, $1, 32
> +    srdi    F3, $1, 32
> +    li      F0, 0
> +    li      F1, 0
> +    subfc   F0, F2, F0
> +    subfe   F1, F3, F1

I think the

    li      F0, 0
    li      F1, 0
    subfc   F0, F2, F0
    subfe   F1, F3, F1

could be replaced with

    subfic  F0, F2, 0       C "negate with borrow"
    subfze  F1, F3

Whether that is measurably faster, I can't say.

Another option: Since powerpc, like arm, seems to use the proper two's
complement convention that "borrow" is not carry, maybe we don't need to
negate to F0 and F1 at all, and instead change the later subtraction,
replacing

    subfc   U1, F0, U1
    subfe   U2, F1, U2
    subfe   U3, F2, U3
    subfe   U0, F3, U0

with

    addc    U1, F0, U1
    adde    U2, F1, U2
    subfe   U3, F2, U3
    subfe   U0, F3, U0

I haven't thought that through, but it does make some sense to me. I
think the arm code propagates carry through a mix of add and sub
instructions in some places. Maybe F2 needs to be incremented somewhere
for this to work, but that is probably still cheaper.

If this works, FOLD would turn into something like

    sldi    F0, $1, 32
    srdi    F1, $1, 32
    subfc   F2, $1, F0
    addme   F3, F1

(If you want to investigate this later on, that's fine too; I could merge
the code with the current folding logic.)

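As a sanity check of the first suggestion (subfic/subfze in place of
li/li/subfc/subfe), here is a minimal C sketch that models the carry (CA)
behaviour of those four instructions, assuming the usual Power ISA
definitions (subfc: ~RA + RB + 1, subfe: ~RA + RB + CA, subfic: ~RA + SI + 1,
subfze: ~RA + CA). It is not part of the patch; the struct and function
names are made up for illustration, and it says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a two-limb negation: low limb, high limb, final carry (CA) bit. */
struct neg2 { uint64_t f0, f1; unsigned ca; };

/* Model of a + b + cin on 64-bit registers; the carry out goes to *ca. */
uint64_t add_with_carry(uint64_t a, uint64_t b, unsigned cin, unsigned *ca)
{
  uint64_t s = a + b;
  uint64_t t = s + cin;
  *ca = (s < a) | (t < s);
  return t;
}

/* li F0,0 ; li F1,0 ; subfc F0,F2,F0 ; subfe F1,F3,F1 */
struct neg2 negate_li_subfc(uint64_t f2, uint64_t f3)
{
  struct neg2 r = { 0, 0, 0 };                     /* li F0,0 ; li F1,0 */
  r.f0 = add_with_carry(~f2, r.f0, 1, &r.ca);      /* subfc: ~RA + RB + 1 */
  r.f1 = add_with_carry(~f3, r.f1, r.ca, &r.ca);   /* subfe: ~RA + RB + CA */
  return r;
}

/* subfic F0,F2,0 ; subfze F1,F3 */
struct neg2 negate_subfic(uint64_t f2, uint64_t f3)
{
  struct neg2 r = { 0, 0, 0 };
  r.f0 = add_with_carry(~f2, 0, 1, &r.ca);         /* subfic: ~RA + 0 + 1 */
  r.f1 = add_with_carry(~f3, 0, r.ca, &r.ca);      /* subfze: ~RA + CA */
  return r;
}

/* Returns 1 if both sequences agree with each other and with the plain
   two-limb negation -(f3:f2) mod 2^128 for the given input. */
int negate_sequences_agree(uint64_t f2, uint64_t f3)
{
  struct neg2 a = negate_li_subfc(f2, f3);
  struct neg2 b = negate_subfic(f2, f3);
  uint64_t lo = 0 - f2;
  uint64_t hi = 0 - f3 - (f2 != 0);
  return a.f0 == b.f0 && a.f1 == b.f1 && a.ca == b.ca
    && a.f0 == lo && a.f1 == hi;
}

/* Spot check with the inputs FOLD produces: F2 = x << 32, F3 = x >> 32. */
void check_negate(void)
{
  const uint64_t xs[] = { 0, 1, 0xffffffffULL, 0x100000000ULL, UINT64_MAX };
  for (unsigned i = 0; i < sizeof xs / sizeof xs[0]; i++)
    assert(negate_sequences_agree(xs[i] << 32, xs[i] >> 32));
}
```

In this model the two sequences produce the same F0, F1 and final CA, so
any difference would be in instruction count and timing only.
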
> +    C If carry, we need to add in
> +    C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
> +    li      F0, 0
> +    addze   F0, F0
> +    neg     F2, F0
> +    sldi    F1, F2, 32
> +    srdi    F3, F2, 32
> +    li      U7, -2
> +    and     F3, F3, U7

I think the three instructions to set F3 could be replaced with

    srdi    F3, F2, 33
    sldi    F3, F3, 1

With F2 all ones, that gives 0x7fffffff << 1 = 0xfffffffe, the same value
as the and with -2. Or maybe the and operation is faster than shift?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
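
For checking the shift-based computation of F3 against the and-based one,
a small C model is enough, since F2 is either 0 or all ones at that point
(it is the negated carry bit). This is an illustrative sketch, not part of
the patch; the function names are hypothetical and the shift count is left
as a parameter so different counts can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* srdi F3,F2,32 ; li U7,-2 ; and F3,F3,U7 */
uint64_t f3_with_and(uint64_t f2)
{
  return (f2 >> 32) & ~(uint64_t) 1;
}

/* srdi F3,F2,n ; sldi F3,F3,1 */
uint64_t f3_with_shifts(uint64_t f2, unsigned n)
{
  return (f2 >> n) << 1;
}

/* Compare both variants on the only two values F2 can take here. */
void check_f3(unsigned n)
{
  assert(f3_with_shifts(0, n) == f3_with_and(0));
  assert(f3_with_shifts(UINT64_MAX, n) == f3_with_and(UINT64_MAX));
}
```

With a shift count of 33 the two variants agree, giving 0xfffffffe for
F2 = all ones, which is the top limb of 2^256 - p. A count of 31 would
instead give 0x3fffffffe.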