Amitay Isaacs <ami...@ozlabs.org> writes:

> --- /dev/null
> +++ b/powerpc64/ecc-secp256r1-redc.asm
> @@ -0,0 +1,144 @@
> +C powerpc64/ecc-secp256r1-redc.asm
> +ifelse(`
> +   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
> +
> +   Based on x86_64/ecc-secp256r1-redc.asm

Looks good, and it seems method follows the x86_64 version closely. I
just checked in a correction and a clarification to the comments to the
x86_64 version.

A few comments below.

> +C Register usage:
> +
> +define(`SP', `r1')
> +
> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`F0', `r3')
> +define(`F1', `r6')
> +define(`F2', `r7')
> +define(`F3', `r8')
> +
> +define(`U0', `r9')
> +define(`U1', `r10')
> +define(`U2', `r11')
> +define(`U3', `r12')
> +define(`U4', `r14')
> +define(`U5', `r15')
> +define(`U6', `r16')
> +define(`U7', `r17')

One could save one register by letting U7 and XP overlap, since XP isn't
used after loading U7.

> +     .file "ecc-secp256r1-redc.asm"
> +
> +C FOLD(x), sets (F3,F2,F1,F0)  <-- [(x << 224) - (x << 192) - (x << 96)] >> 
> 64
> +define(`FOLD', `
> +     sldi    F2, $1, 32
> +     srdi    F3, $1, 32
> +     li      F0, 0
> +     li      F1, 0
> +     subfc   F0, F2, F0
> +     subfe   F1, F3, F1

I think the 

        li      F0, 0
        li      F1, 0
        subfc   F0, F2, F0
        subfe   F1, F3, F1

could be replaced with 

        subfic  F0, F2, 0    C "negate with borrow"
        subfze  F1, F3 

If that is measurably faster, I can't say. 

Another option: Since powerpc, like arm, seems to use the proper two's
complement convention that "borrow" is not carry, maybe we don't need to
negate to F0 and F1 at all, and instead change the later subtraction, replacing

        subfc   U1, F0, U1
        subfe   U2, F1, U2
        subfe   U3, F2, U3
        subfe   U0, F3, U0

with

        addc    U1, F0, U1
        adde    U2, F1, U2
        subfe   U3, F2, U3
        subfe   U0, F3, U0

I haven't thought that through, but it does make some sense to me. I
think the arm code propagates carry through a mix of add and sub
instructions in a some places. Maybe F2 needs to be incremented
somewhere for this to work, but probably still cheaper. If this works,
FOLD would turn into something like

        sldi    F0, $1, 32
        srdi    F1, $1, 32
        subfc   F2, $1, F0
        addme   F3, F1

(If you want to investigate this later on, that's fine too, I could merge
the code with the current folding logic).

> +     C If carry, we need to add in
> +     C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
> +     li      F0, 0
> +     addze   F0, F0
> +     neg     F2, F0
> +     sldi    F1, F2, 32
> +     srdi    F3, F2, 32
> +     li      U7, -2
> +     and     F3, F3, U7

I think the three instructions to set F3 could be replaced with

        srdi    F3, F2, 31
        sldi    F3, F3, 1

Or maybe the and operation is faster than shift?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs

Reply via email to