Re: [PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp

2022-01-20 Thread Amitay Isaacs
Here's the updated code for P192 curve after simplifying C2 folding.

Amitay.
-- 

Retirement: When you quit working just before your heart does.
Retirement: When you quit working just before your heart does.
C powerpc64/ecc-secp192r1-modp.asm

ifelse(`
   Copyright (C) 2021 Amitay Isaacs, IBM Corporation

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

.file "ecc-secp192r1-modp.asm"

define(`RP', `r4')
define(`XP', `r5')

define(`T0', `r6')
define(`T1', `r7')
define(`T2', `r8')
define(`T3', `r9')
define(`C1', `r10')
define(`C2', `r11')

C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
.text
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_ecc_secp192r1_modp)
ld  T0, 0(XP)
ld  T1, 8(XP)
ld  T2, 16(XP)

li  C1, 0
li  C2, 0

ld  T3, 24(XP)
addcT0, T3, T0
addeT1, T3, T1
addze   T2, T2
addze   C1, C1

ld  T3, 32(XP)
addcT1, T3, T1
addeT2, T3, T2
addze   C1, C1

ld  T3, 40(XP)
addcT0, T3, T0
addeT1, T3, T1
addeT2, T3, T2
addze   C1, C1

addcT0, C1, T0
addeT1, C1, T1
addze   T2, T2
addze   C2, C2

addcT0, C2, T0
addeT1, C2, T1
addze   T2, T2

std T0, 0(RP)
std T1, 8(RP)
std T2, 16(RP)

blr
EPILOGUE(_nettle_ecc_secp192r1_modp)
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp

2021-11-30 Thread Amitay Isaacs
On Tue, 2021-11-30 at 20:17 +0100, Niels Möller wrote:
> Amitay Isaacs  writes:
> 
> > +   .file "ecc-secp192r1-modp.asm"
> 
> Thanks, I'm looking at this file first (being the simplest, even
> though
> the security level of this curve is a bit low for current usage, so
> performance not of so great importance).

Yes. The main focus was on curves p256 and higher.  For completeness
sake I added the assembly for p192 and p224 curves.

> 
> I'm quite new to powerpc, so I'm refering to the instruction
> reference,
> and trying to learn as we go along. It seems addc is addition with
> carry
> output (but no carry input), adde is addition with carry input and
> output, and addze is addition of zero with carry input and output.
> 
> > +define(`RP', `r4')
> > +define(`XP', `r5')
> > +
> > +define(`T0', `r6')
> > +define(`T1', `r7')
> > +define(`T2', `r8')
> > +define(`T3', `r9')
> > +define(`C1', `r10')
> > +define(`C2', `r11')
> 
> As I understand it, we could also use register r3 (unused input
> argument), but we don't need to, since we have enough free scratch
> registers.

Generally I avoided using r3 as it is the first input argument, but
also serves as the return value of the function.  So unless we are
running short of registers r3 is left untouched.  (Even though there
are 32 registers, only some of them can be used without saving them
first.)  For example, r3 is used in p224 modp implementation.

> 
> > +   C void ecc_secp192r1_modp (const struct ecc_modulo *m,
> > mp_limb_t *rp)
> > +   .text
> > +define(`FUNC_ALIGN', `5')
> > +PROLOGUE(_nettle_ecc_secp192r1_modp)
> > +   ld  T0, 0(XP)
> > +   ld  T1, 8(XP)
> > +   ld  T2, 16(XP)
> > +
> > +   li  C1, 0
> > +   li  C2, 0
> > +
> > +   ld  T3, 24(XP)
> > +   addcT0, T3, T0
> > +   addeT1, T3, T1
> > +   addze   T2, T2
> > +   addze   C1, C1
> > +
> > +   ld  T3, 32(XP)
> > +   addcT1, T3, T1
> > +   addeT2, T3, T2
> > +   addze   C1, C1
> > +
> > +   ld  T3, 40(XP)
> > +   addcT0, T3, T0
> > +   addeT1, T3, T1
> > +   addeT2, T3, T2
> > +   addze   C1, C1
> 
> To analyze what we are doing, I'm using the Nettle and GMP convention
> that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1
> (mod
> p). Denote the six input words as
> 
>   
> 
> representing the number 
> 
>   B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
> 
> The accumulation above, as I understand it, computes
> 
>    =  + a_3 (B+1) + a_4 (B^2 + B) 
>     + a_5 (B^2 + B + 1>
> 
> or more graphically,
> 
>   a_2 a_2 a_1
>   a_3 a_3
>   a_4 a_4
>     + a_5 a_5 a_5
>   ---
>   c_1 t_2 t_1 t_0
> 
> This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of
> the
> addze instructions can increment it).
> 
> This looks nice, and I think it is pretty efficient too. It looks a
> bit
> different from what the x86_64 code is doing; maybe the latter could
> be
> improved.
> 
> > +   addcT0, C1, T0
> > +   addeT1, C1, T1
> > +   addze   T2, T2
> > +   addze   C2, C2
> 
> Above, c_1 is folded in at the right places, 
> 
>     <--   + c_1 (B + 1)
> 
> This number is < B^3 + 3 (B+1). This implies that in the (quite
> unlikely) case we get carry out, i.e., c_2 = 1, then the value of the
> low three words is < 3 (B+1). That means that there can be no new
> carry
> out when folding c_2.
> 
> > +   li  C1, 0
> > +   addcT0, C2, T0
> > +   addeT1, C2, T1
> > +   addze   T2, T2
> > +   addze   C1, C1
> > +
> > +   addcT0, C1, T0
> > +   addeT1, C1, T1
> > +   addze   T2, T2
> 
> So I think this final folding could be reduced to just
> 
> addcT0, C2, T0
> addeT1, C2, T1
> addze   T2, T2
> 
> There's no carry out, from this, because either C2 was zero, or T2
> was
> small, <= 3. Does that make sense?

This was the first code I wrote using the exact calculation you have
outlined above.  However, I was not sure about the sizes of the carry
(C1 and C2).  I did notice the x86 code short-cutting the C2 folding,
but the reasoning was not apparent.

Thank you for explaining the bounds calculation.  That does help me
understand why C2 folding can be simplified as you have suggested.

> 
> > +   std T0, 0(RP)
> > +   std T1, 8(RP)
> > +   std T2, 16(RP)
> > +
> > +   blr
> > +EPILOGUE(_nettle_ecc_secp192r1_modp)
> 
> Regards,
> /Niels
> 

Amitay.
-- 

Good literature is about Love and War.
Junk Fiction is about Sex and Violence.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp

2021-11-30 Thread Niels Möller
Amitay Isaacs  writes:

> + .file "ecc-secp192r1-modp.asm"

Thanks, I'm looking at this file first (being the simplest, even though
the security level of this curve is a bit low for current usage, so
performance not of so great importance).

I'm quite new to powerpc, so I'm refering to the instruction reference,
and trying to learn as we go along. It seems addc is addition with carry
output (but no carry input), adde is addition with carry input and
output, and addze is addition of zero with carry input and output.

> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`T0', `r6')
> +define(`T1', `r7')
> +define(`T2', `r8')
> +define(`T3', `r9')
> +define(`C1', `r10')
> +define(`C2', `r11')

As I understand it, we could also use register r3 (unused input
argument), but we don't need to, since we have enough free scratch
registers.

> + C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
> + .text
> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_secp192r1_modp)
> + ld  T0, 0(XP)
> + ld  T1, 8(XP)
> + ld  T2, 16(XP)
> +
> + li  C1, 0
> + li  C2, 0
> +
> + ld  T3, 24(XP)
> + addcT0, T3, T0
> + addeT1, T3, T1
> + addze   T2, T2
> + addze   C1, C1
> +
> + ld  T3, 32(XP)
> + addcT1, T3, T1
> + addeT2, T3, T2
> + addze   C1, C1
> +
> + ld  T3, 40(XP)
> + addcT0, T3, T0
> + addeT1, T3, T1
> + addeT2, T3, T2
> + addze   C1, C1

To analyze what we are doing, I'm using the Nettle and GMP convention
that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod
p). Denote the six input words as

  

representing the number 

  B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0

The accumulation above, as I understand it, computes

   =  + a_3 (B+1) + a_4 (B^2 + B) 
+ a_5 (B^2 + B + 1>

or more graphically,

  a_2 a_2 a_1
  a_3 a_3
  a_4 a_4
+ a_5 a_5 a_5
  ---
  c_1 t_2 t_1 t_0

This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the
addze instructions can increment it).

This looks nice, and I think it is pretty efficient too. It looks a bit
different from what the x86_64 code is doing; maybe the latter could be
improved.

> + addcT0, C1, T0
> + addeT1, C1, T1
> + addze   T2, T2
> + addze   C2, C2

Above, c_1 is folded in at the right places, 

<--   + c_1 (B + 1)

This number is < B^3 + 3 (B+1). This implies that in the (quite
unlikely) case we get carry out, i.e., c_2 = 1, then the value of the
low three words is < 3 (B+1). That means that there can be no new carry
out when folding c_2.

> + li  C1, 0
> + addcT0, C2, T0
> + addeT1, C2, T1
> + addze   T2, T2
> + addze   C1, C1
> +
> + addcT0, C1, T0
> + addeT1, C1, T1
> + addze   T2, T2

So I think this final folding could be reduced to just

addcT0, C2, T0
addeT1, C2, T1
addze   T2, T2

There's no carry out, from this, because either C2 was zero, or T2 was
small, <= 3. Does that make sense?

> + std T0, 0(RP)
> + std T1, 8(RP)
> + std T2, 16(RP)
> +
> + blr
> +EPILOGUE(_nettle_ecc_secp192r1_modp)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


[PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp

2021-11-25 Thread Amitay Isaacs
Signed-off-by: Amitay Isaacs 
---
 powerpc64/ecc-secp192r1-modp.asm | 93 
 1 file changed, 93 insertions(+)
 create mode 100644 powerpc64/ecc-secp192r1-modp.asm

diff --git a/powerpc64/ecc-secp192r1-modp.asm b/powerpc64/ecc-secp192r1-modp.asm
new file mode 100644
index ..97c71a83
--- /dev/null
+++ b/powerpc64/ecc-secp192r1-modp.asm
@@ -0,0 +1,93 @@
+C powerpc64/ecc-secp192r1-modp.asm
+
+ifelse(`
+   Copyright (C) 2021 Amitay Isaacs, IBM Corporation
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+ * the GNU Lesser General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your
+   option) any later version.
+
+   or
+
+ * the GNU General Public License as published by the Free
+   Software Foundation; either version 2 of the License, or (at your
+   option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+')
+
+   .file "ecc-secp192r1-modp.asm"
+
+define(`RP', `r4')
+define(`XP', `r5')
+
+define(`T0', `r6')
+define(`T1', `r7')
+define(`T2', `r8')
+define(`T3', `r9')
+define(`C1', `r10')
+define(`C2', `r11')
+
+   C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
+   .text
+define(`FUNC_ALIGN', `5')
+PROLOGUE(_nettle_ecc_secp192r1_modp)
+   ld  T0, 0(XP)
+   ld  T1, 8(XP)
+   ld  T2, 16(XP)
+
+   li  C1, 0
+   li  C2, 0
+
+   ld  T3, 24(XP)
+   addcT0, T3, T0
+   addeT1, T3, T1
+   addze   T2, T2
+   addze   C1, C1
+
+   ld  T3, 32(XP)
+   addcT1, T3, T1
+   addeT2, T3, T2
+   addze   C1, C1
+
+   ld  T3, 40(XP)
+   addcT0, T3, T0
+   addeT1, T3, T1
+   addeT2, T3, T2
+   addze   C1, C1
+
+   addcT0, C1, T0
+   addeT1, C1, T1
+   addze   T2, T2
+   addze   C2, C2
+
+   li  C1, 0
+   addcT0, C2, T0
+   addeT1, C2, T1
+   addze   T2, T2
+   addze   C1, C1
+
+   addcT0, C1, T0
+   addeT1, C1, T1
+   addze   T2, T2
+
+   std T0, 0(RP)
+   std T1, 8(RP)
+   std T2, 16(RP)
+
+   blr
+EPILOGUE(_nettle_ecc_secp192r1_modp)
-- 
2.33.1

___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs