Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448

2019-12-02 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Performance for the scalar multiplication primitives seem to be slower
> than secp384 and slightly faster than secp521, and looking at point
> addition, it's slower than secp521. I hope that will be improved a quite
> a bit with an optimized mod operation for the curve448 prime.

I've tried out this mod function (for 64-bit):

static void
ecc_448_modp(const struct ecc_modulo *m, mp_limb_t *rp)
{
  /* Let B = 2^64, b = 2^32 = sqrt(B).
 p = B^7 - b B^3 - 1 ==> B^7 = b B^3 + 1

 {x_{13}, ..., x_0} =
   {x_6,...,x_0} 
 + {x_{10},...,x_7}
 + 2 {x_{13},x_{12}, x_{11}} B^4 
 + b {x_{10},...,x_7,x_{13},x_{12},x_{11}
  */
  mp_limb_t c3, c4, c7;
  mp_limb_t *tp = rp + 7;

  c4 = mpn_add_n (rp, rp, rp + 7, 4);
  c7 = mpn_addmul_1 (rp + 4, rp + 11, 3, 2);
  c3 = mpn_addmul_1 (rp, rp + 11, 3, (mp_limb_t) 1 << 32);
  c7 += mpn_addmul_1 (rp + 3, rp + 7, 4, (mp_limb_t) 1 << 32);
  tp[0] = c7;
  tp[1] = tp[2] = 0;
  tp[3] = c3 + (c7 << 32);
  tp[4] = c4 + (c7 >> 32) + (tp[3] < c3);
  tp[5] = tp[6] = 0;
  c7 = mpn_add_n (rp, rp, tp, 7);
  c7 = cnd_add_n (c7, rp, m->B, 7);
  assert(c7 == 0);  
}

This gives a speedup of 85% over the general ecc_mod (on my machine),
and gives about 35% speedup for scalar multiplication (both mul_g and
mul_a). So with this change, performance of mul_g and mul_1 is roughly
midway between secp384 and secp521.

Not sure if replacing the addmul_1 calls with shifts is worthwhile for
the C code (we'll get more function calls and more passes over the data,
which should still be worthwhile for machines with slow multiplication),
but for assembly implementation, the addmul_1(..., 2) call should be adds
only, in registers, and the addmul_1(,..., 1<<32) should be shift and
add, preferably in registers.

I'm going to leave randomized testing running for a few hours.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: rsa too slow?

2019-12-02 Thread Niels Möller
Nikos Mavrogiannopoulos  writes:

>  I got pinged by someone testing the performance of TLS handshakes and
> it seems that gnutls/nettle with RSA is significantly slower than
> openssl. 

To quote the NEWS file for Nettle-3.4.1:

 Performance regression:

 * All RSA private key operations employing RSA blinding, i.e.,
   rsa_decrypt_tr, rsa_*_sign_tr, the new rsa_sec_decrypt, and
   rsa_compute_root_tr, are significantly slower. This is
   because (i) RSA blinding now use side-channel silent
   operations, (ii) blinding includes a modular inversion, and
   (iii) side-channel silent modular inversion, implemented as
   mpn_sec_invert, is very expensive. A 60% slowdown for
   2048-bit RSA keys have been measured.

> name size   sign/ms verify/ms
>  rsa 20480.8881   27.1422
>rsa (openssl) 20481.4249   45.2295
>
>   rsa-tr 20480.4257   29.1152
> rsa-tr (openssl) 20481.3735   46.1692

The above explains why Nettle's rsa-tr is much slower than the non-tr
version. But it's disappointing that there also looks like a pretty
large general slowdown.

I think most of the running time for RSA operations, except for modular
inversion, are in wel-tuned GMP functions. For best speed, make sure GMP
is either compiled with --enable-fat, or configured for the machine it's
running on, and use a recent version. To track down any problems, it's
important to know more precisely what processor it's running on and how
gmp was configured.

For what it's worth, this is what I get on the laptop (quite old, "U4100
@ 1.30GHz" according to /proc/cpuinfo, should probably be "SU4100",
detected as core2-pc-linux-gnu by gmp) I'm sitting in front of right
now:

$ ../examples/hogweed-benchmark rsa
name size   sign/ms verify/ms
 rsa 20480.21067.2703
  rsa-tr 20480.11586.8202
   rsa (openssl) 20480.20246.4992
rsa-tr (openssl) 20480.19596.4983

So here, Nettle is slightly faster except for side-channel silent
signing. It's a bit odd that *verify* for rsa-tr appears slower than the
non-tr, since no secrets are involved, and the same function is called.
May be a problem in the benchmark program.

Is "Smooth CRT" something that I should look up?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448

2019-12-02 Thread Niels Möller
Daiki Ueno  writes:

> Thank you.  By the way, one thing I realized in my past rebase attempts
> is that, this commit doing the final reduction of a value by mod q seems
> to be incorrect for curve448 and should probably be reverted:
>
>   commit 6cf6abd68eb3d6c8c8e5ab217be734f9c537037f
>   Author: Daiki Ueno 
>   Date:   Sat Aug 5 09:43:47 2017 +0200
>
>   ecc-eh-to-a, eddsa-sign: Parameterize hard-coded value
>   
>   This allows the same code to be reused in curve448 and Ed448.
>   
>   Signed-off-by: Daiki Ueno 
>
> - shift = 252 - GMP_NUMB_BITS * (ecc->p.size - 1);
> + shift = ecc->q.bit_size - 1 - GMP_NUMB_BITS * (ecc->p.size - 1);
>   cy = mpn_submul_1 (r, ecc->q.m, ecc->p.size,
>  r[ecc->p.size-1] >> shift);
>
> For curve25519, q is defined as:
>
>   2^252 + 0x14def9dea2f79cd65812631a5cf5d3ed
>
> whose bit pattern starts with 0x1000, so r - q * (r>>252) should
> work.
>
> On the other hand, for curve448, q is defined as:
>
>   2^446 - 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d
>
> whose bit pattern starts with 0x.  In that case the formula (r - q *
> (r>>445)) could be incorrect due to the accumulated errors by
> multiplication (i.e. q * 0x7FFF...).

Good catch! Right, this needs a bit more analysis. Fur curve25519, the
subtraction can underflow (unlikely), which is addressed with the
conditional addition a few lines down.

> Therefore, I suggest using r - q * (r>>446) instead, though it would
> introduce another hard-coded value.

But for curve448, that subtraction will never underflow, instead it will
sometimes produce a non-canonical result, r >= q. So correcting the
shift isn't enough.

On the other hand, this code should perhaps be deleted altogether, I
think h_to_a with flags == 2 is used only for ecdsa. It might make sense
to instead add a function pointer to struct ecc_modulo to do canonical
reduction; that's needed in a few different places, not only here.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448

2019-12-02 Thread Daiki Ueno
ni...@lysator.liu.se (Niels Möller) writes:

> Daiki Ueno  writes:
>
>> From 8bc6e735d4b40cbab5e187a28e01b63a04ecd92b Mon Sep 17 00:00:00 2001
>> From: Daiki Ueno 
>> Date: Fri, 23 Jun 2017 17:26:18 +0200
>> Subject: [PATCH 2/4] Implement Curve448 primitives
>>
>> This patch adds the necessary primitives for "curve448", defined in
>> RFC 7748.  Those primitives are namely: addition, doubling, scalar
>> multiplication of the generator or an arbitrary point, inversion, and
>> square root.
>
> At last, I've now merged this onto the curve448 branch.

That's a great news, thank you Niels!

> I see you've made some chenges to the needed scratch space, if I
> understand it correctly, you need to allow h_to_a_itch larger than
> mul_itch or mul_g_itch. You increase the value of ECC_ECDSA_SIGN_ITCH
> and add a new ECC_ECDSA_KEYGEN_ITCH. Can you comment on that?
>
> The only reason ECDSA is affected at all by curve448, is that we have
> tests for ecdsa over the curve25519 and curve448, even though that's
> not the way these curves are intended to be used. Maybe that should
> just be deleted.

Indeed, I agree to remove the tests and affected parts in the library.

> Performance for the scalar multiplication primitives seem to be slower
> than secp384 and slightly faster than secp521, and looking at point
> addition, it's slower than secp521. I hope that will be improved a quite
> a bit with an optimized mod operation for the curve448 prime.
>
>> While the interface is similar to curve25519, the implementation is
>> slightly different.  For curve25519, the Pippenger tables are
>> generated through the coordinates on the Montgomery curve.  On the
>> other hand, the tables for curve448 are directly generated from the
>> coordinates on the corresponding Edwards curve ("edwards448").
>
> This is no longer the case, since the handling curve 25519 was changed
> early on, based on your patches back then.

Thank you.  By the way, one thing I realized in my past rebase attempts
is that, this commit doing the final reduction of a value by mod q seems
to be incorrect for curve448 and should probably be reverted:

  commit 6cf6abd68eb3d6c8c8e5ab217be734f9c537037f
  Author: Daiki Ueno 
  Date:   Sat Aug 5 09:43:47 2017 +0200

  ecc-eh-to-a, eddsa-sign: Parameterize hard-coded value
  
  This allows the same code to be reused in curve448 and Ed448.
  
  Signed-off-by: Daiki Ueno 

- shift = 252 - GMP_NUMB_BITS * (ecc->p.size - 1);
+ shift = ecc->q.bit_size - 1 - GMP_NUMB_BITS * (ecc->p.size - 1);
  cy = mpn_submul_1 (r, ecc->q.m, ecc->p.size,
 r[ecc->p.size-1] >> shift);

For curve25519, q is defined as:

  2^252 + 0x14def9dea2f79cd65812631a5cf5d3ed

whose bit pattern starts with 0x1000, so r - q * (r>>252) should
work.

On the other hand, for curve448, q is defined as:

  2^446 - 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d

whose bit pattern starts with 0x.  In that case the formula (r - q *
(r>>445)) could be incorrect due to the accumulated errors by
multiplication (i.e. q * 0x7FFF...).

Therefore, I suggest using r - q * (r>>446) instead, though it would
introduce another hard-coded value.

Regards,
-- 
Daiki Ueno
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: rsa too slow?

2019-12-02 Thread Simo Sorce
On Mon, 2019-12-02 at 13:24 +0100, Nikos Mavrogiannopoulos wrote:
> Hi,
>  I got pinged by someone testing the performance of TLS handshakes and
> it seems that gnutls/nettle with RSA is significantly slower than
> openssl. On the other hand, secp256r1 and ed25519 are faster. (btw.
> both openssl and gnutls/nettle are slower than rusttls).

FYI last time I checked rusttls it does not employ any countermeasure,
not even blinding, easy to be fast that way.

>  Nevertheless
> the RSA caught my attention because I had the impression that nettle
> was at some point equivalent if not faster. I see that the hogweed
> benchmark values in nettle show a 3x difference in signing for the TR
> version and ~2x for the unprotected. Going back to 3.1 did not affect
> that. Was that always the case? If not any ideas what could have
> caused that? Did we miss some optimizations? (from a quick review of
> openssl' RSA code, I see that smooth CRT RSA was added relatively
> recently, but could that get such a big performance benefit?)

Would you be able to measure OpenSSL's RSA from a release before the
smooth CRt was added ?

> name size   sign/ms verify/ms
>  rsa 20480.8881   27.1422
>rsa (openssl) 20481.4249   45.2295
> 
>   rsa-tr 20480.4257   29.1152
> rsa-tr (openssl) 20481.3735   46.1692
> 
> regards,
> Nikos
> ___
> nettle-bugs mailing list
> nettle-bugs@lists.lysator.liu.se
> http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs

-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc




___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


rsa too slow?

2019-12-02 Thread Nikos Mavrogiannopoulos
Hi,
 I got pinged by someone testing the performance of TLS handshakes and
it seems that gnutls/nettle with RSA is significantly slower than
openssl. On the other hand, secp256r1 and ed25519 are faster. (btw.
both openssl and gnutls/nettle are slower than rusttls). Nevertheless
the RSA caught my attention because I had the impression that nettle
was at some point equivalent if not faster. I see that the hogweed
benchmark values in nettle show a 3x difference in signing for the TR
version and ~2x for the unprotected. Going back to 3.1 did not affect
that. Was that always the case? If not any ideas what could have
caused that? Did we miss some optimizations? (from a quick review of
openssl' RSA code, I see that smooth CRT RSA was added relatively
recently, but could that get such a big performance benefit?)

name size   sign/ms verify/ms
 rsa 20480.8881   27.1422
   rsa (openssl) 20481.4249   45.2295

  rsa-tr 20480.4257   29.1152
rsa-tr (openssl) 20481.3735   46.1692

regards,
Nikos
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs