Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448
ni...@lysator.liu.se (Niels Möller) writes: > Performance for the scalar multiplication primitives seem to be slower > than secp384 and slightly faster than secp521, and looking at point > addition, it's slower than secp521. I hope that will be improved a quite > a bit with an optimized mod operation for the curve448 prime. I've tried out this mod function (for 64-bit): static void ecc_448_modp(const struct ecc_modulo *m, mp_limb_t *rp) { /* Let B = 2^64, b = 2^32 = sqrt(B). p = B^7 - b B^3 - 1 ==> B^7 = b B^3 + 1 {x_{13}, ..., x_0} = {x_6,...,x_0} + {x_{10},...,x_7} + 2 {x_{13},x_{12}, x_{11}} B^4 + b {x_{10},...,x_7,x_{13},x_{12},x_{11} */ mp_limb_t c3, c4, c7; mp_limb_t *tp = rp + 7; c4 = mpn_add_n (rp, rp, rp + 7, 4); c7 = mpn_addmul_1 (rp + 4, rp + 11, 3, 2); c3 = mpn_addmul_1 (rp, rp + 11, 3, (mp_limb_t) 1 << 32); c7 += mpn_addmul_1 (rp + 3, rp + 7, 4, (mp_limb_t) 1 << 32); tp[0] = c7; tp[1] = tp[2] = 0; tp[3] = c3 + (c7 << 32); tp[4] = c4 + (c7 >> 32) + (tp[3] < c3); tp[5] = tp[6] = 0; c7 = mpn_add_n (rp, rp, tp, 7); c7 = cnd_add_n (c7, rp, m->B, 7); assert(c7 == 0); } This gives a speedup of 85% over the general ecc_mod (on my machine), and gives about 35% speedup for scalar multiplication (both mul_g and mul_a). So with this change, performance of mul_g and mul_1 is roughly midway between secp384 and secp521. Not sure if replacing the addmul_1 calls with shifts is worthwhile for the C code (we'll get more function calls and more passes over the data, which should still be worthwhile for machines with slow multiplication), but for assembly implementation, the addmul_1(..., 2) call should be adds only, in registers, and the addmul_1(,..., 1<<32) should be shift and add, preferably in registers. I'm going to leave randomized testing running for a few hours. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: rsa too slow?
Nikos Mavrogiannopoulos writes: > I got pinged by someone testing the performance of TLS handshakes and > it seems that gnutls/nettle with RSA is significantly slower than > openssl. To quote the NEWS file for Nettle-3.4.1: Performance regression: * All RSA private key operations employing RSA blinding, i.e., rsa_decrypt_tr, rsa_*_sign_tr, the new rsa_sec_decrypt, and rsa_compute_root_tr, are significantly slower. This is because (i) RSA blinding now use side-channel silent operations, (ii) blinding includes a modular inversion, and (iii) side-channel silent modular inversion, implemented as mpn_sec_invert, is very expensive. A 60% slowdown for 2048-bit RSA keys have been measured. > name size sign/ms verify/ms > rsa 20480.8881 27.1422 >rsa (openssl) 20481.4249 45.2295 > > rsa-tr 20480.4257 29.1152 > rsa-tr (openssl) 20481.3735 46.1692 The above explains why Nettle's rsa-tr is much slower than the non-tr version. But it's disappointing that there also looks like a pretty large general slowdown. I think most of the running time for RSA operations, except for modular inversion, are in wel-tuned GMP functions. For best speed, make sure GMP is either compiled with --enable-fat, or configured for the machine it's running on, and use a recent version. To track down any problems, it's important to know more precisely what processor it's running on and how gmp was configured. For what it's worth, this is what I get on the laptop (quite old, "U4100 @ 1.30GHz" according to /proc/cpuinfo, should probably be "SU4100", detected as core2-pc-linux-gnu by gmp) I'm sitting in front of right now: $ ../examples/hogweed-benchmark rsa name size sign/ms verify/ms rsa 20480.21067.2703 rsa-tr 20480.11586.8202 rsa (openssl) 20480.20246.4992 rsa-tr (openssl) 20480.19596.4983 So here, Nettle is slightly faster except for side-channel silent signing. It's a bit odd that *verify* for rsa-tr appears slower than the non-tr, since no secrets are involved, and the same function is called. May be a problem in the benchmark program. Is "Smooth CRT" something that I should look up? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448
Daiki Ueno writes: > Thank you. By the way, one thing I realized in my past rebase attempts > is that, this commit doing the final reduction of a value by mod q seems > to be incorrect for curve448 and should probably be reverted: > > commit 6cf6abd68eb3d6c8c8e5ab217be734f9c537037f > Author: Daiki Ueno > Date: Sat Aug 5 09:43:47 2017 +0200 > > ecc-eh-to-a, eddsa-sign: Parameterize hard-coded value > > This allows the same code to be reused in curve448 and Ed448. > > Signed-off-by: Daiki Ueno > > - shift = 252 - GMP_NUMB_BITS * (ecc->p.size - 1); > + shift = ecc->q.bit_size - 1 - GMP_NUMB_BITS * (ecc->p.size - 1); > cy = mpn_submul_1 (r, ecc->q.m, ecc->p.size, > r[ecc->p.size-1] >> shift); > > For curve25519, q is defined as: > > 2^252 + 0x14def9dea2f79cd65812631a5cf5d3ed > > whose bit pattern starts with 0x1000, so r - q * (r>>252) should > work. > > On the other hand, for curve448, q is defined as: > > 2^446 - 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d > > whose bit pattern starts with 0x. In that case the formula (r - q * > (r>>445)) could be incorrect due to the accumulated errors by > multiplication (i.e. q * 0x7FFF...). Good catch! Right, this needs a bit more analysis. Fur curve25519, the subtraction can underflow (unlikely), which is addressed with the conditional addition a few lines down. > Therefore, I suggest using r - q * (r>>446) instead, though it would > introduce another hard-coded value. But for curve448, that subtraction will never underflow, instead it will sometimes produce a non-canonical result, r >= q. So correcting the shift isn't enough. On the other hand, this code should perhaps be deleted altogether, I think h_to_a with flags == 2 is used only for ecdsa. It might make sense to instead add a function pointer to struct ecc_modulo to do canonical reduction; that's needed in a few different places, not only here. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH 0/8] Implement Curve448 ECDH and Ed448
ni...@lysator.liu.se (Niels Möller) writes: > Daiki Ueno writes: > >> From 8bc6e735d4b40cbab5e187a28e01b63a04ecd92b Mon Sep 17 00:00:00 2001 >> From: Daiki Ueno >> Date: Fri, 23 Jun 2017 17:26:18 +0200 >> Subject: [PATCH 2/4] Implement Curve448 primitives >> >> This patch adds the necessary primitives for "curve448", defined in >> RFC 7748. Those primitives are namely: addition, doubling, scalar >> multiplication of the generator or an arbitrary point, inversion, and >> square root. > > At last, I've now merged this onto the curve448 branch. That's a great news, thank you Niels! > I see you've made some chenges to the needed scratch space, if I > understand it correctly, you need to allow h_to_a_itch larger than > mul_itch or mul_g_itch. You increase the value of ECC_ECDSA_SIGN_ITCH > and add a new ECC_ECDSA_KEYGEN_ITCH. Can you comment on that? > > The only reason ECDSA is affected at all by curve448, is that we have > tests for ecdsa over the curve25519 and curve448, even though that's > not the way these curves are intended to be used. Maybe that should > just be deleted. Indeed, I agree to remove the tests and affected parts in the library. > Performance for the scalar multiplication primitives seem to be slower > than secp384 and slightly faster than secp521, and looking at point > addition, it's slower than secp521. I hope that will be improved a quite > a bit with an optimized mod operation for the curve448 prime. > >> While the interface is similar to curve25519, the implementation is >> slightly different. For curve25519, the Pippenger tables are >> generated through the coordinates on the Montgomery curve. On the >> other hand, the tables for curve448 are directly generated from the >> coordinates on the corresponding Edwards curve ("edwards448"). > > This is no longer the case, since the handling curve 25519 was changed > early on, based on your patches back then. Thank you. By the way, one thing I realized in my past rebase attempts is that, this commit doing the final reduction of a value by mod q seems to be incorrect for curve448 and should probably be reverted: commit 6cf6abd68eb3d6c8c8e5ab217be734f9c537037f Author: Daiki Ueno Date: Sat Aug 5 09:43:47 2017 +0200 ecc-eh-to-a, eddsa-sign: Parameterize hard-coded value This allows the same code to be reused in curve448 and Ed448. Signed-off-by: Daiki Ueno - shift = 252 - GMP_NUMB_BITS * (ecc->p.size - 1); + shift = ecc->q.bit_size - 1 - GMP_NUMB_BITS * (ecc->p.size - 1); cy = mpn_submul_1 (r, ecc->q.m, ecc->p.size, r[ecc->p.size-1] >> shift); For curve25519, q is defined as: 2^252 + 0x14def9dea2f79cd65812631a5cf5d3ed whose bit pattern starts with 0x1000, so r - q * (r>>252) should work. On the other hand, for curve448, q is defined as: 2^446 - 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d whose bit pattern starts with 0x. In that case the formula (r - q * (r>>445)) could be incorrect due to the accumulated errors by multiplication (i.e. q * 0x7FFF...). Therefore, I suggest using r - q * (r>>446) instead, though it would introduce another hard-coded value. Regards, -- Daiki Ueno ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: rsa too slow?
On Mon, 2019-12-02 at 13:24 +0100, Nikos Mavrogiannopoulos wrote: > Hi, > I got pinged by someone testing the performance of TLS handshakes and > it seems that gnutls/nettle with RSA is significantly slower than > openssl. On the other hand, secp256r1 and ed25519 are faster. (btw. > both openssl and gnutls/nettle are slower than rusttls). FYI last time I checked rusttls it does not employ any countermeasure, not even blinding, easy to be fast that way. > Nevertheless > the RSA caught my attention because I had the impression that nettle > was at some point equivalent if not faster. I see that the hogweed > benchmark values in nettle show a 3x difference in signing for the TR > version and ~2x for the unprotected. Going back to 3.1 did not affect > that. Was that always the case? If not any ideas what could have > caused that? Did we miss some optimizations? (from a quick review of > openssl' RSA code, I see that smooth CRT RSA was added relatively > recently, but could that get such a big performance benefit?) Would you be able to measure OpenSSL's RSA from a release before the smooth CRt was added ? > name size sign/ms verify/ms > rsa 20480.8881 27.1422 >rsa (openssl) 20481.4249 45.2295 > > rsa-tr 20480.4257 29.1152 > rsa-tr (openssl) 20481.3735 46.1692 > > regards, > Nikos > ___ > nettle-bugs mailing list > nettle-bugs@lists.lysator.liu.se > http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs -- Simo Sorce RHEL Crypto Team Red Hat, Inc ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
rsa too slow?
Hi, I got pinged by someone testing the performance of TLS handshakes and it seems that gnutls/nettle with RSA is significantly slower than openssl. On the other hand, secp256r1 and ed25519 are faster. (btw. both openssl and gnutls/nettle are slower than rusttls). Nevertheless the RSA caught my attention because I had the impression that nettle was at some point equivalent if not faster. I see that the hogweed benchmark values in nettle show a 3x difference in signing for the TR version and ~2x for the unprotected. Going back to 3.1 did not affect that. Was that always the case? If not any ideas what could have caused that? Did we miss some optimizations? (from a quick review of openssl' RSA code, I see that smooth CRT RSA was added relatively recently, but could that get such a big performance benefit?) name size sign/ms verify/ms rsa 20480.8881 27.1422 rsa (openssl) 20481.4249 45.2295 rsa-tr 20480.4257 29.1152 rsa-tr (openssl) 20481.3735 46.1692 regards, Nikos ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs