Latency in polynomial evaluation
I think this is a promising alternative, if one would otherwise need to interleave a large number of blocks to get full utilization of the multipliers.

** How to choose **

When implementing one of these schemes, different processor resources may be the bottleneck. I'd expect it to be one of:

o Multiply latency, i.e., the latency of the dependency chain from one block to the next (including a few additions too, but multiply latency will be the main part). If this is the bottleneck, it means all other instructions can be scheduled in parallel, and the processor will sit idle for some cycles, waiting for a multiply to complete. Typical latency for a multiply is 5 times longer than for an addition (but the ratio differs quite a bit between processors, of course).

o Multiply throughput, i.e., the maximum number of (independent) multiply instructions that can be run per cycle. Typical number is 0.5 -- 2. If this is the bottleneck, the processor will spend some cycles idle, waiting for a multiplier to be ready to accept a new input.

o Instruction issue: a superscalar processor can issue several instructions in the same cycle, but there's a fixed, small limit. Typical number is 2 -- 6. So, e.g., if the processor can issue at most 4 instructions per cycle, the evaluation loop consists of 40 instructions, and the loop actually runs in close to 10 cycles per iteration, then instruction issue is the bottleneck.

The tricks discussed in this note are useful for finding an evaluation scheme where multiply latency isn't a bottleneck. But once a loop hits the limit on multiply throughput or instructions per cycle, other tricks are needed to optimize further. In particular, the postponed reduction has a cost in multiply throughput, since it needs some additional multiply instructions.
I think one should aim to hit the limit on multiply throughput; that one is hard to negotiate (it's possible to reduce the number of multiply instructions somewhat, by the Karatsuba trick, but due to the additional overhead, likely to be useful only on processors with particularly low multiply throughput). Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
Maamoun TK writes:

> Great! I believe this is the best we can get for processing one block.

One may be able to squeeze out one or two cycles more using the mulx extension, which should make it possible to eliminate some of the move instructions (I don't think moves cost any execution unit resources, but they do consume decoding resources).

> I'm trying to implement two-way interleaving using AVX extension and
> the main instruction of interest here is 'vpmuludq' that does double
> multiply operation

My manual seems a bit confused about whether it's called pmuludq or vpmuludq. But you're thinking of the instruction that does two 32x32 --> 64 multiplies? It will be interesting to see how that works out! It does half the work compared to a 64x64 --> 128 multiply instruction, but accumulation/folding may get more efficient by using vector registers. (There also seems to be an AVX variant doing four 32x32 --> 64 multiplies, using 256-bit registers).

> the main concern here is there's a shortage of XMM registers as
> there are 16 of them, I'm working on addressing this issue by using memory
> operands of key values for 'vpmuludq' and hope the processor cache do his
> thing here.

Reading cached values from memory is usually cheap. So probably fine as long as values that are modified are kept in registers.

> I'm expecting to complete the assembly implementation tomorrow.

If my analysis of the single-block code is right, I'd expect it to be rather important to trim the number of instructions per block.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
ni...@lysator.liu.se (Niels Möller) writes:

>> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
>
> And I've now tried the same method for the x86_64 implementation. See
> attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
>
> I'm not entirely sure cycle numbers are accurate, with the clock
> frequency not being fixed. I think the machine runs benchmarks at
> 2.1 GHz, and then this corresponds to 11.5 cycles per block, 0.7 cycles
> per byte, 4 instructions per cycle, 0.5 multiply instructions per cycle.
>
> This laptop has an AMD zen2 processor, which should be capable of
> issuing four instructions per cycle and completing one multiply
> instruction per cycle (according to
> https://gmplib.org/~tege/x86-timing.pdf).
>
> This seems to indicate that on this hardware, speed is not limited by
> multiplier throughput; instead, the bottleneck is instruction
> decoding/issuing, with max four instructions per cycle.

Benchmarked also on my other nearby x86_64 machine (intel broadwell processor). It's faster there too (from 1.4 GByte/s to 1.75). I'd expect it to be generally faster, and have pushed it to the master-updates branch.

I haven't looked that carefully at what the old code was doing, but I think the final folding for each block used a multiply instruction that then depends on the previous ones for that block, increasing the per-block latency. With the new code, all multiplies done for a block are independent of each other.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
ni...@lysator.liu.se (Niels Möller) writes:

>> This is the speed I get for C implementations of poly1305_update on my
>> x86_64 laptop:
>>
>> * Radix 26: 1.2 GByte/s (old code)
>>
>> * Radix 32: 1.3 GByte/s
>>
>> * Radix 64: 2.2 GByte/s
[...]
>> For comparison, the current x86_64 asm version: 2.5 GByte/s.
[...]
> I've tried reworking folding, to reduce latency [...] With this trick I
> get on the same machine
>
> Radix 32: 1.65 GByte/s
>
> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.

And I've now tried the same method for the x86_64 implementation. See attached file + needed patch to asm.m4. This gives 2.9 GByte/s.

I'm not entirely sure cycle numbers are accurate, with the clock frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz, and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, 0.5 multiply instructions per cycle.

This laptop has an AMD zen2 processor, which should be capable of issuing four instructions per cycle and completing one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).

This seems to indicate that on this hardware, speed is not limited by multiplier throughput; instead, the bottleneck is instruction decoding/issuing, with max four instructions per cycle.

Regards, /Niels

diff --git a/asm.m4 b/asm.m4
index 4ac21c20..60c66c25 100644
--- a/asm.m4
+++ b/asm.m4
@@ -94,10 +94,10 @@ C For 64-bit implementation
 STRUCTURE(P1305)
   STRUCT(R0, 8)
   STRUCT(R1, 8)
+  STRUCT(S0, 8)
   STRUCT(S1, 8)
-  STRUCT(PAD, 12)
-  STRUCT(H2, 4)
   STRUCT(H0, 8)
   STRUCT(H1, 8)
+  STRUCT(H2, 8)
 divert

C x86_64/poly1305-internal.asm

ifelse(`
   Copyright (C) 2013 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.
   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program. If
   not, see http://www.gnu.org/licenses/.
')

.file "poly1305-internal.asm"

C Registers mainly used by poly1305_block
define(`CTX', `%rdi')	C First argument to all functions
define(`KEY', `%rsi')
define(`MASK', `%r8')

	C _poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
	.text
	ALIGN(16)
PROLOGUE(_nettle_poly1305_set_key)
	W64_ENTRY(2,0)
	mov	$0x0ffffffc0fffffff, MASK
	mov	(KEY), %rax
	and	MASK, %rax
	and	$-4, MASK
	mov	%rax, P1305_R0 (CTX)
	imul	$5, %rax
	mov	%rax, P1305_S0 (CTX)
	mov	8(KEY), %rax
	and	MASK, %rax
	mov	%rax, P1305_R1 (CTX)
	shr	$2, %rax
	imul	$5, %rax
	mov	%rax, P1305_S1 (CTX)
	xor	XREG(%rax), XREG(%rax)
	mov	%rax, P1305_H0 (CTX)
	mov	%rax, P1305_H1 (CTX)
	mov	%rax, P1305_H2 (CTX)
	W64_EXIT(2,0)
	ret
undefine(`KEY')
undefine(`MASK')
EPILOGUE(_nettle_poly1305_set_key)

define(`T0', `%rcx')
define(`T1', `%rsi')	C Overlaps message input pointer.
define(`T2', `%r8')
define(`H0', `%r9')
define(`H1', `%r10')
define(`F0', `%r11')
define(`F1', `%r12')

C Compute in parallel
C
C   {H1,H0} = R0 T0 + S1 T1 + S0 (T2 >> 2)
C   {F1,F0} = R1 T0 + R0 T1 + S1 T2
C   T = R0 * (T2 & 3)
C
C Then accumulate as
C
C      +--+--+--+
C      |T |H1|H0|
C      +--+--+--+
C    + |F1|F0|
C   ---+--+--+--+
C      |H2|H1|H0|
C      +--+--+--+
C
	C _poly1305_block (struct poly1305_ctx *ctx, const uint8_t m[16], unsigned hi)
PROLOGUE(_nettle_poly1305_block)
	W64_ENTRY(3, 0)
	push	%r12
	mov	(%rsi), T0
	mov	8(%rsi), T1
	mov	XREG(%rdx), XREG(T2)	C Also zero extends
	add	P1305_H0 (CTX), T0
	adc	P1305_H1 (CTX), T1
	adc	P1305_H2 (CTX), T2
	mov	P1305_R1 (CTX), %rax
	mul	T0		C R1 T0
	mov	%rax, F0
	mov	%rdx, F1
	mov	T0, %rax	C Last use of T0 input
	mov	P1305_R0 (CTX), T0
	mul	T0		C R0*T0
	mov	%rax, H0
	mov	%rdx, H1
	mov	T
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
[...]without adding, to avoid the explicit clearing of F0 and F1? It may also be doable with one instruction less; the 5 instructions do 10 multiplies, but I think we use only 7, the rest must somehow be zeroed or ignored.

> xxmrgld VSR(TMP), VSR(TMP), VSR(ZERO)
> li IDX, 32
> xxswapd VSR(F0), VSR(F0)
> vadduqm F1, F1, TMP
> stxsdx VSR(F0), IDX, CTX
>
> li IDX, 40
> xxmrgld VSR(F0), VSR(ZERO), VSR(F0)
> vadduqm F1, F1, F0
> xxswapd VSR(F1), VSR(F1)
> stxvd2x VSR(F1), IDX, CTX

This looks a bit verbose, if what we need to do is just to add the high part of F0 to the low part of F1 (with carry to the high part of F1), and store the result?

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
ni...@lysator.liu.se (Niels Möller) writes:

> This is the speed I get for C implementations of poly1305_update on my
> x86_64 laptop:
>
> * Radix 26: 1.2 GByte/s (old code)
>
> * Radix 32: 1.3 GByte/s
>
> * Radix 64: 2.2 GByte/s
>
> It would be interesting with benchmarks on actual 32-bit hardware,
> 32-bit ARM likely being the most relevant arch.
>
> For comparison, the current x86_64 asm version: 2.5 GByte/s.

I've tried reworking the folding, to reduce latency. The idea is to let the most significant state word be close to a full word, rather than limited to <= 4 as in the previous version. When multiplying by r, split one of the multiplies to take out the low 2 bits. For the radix 64 version, that term is

  B^2 t_2 * r0

Split t_2 as 4*hi + lo; then this can be reduced to

  B^2 lo * r0 + hi * 5*r0

(using the same old B^2 = 5/4 (mod p) in a slightly different way). The 5*r0 fits one word and can be precomputed, and then this multiplication goes in parallel with the other multiplies, leaving no multiply in the final per-block folding.

With this trick I get on the same machine

  Radix 32: 1.65 GByte/s

  Radix 64: 2.75 GByte/s, i.e., faster than the current x86_64 asm version.

I haven't yet done a strict analysis of bounds on the state and temporaries, but I would expect that it works out with no possibility of overflow. See attached file. To fit the precomputed 5*r0 in a nice way I had to rearrange the unions in struct poly1305_ctx a bit; I also attach the patch to do this. Size of the struct should be the same, so I think it can be done without any abi bump.

Regards, /Niels

diff --git a/poly1305.h b/poly1305.h
index 99c63c8a..6c13a590 100644
--- a/poly1305.h
+++ b/poly1305.h
@@ -55,18 +55,15 @@ struct poly1305_ctx {
   /* Key, 128-bit value and some cached multiples. */
   union
   {
-    uint32_t r32[6];
-    uint64_t r64[3];
+    uint32_t r32[8];
+    uint64_t r64[4];
   } r;
-  uint32_t s32[3];

   /* State, represented as words of 26, 32 or 64 bits, depending on
      implementation.
   */
-  /* High bits first, to maintain alignment. */
-  uint32_t hh;
   union
   {
-    uint32_t h32[4];
-    uint64_t h64[2];
+    uint32_t h32[6];
+    uint64_t h64[3];
   } h;
 };

/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program. If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
# include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;
#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s0 r.r64[2]
#define s1 r.r64[3]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 h.h64[2]

void
_nettle_poly1305_set_key (struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64 (key);
  t1 = LE_READ_UINT64 (key + 8);

  ctx->r0 = t0 & UINT64_C (0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C (0x0ffffffc0ffffffc);
  ctx->s0 = 5*ctx->r0;
  ctx->s1 = 5*(ctx->r1 >> 2);
  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned m128)
{
  uint64_t t0, t1, t2;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) ctx->h1 + (t0 < ctx->h0) + LE_READ_UINT64(m+8);
  t1 = s;
  t2 = ctx->h2 + (s >> 64) + m128;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
     all the fk sums fit in 128 bits without overflow, with at least
     one bit margin. */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1) + M64(t2 >> 2, ctx->s0);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + M64(t2, ctx->s1)
    + ((nettle_uint128_t)((t2 & 3) * ctx->r0) << 64);

  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
}

/* Adds digest to the
Re: [PATCH v2 0/6] Add powerpc64 assembly for elliptic curves
Amitay Isaacs writes:

> I posted the modified codes in the earlier email thread, but I think
> posting them as a separate series will make them easier to cherry-pick.

Thanks!

> V2 changes:
> - Use actual register names when storing/restoring from stack
> - Drop m4 definitions which are not in use
> - Simplify C2 folding for P192 curve
>
> Amitay Isaacs (2):
>   ecc: Add powerpc64 assembly for ecc_192_modp
>   ecc: Add powerpc64 assembly for ecc_224_modp
>
> Martin Schwenke (4):
>   ecc: Add powerpc64 assembly for ecc_384_modp
>   ecc: Add powerpc64 assembly for ecc_521_modp
>   ecc: Add powerpc64 assembly for ecc_25519_modp
>   ecc: Add powerpc64 assembly for ecc_448_modp

I merged secp192, secp384, secp521 a few days ago. The other three, secp224, curve25519, curve448, look good too (with one very minor comment fix which I can take care of). I'll do some local testing, then merge to master-updates for a run of the ci system, including tests on ppc big-endian.

Regards, /Niels
Re: [PATCH v2 5/6] ecc: Add powerpc64 assembly for ecc_25519_modp
Amitay Isaacs writes:

> --- /dev/null
> +++ b/powerpc64/ecc-curve25519-modp.asm
> @@ -0,0 +1,101 @@
> +C powerpc64/ecc-25519-modp.asm

> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`U0', `r6') C Overlaps unused modulo input
> +define(`U1', `r7')
> +define(`U2', `r8')
> +define(`U3', `r9')
> +define(`T0', `r10')
> +define(`T1', `r11')
> +define(`M', `r12')
> +
> +define(`UN', r3)

The comment seems misplaced; it's UN / r3 that overlaps the unused input, right?

> + C void ecc_curve25519_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp)
> + .text
> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_curve25519_modp)
> +
> + C First fold the limbs affecting bit 255
> + ld UN, 56(XP)
> + li M, 38
> + mulhdu T1, M, UN
> + mulld UN, M, UN
> + ld U3, 24(XP)
> + li T0, 0
> + addc U3, UN, U3
> + adde T0, T1, T0
> +
> + ld UN, 40(XP)
> + mulhdu U2, M, UN
> + mulld UN, M, UN
> +
> + addc U3, U3, U3
> + adde T0, T0, T0
> + srdi U3, U3, 1	C Undo shift, clear high bit
> +
> + C Fold the high limb again, together with RP[5]
> + li T1, 19
> + mulld T0, T1, T0
> + ld U0, 0(XP)
> + ld U1, 8(XP)
> + ld T1, 16(XP)
> + addc U0, T0, U0
> + adde U1, UN, U1
> + ld T0, 32(XP)
> + adde U2, U2, T1
> + addze U3, U3
> +
> + mulhdu T1, M, T0
> + mulld T0, M, T0
> + addc U0, T0, U0
> + adde U1, T1, U1
> + std U0, 0(RP)
> + std U1, 8(RP)
> +
> + ld T0, 48(XP)
> + mulhdu T1, M, T0
> + mulld UN, M, T0
> + adde U2, UN, U2
> + adde U3, T1, U3
> + std U2, 16(RP)
> + std U3, 24(RP)
> +
> + blr
> +EPILOGUE(_nettle_ecc_curve25519_modp)

Looks good. I must admit that the x86_64 version this is based on is not so easy to follow.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
Maamoun TK writes:

> I made a performance test of this patch on the available architectures I
> have access to.
>
> Arm64 (gcc117 gfarm):
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 0.92 GByte/s
> * Radix 32: 0.55 GByte/s
> * Radix 64: 0.58 GByte/s
> POWER9:
> * Radix 26: 0.47 GByte/s
> * Radix 26 (2-way interleaved): 1.15 GByte/s
> * Radix 32: 0.52 GByte/s
> * Radix 64: 0.58 GByte/s
> Z15:
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 3.17 GByte/s
> * Radix 32: 0.82 GByte/s
> * Radix 64: 1.22 GByte/s

Interesting. I'm a bit surprised the radix-64 code doesn't perform better, in particular on arm64. (But I'm not yet familiar with arm64 multiply instructions). Numbers for 2-way interleaving are impressive; I'd like to understand how that works. Might be useful to derive the corresponding multiply throughput, i.e., the number of multiply operations (and with which multiply instruction) completed per cycle, as well as the total cycles per block.

It looks like the folding done per-block in the radix-64 code costs at least 5 or so cycles per block (since these operations are all dependent, and we also have the multiply by 5 in there, probably adding a few cycles more). Maybe at least the multiply can be postponed.

> I tried to compile the new code with -m32 flag on x86_64 but I got
> "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
> target".

That's expected, in two ways: I don't expect radix-64 to give any performance gain over radix-32 on any 32-bit archs. And I think __int128 is supported only on archs where it fits in two registers. If we start using __int128 we need a configure test for it, and then it actually makes things simpler, at least in this use case, if it stays unsupported on 32-bit archs where it shouldn't be used. So to compile with -m32, the radix-64 code must be #if:ed out.
> Also, I've disassembled the update function of Radix 64 and none of the
> architectures has made use of SIMD support (including x86_64 that hasn't
> used XMM registers which is standard for this arch, I don't know if gcc
> supports such behavior for C compiling but I'm aware that MSVC takes
> advantage of that standardization for further optimization on compiled C
> code).

The radix-64 code really wants multiply instruction(s) for 64x64 --> 128, and I think that's not so common in SIMD instruction sets (but powerpc64 vmsumudm looks potentially useful?). Either as a single instruction, or as a pair of mulhigh/mullow instructions. And some not too complicated way to do a 128-bit add with proper carry propagation in the middle.

Arm32 neon does have 32x32 --> 64, which looks like a good fit for the radix-32 variant.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
ni...@lysator.liu.se (Niels Möller) writes:

> The current C implementation uses radix 26, and 25 multiplies (32x32
> --> 64) per block. And quite a lot of shifts. A radix 32 variant
> analogous to the above would need 16 long multiplies and 4 short. I'd
> expect that to be faster on most machines, but I'd have to try that out.

I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C). This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:

* Radix 26: 1.2 GByte/s (old code)

* Radix 32: 1.3 GByte/s

* Radix 64: 2.2 GByte/s

It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.

For comparison, the current x86_64 asm version: 2.5 GByte/s.

If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products). I haven't really looked into implementing poly1305 with either floating point or SIMD.

To improve test coverage, I've also extended the poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755ace309. Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.

Regards, /Niels

/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.
   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program. If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
# include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;
#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s1 r.r64[2]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 hh

void
_nettle_poly1305_set_key (struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64 (key);
  t1 = LE_READ_UINT64 (key + 8);

  ctx->r0 = t0 & UINT64_C (0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C (0x0ffffffc0ffffffc);
  ctx->s1 = 5*(ctx->r1 >> 2);
  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned t2)
{
  uint64_t t0, t1;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) (t0 < ctx->h0) + ctx->h1 + LE_READ_UINT64(m+8);
  t1 = s;
  t2 += (s >> 64) + ctx->h2;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
     all the fk sums fit in 128 bits without overflow, with at least
     one bit margin.
   */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + t2 * ctx->s1
    + ((nettle_uint128_t)(t2 * ctx->r0) << 64);

  /* Fold high part of f1. */
  f0 += 5*(f1 >> 66);
  f1 &= ((nettle_uint128_t) 1 << 66) - 1;

  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
  assert (ctx->h2 <= 4);
}

/* Adds digest to the nonce */
void
_nettle_poly1305_digest (struct poly1305_ctx *ctx, union nettle_block16 *s)
{
  uint64_t t0, t1, t2, c1, mask, s0;
  t0 = ctx->h0;
  t1 = ctx->h1;
  t2 = ctx->h2;

  /* Compute resulting carries when adding 5. */
  c1 = t0 > -(UINT64_C(5));
  t2 += (t1 + c1 < c1);

  /* Set if H >= 2^130 - 5 */
  mask = - (t2 >> 2);
  t0 += mask & 5;
  t1 += mask & c1;

  /* FIXME: Take advan
Re: [Arm64, S390x] Optimize Chacha20
Maamoun TK writes:

> As far as I understand, SIMD is called Advanced SIMD on AArch64 and it's
> standard for this architecture. simd is enabled by default in GCC but it
> can be disabled with nosimd option as I can see in here
> https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html which is why I made
> a specific config option for it.

If it's present on all known aarch64 systems (and HWCAP_ASIMD flag always set), I think we can keep things simpler and use the code unconditionally, with no extra subdir, no fat build function pointers or configure flag.

I've pushed the merge button for the s390x merge request.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
Maamoun TK writes:

> Wider multiplication would improve the performance for 64-bit general
> registers but as the case for the current SIMD implementation, the radix
> 2^26 fits well there.

If multiply throughput is the bottleneck, it makes sense to do as much work as possible per multiply. So I don't think I understand the benefits of interleaving; can you explain?

Let's consider the 64-bit case, since that's less writing. B = 2^64 as usual. Then the state is

  H = h_2 B^2 + h_1 B + h_0

(with h_2 rather small, depending on how far we normalize for each block; let's assume at most 3 bits, or maybe even h_2 <= 4), and the key is

  R = r_1 B + r_0

By the spec, the high 4 bits of both r_0 and r_1, and the low 2 bits of r_1, are zero, which makes the multiplication R H (mod p) particularly nice. We get

  R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2

But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5 r_1 / 4,

  R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)

These are 4 long multiplications (64x64 --> 128) and two short (64x64 --> 64), for the products involving h_2. (The 32-bit version would be 16 long multiplications and 4 short). From the zero high bits, we also get bounds on these terms,

  f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122

  f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^125 + 5*2^61 + 2^127

So these two chains can be added together as 128-bit quantities with no overflow, in any order; there's plenty of parallelism. E.g., the power vmsumudm instruction might be useful.

For the final folding, we need to split f_1 into the top 62 and low 66 bits, multiply the top part by 5, and add it into f_0, which still fits in 128 bits. And then take the top 64 bits of f_0 and add them into f_1 (result fits in 67 bits, h_2 <= 4).

The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.

In contrast, trying to use a similar scheme for multiplying by (r^2 (mod p)), as needed for an interleaved version, seems more expensive. There are several contributions to the cost:

* First, the accumulation of products by power of B needs to take carries into account, as the result can exceed 2^128, so one would need something closer to general schoolbook multiplication.

* Second, since r^2 (mod p) may exceed 2^128, we need three words rather than two, so three more short multiplications to add in.

* Third, we can't pre-divide key words by 4, since the low bits are no longer guaranteed to be zero. This gives more expensive reduction, with more multiplies by 5.

The first two points make a smaller radix more attractive; if we need three words for both factors, we can distribute the bits to ensure some of the most significant bits are zero.

> Since the loop of block iteration is moved to inside the assembly
> implementation, computing one multiple of key at the function prologue
> should be ok.

For large messages, that's fine, but it may add a significant cost for messages of just two blocks.

Regards, /Niels
Re: [Arm64, PowerPC64, S390x] Optimize Poly1305
Maamoun TK writes:

> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.
>
> OpenSSL is still ahead in terms of performance since it uses 4-way
> interleaving, or maybe more!!
> Increasing the interleaving ways beyond two has nothing to do with
> parallelism, since the execution units are already saturated by using
> 2-ways for the three architectures. The reason behind the performance
> improvement is that the number of executions of the reduction procedure
> is cut in half for 4-way interleaving, since the products of multiplying
> state parts by key can be combined before the reduction phase. Let me
> know if you are interested in doing that in nettle!

Interesting. I haven't paid much attention to the poly1305 implementation since it was added back in 2013.

The C implementation doesn't try to use wider multiplication than 32x32 --> 64, which is poor for 64-bit platforms. Maybe we could use unsigned __int128, if we can write a configure test to check that it is available and likely to be efficient?

For most efficient interleaving, I take it one should precompute some powers of the key, similar to how it's done in the recent gcm code?

> It would be nice if the arm64 patch will be tested on big-endian mode since
> I don't have access to any big-endian variant for testing.

Merged this one too on a branch for ci testing.

Regards, /Niels
Re: [Arm64, S390x] Optimize Chacha20
Maamoun TK writes:

> I created merge requests that have improvements of Chacha20 for arm64 and
> s390x architectures by following the approach used in the powerpc
> implementation.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/37
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/40
> The patches have 80.85% speedup for arm64 arch and 284.79% speedup for
> s390x arch.

Nice, I've had a quick first look.

> It would be nice if the arm64 patch could be tested in big-endian mode,
> since I don't have access to any big-endian variant for testing.

I've merged the arm64 code to a branch, for CI testing.

For the ARM code, which instructions are provided by the asimd extension? Basic simd is always available, if I've understood correctly.

Regards,
/Niels
Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)
Amitay Isaacs writes:

> Compared to the current version in master branch, this version
> definitely improves the performance of the reduction code.
>
> On POWER9, the reduction code shows 7% speed up when tested separately.
>
> The improvement in P256 sign/verify is marginal. Here are the numbers
> from hogweed-benchmark on POWER9.
>
>  name     size    sign/ms  verify/ms
> ecdsa      256    11.1013     3.5713  (master)
> ecdsa      256    11.1527     3.6011  (this patch)

Thanks for testing. Committed to the master branch now.

Regards,
/Niels
Re: [PATCH 4/7] ecc: Add powerpc64 assembly for ecc_384_modp
Amitay Isaacs writes:

> diff --git a/powerpc64/ecc-secp384r1-modp.asm
> b/powerpc64/ecc-secp384r1-modp.asm
> new file mode 100644
> index ..67791f09
> --- /dev/null
> +++ b/powerpc64/ecc-secp384r1-modp.asm
> @@ -0,0 +1,227 @@
> +C powerpc64/ecc-secp384r1-modp.asm

This looks nice (and it seems the folding scheme is the same as for the x86_64 version). Just one minor thing,

> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_secp384r1_modp)
> +
> +	std	H0, -48(SP)
> +	std	H1, -40(SP)
> +	std	H2, -32(SP)
> +	std	H3, -24(SP)
> +	std	H4, -16(SP)
> +	std	H5, -8(SP)

I find it clearer to use register names rather than the m4 defines for save and restore of callee-save registers.

Regards,
/Niels
Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)
ni...@lysator.liu.se (Niels Möller) writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> I think it should be possible to reduce the number of needed registers,
>> and completely avoid using callee-save registers (load the values now in
>> U4-U7 one at a time, a bit closer to the place where they are needed),
>> and replace F3 with $1 in the FOLD and FOLDC macros.
>
> Attaching a variant to do this. Passes tests with qemu, but I haven't
> benchmarked it on any real hardware.

Would you like to test and benchmark this on relevant real hardware, before I merge this version? Code still below, and committed to the branch ppc-secp256-tweaks.

Regards,
/Niels

C powerpc64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation

   Based on x86_64/ecc-secp256r1-redc.asm

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

C Register usage:

define(`RP', `r4')
define(`XP', `r5')

define(`F0', `r3')
define(`F1', `r6')
define(`F2', `r7')
define(`T', `r8')

define(`U0', `r9')
define(`U1', `r10')
define(`U2', `r11')
define(`U3', `r12')

	.file "ecc-secp256r1-redc.asm"

C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLD', `
	sldi	F0, $1, 32
	srdi	F1, $1, 32
	subfc	F2, F0, $1
	subfe	$1, F1, $1
')

C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLDC', `
	sldi	F0, $1, 32
	srdi	F1, $1, 32
	addze	T, $1
	subfc	F2, F0, $1
	subfe	$1, F1, T
')

	C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp)
	.text
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_ecc_secp256r1_redc)
	ld	U0, 0(XP)
	ld	U1, 8(XP)
	ld	U2, 16(XP)
	ld	U3, 24(XP)

	FOLD(U0)
	ld	T, 32(XP)
	addc	U1, F0, U1
	adde	U2, F1, U2
	adde	U3, F2, U3
	adde	U0, U0, T

	FOLDC(U1)
	ld	T, 40(XP)
	addc	U2, F0, U2
	adde	U3, F1, U3
	adde	U0, F2, U0
	adde	U1, U1, T

	FOLDC(U2)
	ld	T, 48(XP)
	addc	U3, F0, U3
	adde	U0, F1, U0
	adde	U1, F2, U1
	adde	U2, U2, T

	FOLDC(U3)
	ld	T, 56(XP)
	addc	U0, F0, U0
	adde	U1, F1, U1
	adde	U2, F2, U2
	adde	U3, U3, T

	C If carry, we need to add in
	C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
	li	F0, 0
	addze	F0, F0
	neg	F2, F0
	sldi	F1, F2, 32
	srdi	T, F2, 32
	li	XP, -2
	and	T, T, XP

	addc	U0, F0, U0
	adde	U1, F1, U1
	adde	U2, F2, U2
	adde	U3, T, U3

	std	U0, 0(RP)
	std	U1, 8(RP)
	std	U2, 16(RP)
	std	U3, 24(RP)

	blr
EPILOGUE(_nettle_ecc_secp256r1_redc)
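[Editor's illustration, not part of the posted patch.] The constants in the FOLD comment can be checked directly: one redc step adds U0*p, which cancels the low limb because p = -1 (mod 2^64), and the multiplier applied to the remaining limbs is (p+1)/2^64 = 2^192 - 2^160 + 2^128 + 2^32. A Python sketch:

```python
# Illustration only: verify the secp256r1 redc folding constants.
import random

p = 2**256 - 2**224 + 2**192 + 2**96 - 1    # the secp256r1 prime
B = 2**64

assert p % B == B - 1                       # p = -1 (mod 2^64)
assert (p + 1) // B == 2**192 - 2**160 + 2**128 + 2**32

random.seed(3)
for _ in range(100):
    x = random.getrandbits(512)
    u0 = x % B
    folded = x + u0 * p                     # one FOLD/FOLDC step
    assert folded % B == 0                  # low limb always cancels
    assert folded % p == x % p              # residue unchanged
```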
Re: Build problem on ppc64be + musl
Going through some old mail... From a discussion in September:

ni...@lysator.liu.se (Niels Möller) writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> I've tried a different approach on branch
>> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
>> below. (It makes sense to me to have the new check together with the ABI
>> check, but on second thought, probably a mistake to overload the ABI
>> variable. It would be better to have a separate configure variable, more
>> similar to the W64_ABI).
>
> Another iteration, on that branch (sorry for the typo in the branch
> name), or see patch below.
>
> Stijn, can you try it out and see if it works for you?

I haven't seen any response to this, but I've nevertheless just added these changes on the master-updates branch. It would be nice if you could confirm that it solves the problem with musl.

Regards,
/Niels
Status update
Hi, just a heads up that I'll likely not be very responsive the next few weeks. I may or may not get some hacking time during the Christmas holidays. What I'd like to do when I get time:

* Review recent patches for powerpc ecc and sm4.

* Complete support for ANSI x9.62 (I'm not really up-to-date on the details, but the needed square root code is already in, and code to do the rest was posted by Wim Lewis long ago).

* Prepare a new release.

* Maybe write salsa20 and chacha assembly for more platforms.

But not necessarily in that order. Feel free to reply with suggested priorities, and remind me if there's something important that I've missed.

Regards,
/Niels
Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)
ni...@lysator.liu.se (Niels Möller) writes: > Thanks! Merged to master-updates for ci testing. And now merged to the master branch. > I think it should be possible to reduce number of needed registers, and > completely avoid using callee-save registers (load the values now in > U4-U7 one at a time a bit closer to the place where they are needed in), > and replace F3 with $1 in the FOLD and FOLDC macros. Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware. C powerpc64/ecc-secp256r1-redc.asm ifelse(` Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation Based on x86_64/ecc-secp256r1-redc.asm This file is part of GNU Nettle. GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either: * the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. or * the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. or both in parallel, as here. GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. 
') C Register usage: define(`RP', `r4') define(`XP', `r5') define(`F0', `r3') define(`F1', `r6') define(`F2', `r7') define(`T', `r8') define(`U0', `r9') define(`U1', `r10') define(`U2', `r11') define(`U3', `r12') .file "ecc-secp256r1-redc.asm" C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLD', ` sldiF0, $1, 32 srdiF1, $1, 32 subfc F2, F0, $1 subfe $1, F1, $1 ') C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLDC', ` sldiF0, $1, 32 srdiF1, $1, 32 addze T, $1 subfc F2, F0, $1 subfe $1, F1, T ') C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) .text define(`FUNC_ALIGN', `5') PROLOGUE(_nettle_ecc_secp256r1_redc) ld U0, 0(XP) ld U1, 8(XP) ld U2, 16(XP) ld U3, 24(XP) FOLD(U0) ld T, 32(XP) addcU1, F0, U1 addeU2, F1, U2 addeU3, F2, U3 addeU0, U0, T FOLDC(U1) ld T, 40(XP) addcU2, F0, U2 addeU3, F1, U3 addeU0, F2, U0 addeU1, U1, T FOLDC(U2) ld T, 48(XP) addcU3, F0, U3 addeU0, F1, U0 addeU1, F2, U1 addeU2, U2, T FOLDC(U3) ld T, 56(XP) addcU0, F0, U0 addeU1, F1, U1 addeU2, F2, U2 addeU3, U3, T C If carry, we need to add in C 2^256 - p = <0xfffe, 0xff..ff, 0x, 1> li F0, 0 addze F0, F0 neg F2, F0 sldiF1, F2, 32 srdiT, F2, 32 li XP, -2 and T, T, XP addcU0, F0, U0 addeU1, F1, U1 addeU2, F2, U2 addeU3, T, U3 std U0, 0(RP) std U1, 8(RP) std U2, 16(RP) std U3, 24(RP) blr EPILOGUE(_nettle_ecc_secp256r1_redc) > Regards, > /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH] doc: documentation for SM3 hash
Tianjia Zhang writes:

> Signed-off-by: Tianjia Zhang
> ---
>  nettle.texinfo | 74 --
>  1 file changed, 72 insertions(+), 2 deletions(-)

Thanks! Merged now.

Regards,
/Niels
Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)
Amitay Isaacs writes:

> On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in
> isolation, and ~1% speedup for ecdsa sign and verify over the earlier
> assembly version.

Thanks! Merged to master-updates for ci testing.

I think it should be possible to reduce the number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time, a bit closer to the place where they are needed), and replace F3 with $1 in the FOLD and FOLDC macros.

Regards,
/Niels
x86_64 ecc_256_redc (was: Re: ARM64 ecc_256_redc)
ni...@lysator.liu.se (Niels Möller) writes:

> I think the approach should apply to other 64-bit archs (should probably
> work also on x86_64, where it's sometimes tricky to avoid x86_64
> instructions clobbering the carry flag when it should be preserved, but
> probably not so difficult in this case).

x86_64 version below. I could also trim register usage, so it no longer needs to save and restore any registers.

On my machine, this gives a speedup of 17% for ecc_secp256r1_redc in isolation, 3% speedup for ecdsa sign and 7% speedup of ecdsa verify.

Regards,
/Niels

C x86_64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2013, 2021 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

	.file "ecc-secp256r1-redc.asm"

define(`RP', `%rsi')
define(`XP', `%rdx')

define(`U0', `%rdi')	C Overlaps unused modulo input
define(`U1', `%rcx')
define(`U2', `%rax')
define(`U3', `%r8')

define(`F0', `%r9')
define(`F1', `%r10')
define(`F2', `%r11')
define(`F3', `%rdx')	C Overlaps XP, used only in final carry folding

C FOLD(x), sets (x,F2,F1,F0) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLD', `
	mov	$1, F0
	mov	$1, F1
	mov	$1, F2
	shl	`$'32, F0
	shr	`$'32, F1
	sub	F0, F2
	sbb	F1, $1
')
C FOLDC(x), sets (x,F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLDC', `
	mov	$1, F0
	mov	$1, F1
	mov	$1, F2
	adc	`$'0, $1
	shl	`$'32, F0
	shr	`$'32, F1
	sub	F0, F2
	sbb	F1, $1
')

PROLOGUE(_nettle_ecc_secp256r1_redc)
	W64_ENTRY(3, 0)
	mov	(XP), U0
	FOLD(U0)
	mov	8(XP), U1
	mov	16(XP), U2
	mov	24(XP), U3
	add	F0, U1
	adc	F1, U2
	adc	F2, U3
	adc	32(XP), U0

	FOLDC(U1)
	add	F0, U2
	adc	F1, U3
	adc	F2, U0
	adc	40(XP), U1

	FOLDC(U2)
	add	F0, U3
	adc	F1, U0
	adc	F2, U1
	adc	48(XP), U2

	FOLDC(U3)
	add	F0, U0
	adc	F1, U1
	adc	F2, U2
	adc	56(XP), U3

	C Sum, including carry, is < 2^{256} + p.
	C If carry, we need to add in 2^{256} mod p = 2^{256} - p
	C   = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
	C and this addition can not overflow.
	sbb	F2, F2
	mov	F2, F0
	mov	F2, F1
	mov	XREG(F2), XREG(F3)
	neg	F0
	shl	$32, F1
	and	$-2, XREG(F3)

	add	F0, U0
	mov	U0, (RP)
	adc	F1, U1
	mov	U1, 8(RP)
	adc	F2, U2
	mov	U2, 16(RP)
	adc	F3, U3
	mov	U3, 24(RP)

	W64_EXIT(3, 0)
	ret
EPILOGUE(_nettle_ecc_secp256r1_redc)
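[Editor's illustration, not part of the posted code.] The constant in the final carry folding is easy to verify: the limbs of 2^256 mod p = 2^256 - p, most significant first, are 0xfffffffe, 0xff..ff, 0xffffffff00000000, 1 — exactly the values the code builds in F3, F2, F1, F0. A Python check:

```python
# Illustration only: limbs of 2^256 mod p = 2^256 - p for secp256r1,
# the value added back in when the last addition carries.
p = 2**256 - 2**224 + 2**192 + 2**96 - 1
d = 2**256 - p
limbs = [(d >> (64 * i)) & (2**64 - 1) for i in range(4)]   # little-endian
assert limbs == [1, 0xffffffff00000000, 0xffffffffffffffff, 0xfffffffe]
assert 2**256 % p == d              # 2^256 mod p really is 2^256 - p
```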
Re: ANNOUNCE: Serious bug in Nettle's ecdsa_verify - Critical Confirmation
"Jayakumar, Jaikanth" writes: > There is a small confusion, I believe the bug reported here > (https://lists.lysator.liu.se/pipermail/nettle-bugs/2021/009457.html) > is related to CVE-2021-20305, right ? and this (CVE-2021-20305) is > fixed in version 3.7.2. Which *two* problems are you asking about? The problem referred to as https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-20305 was fixed in nettle-3.7.2. Then there was a different problem, in RSA decryption, https://cve.mitre.org/cgi-bin/cvename.cgi?name=2021-3580, fixed in nettle-3.7.3. > In the case it is the same, it would help big time if the CVE was > mentioned somewhere in the bug announcement thread. I'll try to remember to mention relevant CVE ids in future release announcements. Would help to also document in the NEWS file? Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
ARM64 ecc_256_redc (was: Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc)
ni...@lysator.liu.se (Niels Möller) writes:

> I'm looking at a different approach (experimenting on ARM64, which is
> quite similar to powerpc, but I don't yet have working code). To
> understand what the redc code is doing we need to keep in mind that what
> one folding step does is to compute
>
>	+ U0*p
>
> which cancels the low limb, since p = -1 (mod 2^64). So since the low
> limb always cancels, what we need is
>
>	+ U0*((p+1)/2^64)
>
> The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} -
> p) * U0, subtracting in the folding step, and adding in the high part
> later. But one doesn't have to do it that way. One could instead use a
> FOLD macro that computes
>
>	(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
>
> I also wonder if there's some way to use carry out from one fold step
> and apply it at the right place while preparing the F0,F1,F2,F3 for the
> next step.

I've got this working now, attaching the version with early carry folding. Also checked in on the branch arm64-ecc. The preceding commit (5ee0839bb28c092044fce09534651b78640518c4) collects carries and adds them in as a separate pass over the data.

Tested only with qemu-aarch64; help with benchmarking on real arm64 hardware appreciated (just add the file in the arm64/ directory and run ./config.status --recheck && ./config.status to have the build pick it up).

I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).

C arm64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2013, 2021 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.
   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

	.file "ecc-secp256r1-redc.asm"

define(`RP', `x1')
define(`XP', `x2')

define(`U0', `x0')	C Overlaps unused modulo input
define(`U1', `x3')
define(`U2', `x4')
define(`U3', `x5')
define(`U4', `x6')
define(`U5', `x7')
define(`U6', `x8')
define(`U7', `x9')
define(`F0', `x10')
define(`F1', `x11')
define(`F2', `x12')
define(`F3', `x13')
define(`ZERO', `x14')

C FOLD(x), sets (F3,F2,F1,F0) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLD', `
	lsl	F0, $1, #32
	lsr	F1, $1, #32
	subs	F2, $1, F0
	sbc	F3, $1, F1
')
C FOLDC(x), sets (F3,F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLDC', `
	lsl	F0, $1, #32
	lsr	F1, $1, #32
	adc	F3, $1, ZERO	C May overflow, but final result will not.
	subs	F2, $1, F0
	sbc	F3, F3, F1
')

PROLOGUE(_nettle_ecc_secp256r1_redc)
	ldr	U0, [XP]
	ldr	U1, [XP, #8]
	ldr	U2, [XP, #16]
	ldr	U3, [XP, #24]
	ldr	U4, [XP, #32]
	ldr	U5, [XP, #40]
	ldr	U6, [XP, #48]
	ldr	U7, [XP, #56]
	mov	ZERO, #0

	FOLD(U0)
	adds	U1, U1, F0
	adcs	U2, U2, F1
	adcs	U3, U3, F2
	adcs	U4, U4, F3

	FOLDC(U1)
	adds	U2, U2, F0
	adcs	U3, U3, F1
	adcs	U4, U4, F2
	adcs	U5, U5, F3

	FOLDC(U2)
	adds	U3, U3, F0
	adcs	U4, U4, F1
	adcs	U5, U5, F2
	adcs	U6, U6, F3

	FOLDC(U3)
	adds	U4, U4, F0
	adcs	U5, U5, F1
	adcs	U6, U6, F2
	adcs	U7, U7, F3

	C Sum, including carry, is < 2^{256} + p.
	C If carry, we need to add in 2^{256} mod p = 2^{256} - p
	C   = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
	C and this addition can not overflow.
	adc	F0, ZERO, ZERO
	neg	F2, F0
	lsl	F1, F2, #32
	lsr	F3, F2, #32
	and	F3, F3, #-2

	adds	U0, F0, U4
	adcs	U1, F1, U5
	adcs	U2, F2, U6
	adc
Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc
ni...@lysator.liu.se (Niels Möller) writes:

> If this works, FOLD would turn into something like
>
>	sldi	F0, $1, 32
>	srdi	F1, $1, 32
>	subfc	F2, $1, F0
>	addme	F3, F1

I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing, we need to keep in mind that what one folding step does is to compute

	+ U0*p

which cancels the low limb, since p = -1 (mod 2^64). So since the low limb always cancels, what we need is

	+ U0*((p+1)/2^64)

The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes

	(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0

I also wonder if there's some way to use carry out from one fold step and apply it at the right place while preparing the F0,F1,F2,F3 for the next step.

Regards,
/Niels
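[Editor's illustration, not part of the original mail.] The whole redc can be modelled in a few lines: four folding steps each cancel the low limb and shift down one limb, computing x * 2^-256 mod p up to a final small correction. A hypothetical Python sketch (not the posted code):

```python
# Illustration only: model of the four secp256r1 redc folding steps.
import random

p = 2**256 - 2**224 + 2**192 + 2**96 - 1
B = 2**64
random.seed(4)
for _ in range(100):
    x = random.getrandbits(512)
    t = x
    for _ in range(4):
        u0 = t % B
        t = (t + u0 * p) // B       # low limb cancels, shift down one limb
    while t >= p:                   # final small correction
        t -= p
    assert t == x * pow(B, -4, p) % p   # result is x * 2^-256 mod p
```

The `pow(B, -4, p)` modular inverse needs Python 3.8 or later; the real assembly of course keeps everything in four 64-bit limbs instead of one big integer.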
Re: [PATCH v2 1/4] Add OSCCA SM3 hash algorithm
Tianjia Zhang writes:

> Hi Niels,
>
>> Would you mind writing a short description of the algorithm for the
>> manual? I think it should go under "Miscellaneous hash functions". Would
>> be nice with some brief background on this hash function (origin,
>> intended applications, when and where it's useful) plus reference docs
>> for the defined constants and functions.
>
> SM3 is a cryptographic hash function standard adopted by the
> government of the People's Republic of China, which was issued by the
> Cryptography Standardization Technical Committee of China on December
> 17, 2010. The corresponding standard is "GM/T 0004-2012 "SM3
> Cryptographic Hash Algorithm"".
>
> SM3 algorithm is a hash algorithm in ShangMi cryptosystems. SM3 is
> mainly used for digital signature and verification, message
> authentication code generation and verification, random number
> generation, etc.

Thanks for the background.

> Its algorithm is public. Combined with the public key
> algorithm SM2 and the symmetric encryption algorithm SM4, it can be
> used in various data security and network security scenarios such as
> the TLS 1.3 protocol, disk encryption, standard digital certificates,
> and digital signatures.

I think the above two sentences could be removed or shortened. I think the mention of TLS, with reference to RFC 8998, is the part most relevant for the Nettle manual. Besides that, I think your text provides the right level of detail.

> According to the State Cryptography
> Administration of China, its security and efficiency are equivalent to
> SHA-256.

This is relevant too.

> Thanks for your reminder, the above is the information I provided. Do
> I need to submit it to the document through PATCH?

If you can prepare a patch for nettle.texinfo, that would be ideal.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc
Amitay Isaacs writes:

> --- /dev/null
> +++ b/powerpc64/ecc-secp256r1-redc.asm
> @@ -0,0 +1,144 @@
> +C powerpc64/ecc-secp256r1-redc.asm
> +ifelse(`
> +   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
> +
> +   Based on x86_64/ecc-secp256r1-redc.asm

Looks good, and it seems the method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments of the x86_64 version. A few comments below.

> +C Register usage:
> +
> +define(`SP', `r1')
> +
> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`F0', `r3')
> +define(`F1', `r6')
> +define(`F2', `r7')
> +define(`F3', `r8')
> +
> +define(`U0', `r9')
> +define(`U1', `r10')
> +define(`U2', `r11')
> +define(`U3', `r12')
> +define(`U4', `r14')
> +define(`U5', `r15')
> +define(`U6', `r16')
> +define(`U7', `r17')

One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.

> +	.file "ecc-secp256r1-redc.asm"
> +
> +C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
> +define(`FOLD', `
> +	sldi	F2, $1, 32
> +	srdi	F3, $1, 32
> +	li	F0, 0
> +	li	F1, 0
> +	subfc	F0, F2, F0
> +	subfe	F1, F3, F1

I think the

	li	F0, 0
	li	F1, 0
	subfc	F0, F2, F0
	subfe	F1, F3, F1

could be replaced with

	subfic	F0, F2, 0	C "negate with borrow"
	subfze	F1, F3

Whether that is measurably faster, I can't say.

Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate into F0 and F1 at all, and instead change the later subtraction, replacing

	subfc	U1, F0, U1
	subfe	U2, F1, U2
	subfe	U3, F2, U3
	subfe	U0, F3, U0

with

	addc	U1, F0, U1
	adde	U2, F1, U2
	subfe	U3, F2, U3
	subfe	U0, F3, U0

I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper.
If this works, FOLD would turn into something like

	sldi	F0, $1, 32
	srdi	F1, $1, 32
	subfc	F2, $1, F0
	addme	F3, F1

(If you want to investigate this later on, that's fine too; I could merge the code with the current folding logic.)

> +	C If carry, we need to add in
> +	C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
> +	li	F0, 0
> +	addze	F0, F0
> +	neg	F2, F0
> +	sldi	F1, F2, 32
> +	srdi	F3, F2, 32
> +	li	U7, -2
> +	and	F3, F3, U7

I think the three instructions to set F3 could be replaced with

	srdi	F3, F2, 31
	sldi	F3, F3, 1

Or maybe the and operation is faster than shift?

Regards,
/Niels
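[Editor's illustration, not part of the review.] The "negate with borrow" suggestion is just the two-limb two's complement negation, with the borrow from the low limb propagated into the high limb. A small Python check of that limb-level identity:

```python
# Illustration only: two-limb negation via a borrow chain, the effect of
#   subfic F0, F2, 0   (0 - F2, setting carry/borrow)
#   subfze F1, F3      (0 - F3 - borrow)
import random

M = 2**64 - 1
random.seed(5)
for _ in range(1000):
    lo = random.getrandbits(64)     # F2
    hi = random.getrandbits(64)     # F3
    x = (hi << 64) | lo
    nlo = -lo & M                   # low limb of the negation
    borrow = 1 if lo != 0 else 0    # borrow out of the low limb
    nhi = (-hi - borrow) & M        # high limb, borrow propagated
    assert ((nhi << 64) | nlo) == -x % 2**128
```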
Re: [PATCH v2 1/4] Add OSCCA SM3 hash algorithm
Tianjia Zhang writes:

> Add OSCCA SM3 secure hash (OSCCA GM/T 0004-2012 SM3) generic
> hash transformation.

Thanks, merged the patch series onto a branch "sm3" for testing, with only minor changes.

> --- /dev/null
> +++ b/sm3.h
[...]
> +#define SM3_DIGEST_SIZE 32
> +#define SM3_BLOCK_SIZE 64
> +/* For backwards compatibility */
> +#define SM3_DATA_SIZE SM3_BLOCK_SIZE

I dropped the definition of SM3_DATA_SIZE; since this is a new feature in Nettle, there's no old version to be compatible with.

Would you mind writing a short description of the algorithm for the manual? I think it should go under "Miscellaneous hash functions". Would be nice with some brief background on this hash function (origin, intended applications, when and where it's useful) plus reference docs for the defined constants and functions.

Regards,
/Niels
Re: [PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp
Amitay Isaacs writes:

> +	.file "ecc-secp192r1-modp.asm"

Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance is not of so great importance). I'm quite new to powerpc, so I'm referring to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.

> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`T0', `r6')
> +define(`T1', `r7')
> +define(`T2', `r8')
> +define(`T3', `r9')
> +define(`C1', `r10')
> +define(`C2', `r11')

As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.

> +	C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
> +	.text
> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_secp192r1_modp)
> +	ld	T0, 0(XP)
> +	ld	T1, 8(XP)
> +	ld	T2, 16(XP)
> +
> +	li	C1, 0
> +	li	C2, 0
> +
> +	ld	T3, 24(XP)
> +	addc	T0, T3, T0
> +	adde	T1, T3, T1
> +	addze	T2, T2
> +	addze	C1, C1
> +
> +	ld	T3, 32(XP)
> +	addc	T1, T3, T1
> +	adde	T2, T3, T2
> +	addze	C1, C1
> +
> +	ld	T3, 40(XP)
> +	addc	T0, T3, T0
> +	adde	T1, T3, T1
> +	adde	T2, T3, T2
> +	addze	C1, C1

To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as representing the number

  B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0

The accumulation above, as I understand it, computes

  <c_1, t_2, t_1, t_0> = <a_2, a_1, a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1)

or more graphically,

      a_2 a_1 a_0
          a_3 a_3
      a_4 a_4
    + a_5 a_5 a_5
    -------------
  c_1 t_2 t_1 t_0

This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it). This looks nice, and I think it is pretty efficient too.
It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.

> +	addc	T0, C1, T0
> +	adde	T1, C1, T1
> +	addze	T2, T2
> +	addze	C2, C2

Above, c_1 is folded in at the right places,

  <c_2, t_2, t_1, t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)

This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.

> +	li	C1, 0
> +	addc	T0, C2, T0
> +	adde	T1, C2, T1
> +	addze	T2, T2
> +	addze	C1, C1
> +
> +	addc	T0, C1, T0
> +	adde	T1, C1, T1
> +	addze	T2, T2

So I think this final folding could be reduced to just

	addc	T0, C2, T0
	adde	T1, C2, T1
	addze	T2, T2

There's no carry out from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?

> +	std	T0, 0(RP)
> +	std	T1, 8(RP)
> +	std	T2, 16(RP)
> +
> +	blr
> +EPILOGUE(_nettle_ecc_secp192r1_modp)

Regards,
/Niels
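[Editor's illustration, not part of the review.] The limb-level folding analyzed above is easy to model: with B = 2^64 and p = B^3 - B - 1 we have B^3 = B + 1, B^4 = B^2 + B, and B^5 = B^2 + B + 1 (mod p), and folding the carry limb with (B + 1) finishes the job. A Python sketch:

```python
# Illustration only: limb-level model of the secp192r1 modp folding.
import random

B = 2**64
p = B**3 - B - 1                    # the secp192r1 prime
random.seed(6)
for _ in range(100):
    a = [random.getrandbits(64) for _ in range(6)]
    x = sum(ai * B**i for i, ai in enumerate(a))

    t = (a[0] + a[1] * B + a[2] * B**2      # <a_2, a_1, a_0>
         + a[3] * (B + 1)                   # B^3 = B + 1 (mod p)
         + a[4] * (B**2 + B)                # B^4 = B^2 + B
         + a[5] * (B**2 + B + 1))           # B^5 = B^2 + B + 1
    assert t % p == x % p

    c1, t = t >> 192, t & (2**192 - 1)      # split off the carry limb
    t += c1 * (B + 1)                       # fold c_1
    c2, t = t >> 192, t & (2**192 - 1)
    t += c2 * (B + 1)                       # fold c_2; no further carry
    assert t < B**3 and t % p == x % p
```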
Re: [PATCH 0/7] Add powerpc64 assembly for elliptic curves
Amitay Isaacs writes:

> This series of patches adds the powerpc64 assembly for modp/redc
> functions for elliptic curves P192, P224, P256, P384, P521, X25519 and
> X448. It results in 15-30% performance improvements as measured on a
> POWER9 system using hogweed-benchmark.

Nice. For testing these functions, I recommend running

  while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done

and

  while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done

for a few hours.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH 0/4] Introduce OSCCA SM3 hash algorithm
Tianjia Zhang writes: > You can refer to the ISO specification here: > https://www.iso.org/standard/67116.html > Or PDF version: > https://github.com/alipay/tls13-sm-spec/blob/master/sm-en-pdfs/sm3/GBT.32905-2016.SM3-en.pdf I see that RFC 8998 refers to http://www.gmbz.org.cn/upload/2018-07-24/1532401392982079739.pdf, which looks like the same pdf file. I find it a bit odd that the document carries no information on author or organization. > The specification does not define the reference implementation of the > algorithm. This series of patches mainly refers to the SM3 > implementation in libgcrypt and gnulib. It looks like the gcrypt implementation is licensed under LGPLv2.1 or later (see https://github.com/gpg/libgcrypt/blob/master/cipher/sm3.c), so it should be fine to copy into nettle (in contrast to gnulib code, which appears to be GPLv3, and would need explicit permission from the copyright holder before relicensing). But if it is a derived work of libgcrypt, in the sense of copyright law, the copyright header needs to acknowledge that, i.e., Copyright (C) 2017 Jia Zhang Copyright (C) 2021 Tianjia Zhang Or did you write both versions, with Jia being an alternate form of your name? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH 0/4] Introduce OSCCA SM3 hash algorithm
Tianjia Zhang writes: > Add OSCCA SM3 secure hash generic hash algorithm, described > in OSCCA GM/T 0004-2012 SM3. Thanks, I've had a first quick look, and it looks nice. I don't know much about this hash function, though. A few questions: * Is there some reasonably authoritative English reference for the algorithm? I checked wikipedia, and it only links to an old internet draft, https://tools.ietf.org/id/draft-oscca-cfrg-sm3-02.html * The name "sm3" is a bit short, would it make sense to add some family-prefix, maybe "oscca_sm3"? * Do you have some examples of protocols or applications that specify the use of sm3? * The implementation: is it written from scratch, or is it based on some reference implementation? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [PATCH] Curve point decompression
ni...@lysator.liu.se (Niels Möller) writes: > Wim Lewis writes: > >> Now that 3.5.1 is out, is there a chance this could be looked at? > Not sure in which order to do things. Maybe it will be best to first add > the square root routines, with tests, and then add functions for > converting between points and octet strings (and related utilities, if > needed). I have added sqrt functions on the branch ecc-sqrt (sorry for a forced update since previous attempt). So this is now on top of the changes to the inversion improvements from last year. All the secpxxxr1 curves are supported, but not the gost curves. Tests pass (I have additional changes to enable randomized tests that I'd like to commit in a few days), except that sqrt(0) fails for the secp224 curve, where the implementation uses the full Tonelli-Shanks algorithm. I'm looking at the algorithm description in Cohen's book (A course in computational algebraic number theory), and it seems to not work for this case. If we need sqrt(0), it must be handled as a special case. Also, unlike the other square root functions, it seems tricky to make the secp224r1 square root function side-channel silent. But I expect the main use case of point decompression is for public input (secrets in elliptic curve crypto tend to be scalars, not points), right? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
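[The sqrt(0) corner case mentioned above is easy to see in a generic Tonelli-Shanks implementation. A Python sketch — my own toy version, not Nettle's code — where the main loop searches for the order of t = a^q, which never terminates for a = 0, so zero must be special-cased up front:]

```python
def sqrt_mod(a, p):
    """Square root mod an odd prime p; a must be 0 or a quadratic residue."""
    if a == 0:
        return 0   # special case: the loop below never terminates for a = 0
    # Write p - 1 = q * 2^s with q odd.
    q, s = p - 1, 0
    while q % 2 == 0:
        q, s = q // 2, s + 1
    # Find a quadratic non-residue z by trial.
    z = 2
    while pow(z, (p - 1) // 2, p) != p - 1:
        z += 1
    m, c = s, pow(z, q, p)
    t, r = pow(a, q, p), pow(a, (q + 1) // 2, p)
    while t != 1:
        # Find the least i, 0 < i < m, with t^(2^i) = 1.
        i, t2 = 0, t
        while t2 != 1:
            t2, i = t2 * t2 % p, i + 1
        b = pow(c, 1 << (m - i - 1), p)
        m, c = i, b * b % p
        t, r = t * c % p, r * b % p
    return r

# secp224r1 prime: p - 1 = 2^96 (2^128 - 1), so s = 96 and the full
# algorithm is needed (no cheap p = 3 mod 4 shortcut).
p224 = 2**224 - 2**96 + 1
r = sqrt_mod(5 * 5 % p224, p224)
assert r * r % p224 == 25
```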
secp256r1 mod functions
Hi, a while ago I was asked to explain the 64-bit C versions of ecc_secp256r1_modp and ecc_secp256r1_modq (in ecc-secp256r1.c), and I found that a bit difficult. I've rewritten them, on branch https://git.lysator.liu.se/nettle/nettle/-/blob/secp256r1-mod/ecc-secp256r1.c. Main difference is handling of the case that next quotient is close to 2^{64}: Old code allowed the quotient to overflow 64 bits, using an additional carry variable q2. New code ensures that next quotient is always at most 2^{64} - 1. For the new implementation, the modp function is a special case of the 2/1 division in https://gmplib.org/~tege/division-paper.pdf (would usually need 3/2 division to get sufficient accuracy, but reduces to 2/1 since the next most significant word of p is 0), and the modq function is a special case of divappr2, described in https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf. I've not been able to measure any significant difference in speed (I get somewhat noisy measurements from the examples/ecc-benchmark tool), although I would expect the new code to be very slightly faster. These functions are not that performance critical, since the bulk of the reductions for this curve is done using redc, not mod. Any additional testing, benchmarking, or code staring, is appreciated. I will likely merge the new code to the master branch in a few days. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
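[The 2/1 division with a precomputed reciprocal can be sketched in Python — my own illustration of the algorithm from the division paper, not the actual C code — specialized to the most significant limb of the secp256r1 prime. Note that the quotient estimate always fits in a single 64-bit word, which is the property the rewrite relies on:]

```python
import random

B = 2**64
MASK = B - 1
# Most significant limb of p = 2^256 - 2^224 + 2^192 + 2^96 - 1;
# its top bit is set, i.e., it is already normalized.
d = 0xffffffff00000001
v = (B * B - 1) // d - B    # precomputed reciprocal, fits in 64 bits

def div2by1(u1, u0):
    """Divide the two-limb number <u1, u0> by d, assuming u1 < d."""
    q = v * u1 + (u1 << 64 | u0)     # 128-bit candidate <q1, q0>
    q1, q0 = q >> 64, q & MASK
    q1 = (q1 + 1) & MASK             # quotient estimate, never > B - 1
    r = (u0 - q1 * d) & MASK
    if r > q0:                       # unsigned comparison, as in C
        q1 = (q1 - 1) & MASK
        r = (r + d) & MASK
    if r >= d:                       # at most one final adjustment
        q1 += 1
        r -= d
    return q1, r

random.seed(2)
for _ in range(1000):
    u1, u0 = random.randrange(d), random.randrange(B)
    assert div2by1(u1, u0) == divmod(u1 << 64 | u0, d)
```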
Re: [S390x] Optimize SHA3 permute using vector facility
Maamoun TK writes: > I've added a new patch that optimizes SHA3 permute function for S390x > architecture https://git.lysator.liu.se/nettle/nettle/-/merge_requests/36 > More about the patch in merge request description. Really nice speedup, and interesting that it's significantly faster than your previous version using the special sha3 instructions. I'm sorry the existing implementations are quite hard to follow, with irregular data movements and rather unstructured comments. It must have been a bit challenging to decipher the x86_64 version. Do you have any ideas on how to improve documentation and comments? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Structural fixes to the manual
I've spent some time to improve structure (mostly non-text changes) of the manual. 1. Deleted all explicit node pointers in nettle.texinfo, instead letting makeinfo infer the node structure. This is the recommended way these days, according to texinfo documentation. 2. Changed the make rules producing nettle.pdf to use texi2pdf, instead of the chain texi2dvi + dvips + pstopdf. Most obvious result is that hyperlinks work better, and output file is slightly smaller. It's done in whatever way is default in texi2pdf, I haven't tried to check the details (e.g., what kind of fonts are used, and if they're all embedded in the file). 3. Split the huge Cipher functions node into one node per cipher. 4. Fixed a few places where urls or example code was too wide for the page. According to the docs (https://www.gnu.org/software/texinfo/manual/texinfo/html_node/URL-Line-Breaking.html), line breaks should be automatically added in urls when needed (and that's true also according to the docs for texinfo-6.7, which is what I have installed), but that didn't work at all when I tried it, so I've added a few explicit hints on how to break long urls. Also the @urefbreakstyle command wasn't recognized at all. Anyone here more familiar with texinfo that can explain? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize SHA1 with fat build support
Maamoun TK writes: > I got almost 12% speedup of optimizing the sha3_permute() function using > the SHA hardware accelerator of s390x, is it worth adding that assembly > implementation? For such a small assembly function, I think it's worth the effort (more questionable if it was worth adding the special instructions for it...). If you have the time, you could also try out doing it with vector registers, like on x86_64 and arm/neon. Some difficulties in the x86_64 implementation were (i) xmm register shortage, (ii) moving 64-bit pieces between the 128-bit xmm registers, and (iii) rotating the 64-bit pieces of an xmm register by different shift counts. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Feature request: OCB mode
ni...@lysator.liu.se (Niels Möller) writes:

> If someone wants to work on it, please post to the list. I might look
> into it myself, but as you have noticed, I have rather limited hacking
> time.

I've given it a try, see branch ocb-mode. Based on RFC 7253. Passes
tests, but not particularly optimized. Some comments and questions:

1. Most of the operations use only the encrypt function of the
underlying block cipher. Except ocb decrypt, which needs *both* the
decrypt function and the encrypt function. For ciphers that use
different key setup for encrypt and decrypt, e.g., AES, that means that
to decrypt OCB one needs to initialize two separate aes128_ctx, and
call the somewhat unwieldy

  void ocb_decrypt (struct ocb_ctx *ctx, const struct ocb_key *key,
                    const void *encrypt_ctx, nettle_cipher_func *encrypt,
                    const void *decrypt_ctx, nettle_cipher_func *decrypt,
                    size_t length, uint8_t *dst, const uint8_t *src);

2. It's not obvious how to best manage the different L_i values. They
can be computed upfront, on demand, or cached in some way. Current code
computes only L_*, L_$ and L_0 up front (part of ocb_set_key), and the
others are recomputed each time they're needed.

3. The processing of the authenticated data doesn't depend on the nonce
in any way. That means that if one processes several messages with the
same key and associated data, the associated data can be processed
once, with the same sum reused for all messages. Is that something that
is useful in practice, and which nettle interfaces should support?

4. The way the nonce is used seems designed to allow cheap incrementing
of the nonce. The nonce is used to determine

  Offset_0 = Stretch[1+bottom..128+bottom]

where "bottom" is the least significant 6 bits of the nonce, acting as
a shift, and "Stretch" is independent of those nonce bits, so unchanged
on all but one out of 64 nonce increments. Should nettle support some
kind of auto-incrementing nonce that takes advantage of this? Nettle
does something similar for UMAC (not sure if there are others).

As I said, current code is not particularly optimized, but OCB has
potential to be quite fast. The per-block processing for authentication
of the message (not associated data) is just an XOR. And
encryption/decryption can be done several blocks in parallel, like CTR
mode. If we do, e.g., 4 or 8 blocks at a time, there will be a fairly
regular structure of the needed Offset_i values, possibly making them
cheaper to set up, but I haven't yet looked into those details.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
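[The L_i doubling chain discussed in point 2 can be sketched in Python — an illustration of double() from RFC 7253, not the branch code:]

```python
MASK128 = (1 << 128) - 1

def double(s):
    # double() from RFC 7253: multiply by x in GF(2^128), reducing by
    # x^128 + x^7 + x^2 + x + 1, i.e., shift left one bit and
    # conditionally xor the constant 0x87.
    s <<= 1
    if s > MASK128:
        s = (s & MASK128) ^ 0x87
    return s

def setup_l(l_star, count):
    """L_$ = double(L_*), L_0 = double(L_$), L_i = double(L_{i-1})."""
    l_dollar = double(l_star)
    l = [double(l_dollar)]
    for _ in range(1, count):
        l.append(double(l[-1]))
    return l_dollar, l

# The conditional xor only triggers when the top bit is set:
assert double(1) == 2
assert double(1 << 127) == 0x87
```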
CBC-AES (was: Re: [S390x] Optimize AES modes)
ni...@lysator.liu.se (Niels Möller) writes:

> I've also added a cbc-aes128-encrypt.asm.
> That gives more significant speedup, almost 60%. I think main reason for
> the speedup is that we avoid reloading subkeys between blocks.

I've continued this path, see branch aes-cbc. The aes128 variant is at
https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes128-encrypt.asm

Benchmark results are positive but a bit puzzling. On my laptop (AMD
Ryzen 5) I get

  aes128 ECB encrypt  5450.18

This is the latest version, doing two blocks per iteration.

  aes128 CBC encrypt   547.34

The general CBC mode written in C, with one call to aes128_encrypt per
block. 10(!) times slower than ECB.

  cbc_aes128 encrypt   865.11

The new assembly function. Almost 60% speedup over the old code, which
is nice, and large enough that it seems motivated to have the new
function. But still 6 times slower than ECB. I'm not sure why.

Let's look a bit closer at cycle numbers. Not sure I get accurate cycle
numbers (it's a bit tricky with variable features and turbo modes and
whatnot), but it looks like ECB mode is 6 cycles per block, which would
be consistent with issue of two aesenc instructions per cycle. While
the CBC mode is 37 cycles per block, almost 4 cycles per aesenc. This
could be explained if (i) latency of aesenc is 3-4 cycles, and (ii) the
processor's out-of-order machinery results in as many as 7-8 blocks
processed in parallel when executing the ECB loop, i.e., instruction
issue for 3-4 iterations through the loop before the results of the
first iteration are ready.

The interface for the new function is

  struct cbc_aes128_ctx CBC_CTX(struct aes128_ctx, AES_BLOCK_SIZE);

  void cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx,
                          size_t length, uint8_t *dst, const uint8_t *src);

I'm not that fond of the struct cbc_aes128_ctx though, which includes
both (constant) subkeys and iv. So I'm considering changing that to

  void cbc_aes128_encrypt(const struct aes128_ctx *ctx, uint8_t *iv,
                          size_t length, uint8_t *dst, const uint8_t *src);

I.e., similar to cbc_encrypt, but without the arguments
nettle_cipher_func *f, size_t block_size.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
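[The ECB/CBC gap discussed above is consistent with a simple latency/throughput model. The numbers here are assumptions for illustration, not measurements: 10 rounds for aes128, aesenc latency 4 cycles, 2 aesenc issued per cycle:]

```python
ROUNDS = 10       # aes128 does 10 rounds, one aesenc each
LATENCY = 4       # assumed aesenc latency in cycles
PER_CYCLE = 2     # assumed aesenc issue rate

# CBC encryption is one serial dependency chain: each aesenc, and each
# block, must wait for the previous result, so latency dominates.
cbc_cycles_per_block = ROUNDS * LATENCY

# ECB blocks are independent, so only throughput limits the rate,
# provided the out-of-order window spans LATENCY * PER_CYCLE = 8
# in-flight aesenc results, i.e., several blocks in parallel.
ecb_cycles_per_block = ROUNDS / PER_CYCLE

print(cbc_cycles_per_block, ecb_cycles_per_block)
```

Under these assumed parameters the model gives 40 vs 5 cycles per block, the same order of gap as the measured 37 vs 6.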
Re: Build problem on ppc64be + musl
ni...@lysator.liu.se (Niels Möller) writes:

> I've tried a different approach on branch
> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
> below. (It makes sense to me to have the new check together with the ABI
> check, but on second thought, probably a mistake to overload the ABI
> variable. It would be better to have a separate configure variable, more
> similar to the W64_ABI).

Another iteration, on that branch (sorry for the typo in the branch
name), or see patch below. Stijn, can you try it out and see if it
works for you?

Regards,
/Niels

diff --git a/config.m4.in b/config.m4.in
index d89325b8..b98a5817 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -5,6 +5,7 @@ define(`COFF_STYLE', `@ASM_COFF_STYLE@')dnl
 define(`TYPE_FUNCTION', `@ASM_TYPE_FUNCTION@')dnl
 define(`TYPE_PROGBITS', `@ASM_TYPE_PROGBITS@')dnl
 define(`ALIGN_LOG', `@ASM_ALIGN_LOG@')dnl
+define(`ELFV2_ABI', `@ELFV2_ABI@')dnl
 define(`W64_ABI', `@W64_ABI@')dnl
 define(`RODATA', `@ASM_RODATA@')dnl
 define(`WORDS_BIGENDIAN', `@ASM_WORDS_BIGENDIAN@')dnl
diff --git a/configure.ac b/configure.ac
index ebec8759..2ed4ab4e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -311,6 +311,9 @@ AC_SUBST([GMP_NUMB_BITS])
 
 # Figure out ABI. Currently, configurable only by setting CFLAGS.
 ABI=standard
+ELFV2_ABI=no # For powerpc64
+W64_ABI=no   # For x86_64 windows
+
 case "$host_cpu" in
   [x86_64 | amd64])
     AC_TRY_COMPILE([
@@ -355,6 +358,15 @@ case "$host_cpu" in
 ], [
   ABI=64
 ])
+if test "$ABI" = 64 ; then
+  AC_TRY_COMPILE([
+#if _CALL_ELF == 2
+#error ELFv2 ABI
+#endif
+  ], [], [], [
+    ELFV2_ABI=yes
+  ])
+fi
 ;;
   aarch64*)
     AC_TRY_COMPILE([
@@ -750,7 +762,6 @@ IF_DLL='#'
 LIBNETTLE_FILE_SRC='$(LIBNETTLE_FORLINK)'
 LIBHOGWEED_FILE_SRC='$(LIBHOGWEED_FORLINK)'
 EMULATOR=''
-W64_ABI=no
 
 case "$host_os" in
   mingw32*|cygwin*)
@@ -1031,6 +1042,7 @@ AC_SUBST(ASM_TYPE_FUNCTION)
 AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
+AC_SUBST(ELFV2_ABI)
 AC_SUBST(W64_ABI)
 AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
diff --git a/powerpc64/machine.m4 b/powerpc64/machine.m4
index 187a49b8..b59f0863 100644
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -1,7 +1,7 @@
 define(`PROLOGUE',
 `.globl C_NAME($1)
 DECLARE_FUNC(C_NAME($1))
-ifelse(WORDS_BIGENDIAN,no,
+ifelse(ELFV2_ABI,yes,
 `ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 C_NAME($1):
 addis 2,12,(.TOC.-C_NAME($1))@ha
@@ -17,7 +17,7 @@ ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 undefine(`FUNC_ALIGN')')
 
 define(`EPILOGUE',
-`ifelse(WORDS_BIGENDIAN,no,
+`ifelse(ELFV2_ABI,yes,
 `.size C_NAME($1), . - C_NAME($1)',
 `.size .C_NAME($1), . - .C_NAME($1)
 .size C_NAME($1), . - .C_NAME($1)')')

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Reorganization of x86_64 aesni code
I've merged a reorganization of the x86_64 aesni code to the master-updates branch for testing. This replaces the x86_64/aesni/aes-*crypt-internal.asm files with separate files for the different key sizes, as has been discussed earlier. And I've implemented 2-way interleaving, i.e., doing 2 blocks at a time, which gave a nice speedup on the order of 15% in my tests. It may be worthwhile to go to 3-way or 4-way, but I don't plan to try that soon. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Feature request: OCB mode
Justus Winter writes: > we (Sequoia PGP) would love to see OCB being implemented in Nettle. The > OpenPGP working group is working on a revision of RFC4880, which will > mostly be a cryptographic refresh, and will bring AEAD to OpenPGP. > > The previous -now abandoned- draft called for EAX being mandatory, and > OCB being optional [0]. This was motivated by OCB being encumbered by > patents. However, said patents were waived by the holder [1]. > > 0: > https://datatracker.ietf.org/doc/html/draft-ietf-openpgp-rfc4880bis-10#section-9.6 > 1: https://mailarchive.ietf.org/arch/msg/cfrg/qLTveWOdTJcLn4HP3ev-vrj05Vg/ That's good news, I hadn't seen that. Then OCB gets a lot more interesting. And https://datatracker.ietf.org/doc/html/rfc7253 is a proper reference (there seems to be a couple of different versions of OCB)? > Unfortunately, we don't have the expertise in our team to contribute a > patch, and we currently aren't in a position to offer funding for the > implementation. If someone wants to work on it, please post to the list. I might look into it myself, but as you have noticed, I have rather limited hacking time. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Big endian tests (no mips)
Maamoun TK writes: > On Mon, Aug 23, 2021 at 8:59 PM Niels Möller wrote: > >> I would like to keep testing on big-endian. s390x is big-endian, right? >> And so is powerpc64 (non -el). So it would be nice to configure cross >> tests on one of those platforms configured with --disable-assembler, to >> test portability of the C code. Are s390x cross tools and qemu-user in >> good enough shape (it's an official debian release arch), or is >> powerpc64 a better option? >> > > Yes, s390x is big-endian and it's good for such purposes. Along being > officially supported in debian releases, it runs natively on remote > instance in gitlab CI. I've just added an s390x cross-build to the gitlab ci, with --disable-assembler to exercise all #if WORDS_BIGENDIAN. I noticed that for some of the archs (powerpc64, powerpc64el, s390x, i.e., the ones not used in gnutls tests) we don't have any cross libgmp-dev packages preinstalled in the image, and since we don't explicitly install them either, there's no test coverage of public key functions in these builds. I'll see if I can fix that. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Big endian tests (no mips) (was: Re: Build problem on ppc64be + musl)
ni...@lysator.liu.se (Niels Möller) writes: > Unfortunately, the CI cross builds aren't working at the moment (the > buildenv images are based on Debian Buster ("stable" at the time images > were built), and nettle's ci scripts do apt-get update and apt-get > install, which now attempts to get Bullseye packages (new "stable" since > a week ago)). Images now updated to debian stable (thanks, Daiki!). But we'll have to drop mips tests for now, since the current setup assumes archs under test are available in debian, and mips has been discontinued as a debian release architecture. Other cross builds now work (change to drop mips is on the master-updates branch). If you have ideas on how to revive mips tests, that's welcome, but for now we'll have to do without. I would like to keep testing on big-endian. s390x is big-endian, right? And so is powerpc64 (non -el). So it would be nice to configure cross tests on one of those platforms configured with --disable-assembler, to test portability of the C code. Are s390x cross tools and qemu-user in good enough shape (it's an official debian release arch), or is powerpc64 a better option? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Build problem on ppc64be + musl
Maamoun TK writes:

> That's right, in little-endian systems I got "#define _CALL_ELF 2" while
> in big-endian ones that value is 1 except when using musl.

That's good.

> I've updated the patch in the branch
> https://git.lysator.liu.se/mamonet/nettle/-/tree/ppc64_musl_fix to
> exploit this distinction.

I've tried a different approach on branch
https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
below. (It makes sense to me to have the new check together with the
ABI check, but on second thought, probably a mistake to overload the
ABI variable. It would be better to have a separate configure variable,
more similar to the W64_ABI).

Unfortunately, the CI cross builds aren't working at the moment (the
buildenv images are based on Debian Buster ("stable" at the time images
were built), and nettle's ci scripts do apt-get update and apt-get
install, which now attempts to get Bullseye packages (new "stable"
since a week ago)).

Regards,
/Niels

diff --git a/config.m4.in b/config.m4.in
index d89325b8..2ac19a84 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -5,6 +5,7 @@ define(`COFF_STYLE', `@ASM_COFF_STYLE@')dnl
 define(`TYPE_FUNCTION', `@ASM_TYPE_FUNCTION@')dnl
 define(`TYPE_PROGBITS', `@ASM_TYPE_PROGBITS@')dnl
 define(`ALIGN_LOG', `@ASM_ALIGN_LOG@')dnl
+define(`ABI', `@ABI@')dnl
 define(`W64_ABI', `@W64_ABI@')dnl
 define(`RODATA', `@ASM_RODATA@')dnl
 define(`WORDS_BIGENDIAN', `@ASM_WORDS_BIGENDIAN@')dnl
diff --git a/configure.ac b/configure.ac
index ebec8759..0efa5795 100644
--- a/configure.ac
+++ b/configure.ac
@@ -353,8 +353,15 @@ case "$host_cpu" in
 ], [], [
   ABI=32
 ], [
-  ABI=64
-])
+  AC_TRY_COMPILE([
+#if _CALL_ELF == 2
+#error ELFv2 ABI
+#endif
+  ], [], [
+    ABI=64v1
+  ], [
+    ABI=64v2
+  ])])
 ;;
   aarch64*)
     AC_TRY_COMPILE([
@@ -514,7 +521,7 @@ if test "x$enable_assembler" = xyes ; then
     fi
     ;;
   *powerpc64*)
-    if test "$ABI" = 64 ; then
+    if test "$ABI" != 32 ; then
       GMP_ASM_POWERPC_R_REGISTERS
       asm_path="powerpc64"
       if test "x$enable_fat" = xyes ; then
@@ -1032,6 +1039,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ABI)
 AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 AC_SUBST(ASM_X86_ENDBR)
diff --git a/powerpc64/machine.m4 b/powerpc64/machine.m4
index 187a49b8..60c7465d 100644
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -1,7 +1,7 @@
 define(`PROLOGUE',
 `.globl C_NAME($1)
 DECLARE_FUNC(C_NAME($1))
-ifelse(WORDS_BIGENDIAN,no,
+ifelse(ABI,64v2,
 `ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 C_NAME($1):
 addis 2,12,(.TOC.-C_NAME($1))@ha
@@ -17,7 +17,7 @@ ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 undefine(`FUNC_ALIGN')')
 
 define(`EPILOGUE',
-`ifelse(WORDS_BIGENDIAN,no,
+`ifelse(ABI,64v2,
 `.size C_NAME($1), . - C_NAME($1)',
 `.size .C_NAME($1), . - .C_NAME($1)
 .size C_NAME($1), . - .C_NAME($1)')')

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Build problem on ppc64be + musl
Maamoun TK writes: > config.guess detects the C standard library based on a result from the > compiler defined in "CC_FOR_BUILD" variable, for some reason OpenWrt build > system failed to set that variable properly, from your config.log I can see > CC_FOR_BUILD='gcc -O -g' but when I use bare musl tools I get > CC_FOR_BUILD='musl-gcc' In Nettle's Makefiles, CC_FOR_BUILD is intended to be a compiler targetting the *build* system, used to compile things like eccdata.c that are run on the build system as part of the build. It's intended to be different from CC when cross compiling. Not entirely sure how CC_FOR_BUILD is used in config.guess, but I think it is used to detect the system type of the build system. > There is nothing specific in the output of powerpc64-openwrt-linux-musl-gcc > -E -dM log as I can see. In musl libc FAQ, they stated that there is no > __MUSL__ in the preprocessor macros https://wiki.musl-libc.org/faq.html The interesting thing I see is #define _CALL_ELF 2 I hope this can be used to distinguish from other big-endian systems, that use ELFv1 abi? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize SHA1 with fat build support
Maamoun TK writes: > What is x86/sha1-compress.nlms? How can I implement nettle_compress_n > function for that particular type? That's an input file for an obscure "loop mixer" tool; IIRC it was written mainly by David Harvey for use with GMP loops. This tool tries permuting the instructions of an assembly loop, taking dependencies into account, benchmarks each variant, and tries to find the fastest instruction sequence. It seems I tried this tool on x86 sha1_compress back in 2009, on an AMD K7, and it gave a 17% speedup at the time, according to the commit message for 1e757582ac7f8465b213d9761e17c33bd21ca686. So you can just ignore this file. And you may want to look at the more readable version of x86/sha1_compress.asm, just before that commit. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Build problem on ppc64be + musl
David Edelsohn writes: > Musl Libc does not support ELFv1, so I don't understand how this > configuration is possible. If I understood the original report, musl always uses ELFv2 abi, for both little and big endian configurations. Which for big endian is incompatible with the way powerpc64 assembly is configured in nettle. Nettle assembly files currently use ELFv2 on little endian, but always uses ELFv1 on big endian. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Build problem on ppc64be + musl
Maamoun TK writes: > Forcing ELFv2 abi doesn't work for big-endian mode as this mode has no > support for ELFv2. ppc64 linux big-endian is deprecated, it's not unexpected > to get such issues. Dropping big-endian support for powerpc could be an > option to solve this issue but that will be a drawback for AIX (BE) systems. The configuration where it didn't work was powerpc64-openwrt-linux-musl. I'd like Nettle to work on embedded systems whenever practical. But support depends on assistance from users of those systems. As I understood it, this system needs to use the v2 ABI. I would hope it's easy to detect the abi used by the configured C compiler, and then select the same prologue sequence as is currently used for little-endian. I.e., one more configure test, and changing the "ifelse(WORDS_BIGENDIAN,no," condition in powerpc64/machine.m4 to check a different configure variable. I don't know how the linker detected abi incompatibility (ld error message like "gcm-hash.o: ABI version 1 is not compatible with ABI version 2 output"), if that's based just on the presence of the special ".opd" section, or if there are other attributes in the ELF file, and if so, how the assembler decides which attributes to attach. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize SHA1 with fat build support
Maamoun TK writes: > I made a merge request in the main repository that optimizes SHA1 for s390x > architecture with fat build support !33 > <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33>. Regarding the discussion on https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33#note_10005: It seems the sha1 instructions on s390x are fast enough that the overhead of loading constants, and loading and storing the state, all per block, is a significant cost. I think it makes sense to change the internal convention for _sha1_compress so that it can do multiple blocks. There are currently 5 assembly implementations that would need updating: arm/v6, arm64/crypto, x86, x86_64 and x86_64/sha_ni. And the C implementation, of course. If it turns out to be too large a change to do them all at once, one could introduce some new _sha1_compress_n function or the like, and use when available. Actually, we probably need to do that anyway, since for historical reasons, _nettle_sha1_compress is a public function, and needs to be kept (as just a simple C wrapper) for backwards compatibility. Changing it incrementally should be doable but a bit hairy. There are some other similar compression functions with assembly implementation, for md5, sha256 and sha512. But there's no need to change them all at the same time, or at all. Regarding the MD_UPDATE macro, that one is defined in the public header file macros.h (which in retrospect was a mistake). So it's probably best to leave it unchanged. New macros for the new convention should be put into some internal header, e.g., md-internal.h. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Is there an equivalent to curve25519_mul for ECC keys?
Nicolas Mora writes: > I'm wondering if there is a function of a combination of functions to > perform a DH computation using ECC keys and their parameters "struct > ecc_point *pub1, struct ecc_scalar *key2"? ecc_point_mul (declared in ecc.h) is intended to do that. There's also a variant ecc_point_mul_g. But it seems they're not properly documented in the manual. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Fat build support for AES and GHASH
Maamoun TK writes: > I've applied for your change requests. I think we're ready to merge the > s390x branch at this point, let me know if there are conflicts with the > master branch tho. Merged to master branch now! Had to commit some minor fixes to make "make dist" and the s390x ci build work, and added a brief ChangeLog entry for latest additions. For the memxor merge requests, it would be good to retarget to the master branch (but I'm not sure how to do that in gitlab). Regards, /Niels > regards, > Mamone -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Fat build support for AES and GHASH
Maamoun TK writes: > I've applied for your change requests. I think we're ready to merge the > s390x branch at this point, let me know if there are conflicts with the > master branch tho. Fixes merged, thanks! I'll try out merging the s390x branch into master(-updates), to see if there are any difficulties. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize GHASH
Maamoun TK writes: > You are right, modern operating systems are supposed to have this > functionality but accessing some program's memory is pretty easy nowadays, > I think it's a good practice to clean behind the cipher functions where > it makes sense and whenever possible. I think it's futile to try to do that thoroughly, e.g., code generated by the compiler will not clear each stack frame on return (and I'm not even aware of any compiler option to generate code like that). We have to trust the operating system (where as usual, "trust" can also be read as "depend on"). For the specific case of key material, it might make sense to go to a little extra effort to not leave copies in memory, but other nettle code doesn't do that. > In another topic, I've optimized the SHA-512 algorithm for arm64 > architecture but it turned out all CFarm variants don't support SHA-512 > crypto extension so I can't do any performance or correctness testing for > now. Do you know any CFarm alternative that supports SHA-512 and SHA3 > extensions for arm64 architectures? Can you do correctness tests on qemu? (I've been using a cross-compiler and qemu-user to test other ARM code, and that's also what the ci tests do). I have access to the systems listed on https://gmplib.org/devel/testsystems, is any of those applicable? The arm64 machines available include one Cortex-A73 and one Apple M1. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Fat build support for AES and GHASH
Maamoun TK writes: > I created a MR !31 > <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/31> that adds > fat build support of AES and GHASH for S390x architecture. The MR's > description has a brief overview of the modifications done to add the fat > build support. Merged, thanks! I wrote some comments asking for two followup changes (avoid inline asm, and setting of FAT_TEST_LIST). Do you think we're getting ready to merge the s390x branch to master? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [Aarch64] Optimize AES
Maamoun TK writes: > I made this patch operate AES ciphering with fixed key sizes of 128-bit, > 192-bit, and 256-bit, in this case I eliminated the loading process of key > expansion for every round. Since this technique produces performance > benefits, I'm planning to keep the implementation as is and in case > handling uncommon key sizes is mandatory, I can append an additional branch to > process message blocks with any key size. What do you think? There's no need to support non-standard key sizes. _nettle_aes_encrypt should only ever be called with one of the constants _AES128_ROUNDS, _AES192_ROUNDS, _AES256_ROUNDS as the first argument. I think it's becoming clearer that we should make assembly for _nettle_aes_encrypt optional, in favor of separate entry points for aes{128,192,256}_{en,de}crypt. I think you or I had an experimental branch to do that. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize GHASH
Maamoun TK writes: > My concern is if the program > terminates then the operating system will deallocate the program's stack > without clearing its content so that leftover data will remain somewhere in > the RAM which could be a subject for a memory allocation or dumping by > other programs. I think the kernel is responsible for clearing that memory before handing it out to a new process. If it didn't, that would be a huge security problem. I'm fairly sure operating systems do this correctly. (And I would be a bit curious to know of any exceptions, maybe some embedded or ancient systems don't do it?) Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize GHASH
Maamoun TK writes: > Any update on this patch? I think we have reached the merging stage of this > patch if there are no further queries. Merged, thanks! >> I'm thinking it's also worth it to wipe the authentication tag and the >> leftover bytes of input data from the stack. Leaving out the output >> authentication tag in the stack is never a good idea and in case of >> processing AAD the input data is left in the clear so leaving leftover >> bytes in the stack may reveal potential secret data. I've pushed another >> commit to wipe the whole parameter block content (authentication tag and >> hash subkey) and the leftover bytes of input data. Other nettle functions don't do that, it's generally assumed that the running program is trustworthy, and that the operating system protects the data from non-trustworthy processes. I think using encrypted swap (using an ephemeral key destroyed on shutdown) is a good idea. To me, it makes some sense for nettle to wipe the copy of the key (since the application might wipe the context struct and expect no copies to remain), but probably overkill for the other data. But it shouldn't hurt either. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
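[Editorial note: for the "wipe the copy of the key" case discussed above, a common portable C idiom (a generic sketch, not necessarily what nettle does internally) is to call memset through a volatile function pointer, which keeps the compiler from eliminating the clear as a dead store:]

```c
#include <string.h>
#include <stddef.h>

/* Calling memset via a volatile function pointer prevents the
   compiler from proving the store is dead and optimizing the wipe
   away, which a plain memset at end-of-scope is prone to. */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

static void wipe(void *p, size_t n)
{
  memset_v(p, 0, n);
}
```

Platforms that provide explicit_bzero (BSD, glibc >= 2.25) or memset_s (C11 Annex K) can use those directly instead.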
Re: [AArch64] Fat build support for SHA-256 compress
Maamoun TK writes: > I made a merge request that adds fat build support for SHA-256 compress > function !29 <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/29> Thanks, merged! Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [AArch64] Optimize SHA-256 compress
Maamoun TK writes: > I made a merge request !28 > <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/28> to the > 'arm64-sha1' branch that optimizes SHA-256 compress function, I've added a > brief description of the patch in addition to benchmark numbers in the MR > description. A patch for fat build support will be followed in another > merge request. Thanks, merged now! Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize GHASH
Maamoun TK writes: > I made a merge request !26 > <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/26> that > optimizes the GHASH algorithm for S390x architecture. Nice! I've added a few comments in the mr. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [Aarch64] Fat build support for SHA1 compress
Maamoun TK writes: > This patch added fat build support for the SHA1 compress function using the regular > HWCAP features. Thanks, merged to the arm64-sha1 branch for testing. The patch in the email didn't apply cleanly, there was some breakage with added newline characters etc. Maybe try as an attachment next time (or create a merge request). Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: ANNOUNCE: Nettle-3.7.3
ni...@lysator.liu.se (Niels Möller) writes: > I've prepared a new bug-fix release of Nettle, a low-level > cryptographic library, to fix bugs in the RSA decryption functions. The > bugs cause crashes on certain invalid inputs, which could be used > for denial of service attacks on applications using these functions. I forgot to reference the CVE id allocated for this problem: CVE-2021-3580 (at the moment still in the "reserved" state). Thanks to Simo Sorce and Redhat for that registration. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
ANNOUNCE: Nettle-3.7.3
I've prepared a new bug-fix release of Nettle, a low-level cryptographic library, to fix bugs in the RSA decryption functions. The bugs cause crashes on certain invalid inputs, which could be used for denial of service attacks on applications using these functions. More details in NEWS file below. Upgrading is strongly recommended. The Nettle home page can be found at https://www.lysator.liu.se/~nisse/nettle/, and the manual at https://www.lysator.liu.se/~nisse/nettle/nettle.html. The release can be downloaded from https://ftp.gnu.org/gnu/nettle/nettle-3.7.3.tar.gz ftp://ftp.gnu.org/gnu/nettle/nettle-3.7.3.tar.gz https://www.lysator.liu.se/~nisse/archive/nettle-3.7.3.tar.gz Regards, /Niels NEWS for the Nettle 3.7.3 release This is a bugfix release, fixing bugs that could make the RSA decryption functions crash on invalid inputs. Upgrading to the new version is strongly recommended. For applications that want to support older versions of Nettle, the bug can be worked around by adding a check that the RSA ciphertext is in the range 0 < ciphertext < n, before attempting to decrypt it. Thanks to Paul Schaub and Justus Winter for reporting these problems. The new version is intended to be fully source and binary compatible with Nettle-3.6. The shared library names are libnettle.so.8.4 and libhogweed.so.6.4, with sonames libnettle.so.8 and libhogweed.so.6. Bug fixes: * Fix crash for zero input to rsa_sec_decrypt and rsa_decrypt_tr. Potential denial of service vector. * Ensure that all of rsa_decrypt_tr and rsa_sec_decrypt return failure for out of range inputs, instead of either crashing, or silently reducing input modulo n. Potential denial of service vector. * Ensure that rsa_decrypt returns failure for out of range inputs, instead of silently reducing input modulo n. * Ensure that rsa_sec_decrypt returns failure if the message size is too large for the given key. 
Unlike the other bugs, this would typically be triggered by invalid local configuration, rather than by processing untrusted remote data. -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. signature.asc Description: PGP signature ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
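[Editorial note: the workaround described in the NEWS entry, checking 0 < ciphertext < n before calling the decrypt function, can be done on the big-endian byte strings directly. A sketch with a made-up helper name; with GMP-level access one would compare with mpz_cmp instead:]

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 iff the big-endian integer c satisfies 0 < c < n, where
   c and n are given as equal-length byte strings. For equal lengths,
   memcmp on the big-endian representation is a numeric comparison. */
static int rsa_input_in_range(const unsigned char *c,
                              const unsigned char *n, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    if (c[i] != 0)
      break;
  if (i == len)
    return 0;                    /* c == 0 */
  return memcmp(c, n, len) < 0;  /* c < n */
}
```

An application would pad the incoming ciphertext to the modulus length, run this check, and reject the input before ever calling rsa_decrypt / rsa_decrypt_tr / rsa_sec_decrypt on an affected Nettle version.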
Re: [Aarch64] Optimize SHA1 Compress
Maamoun TK writes: >> Great speedup! Any idea why openssl is still slightly faster? >> > > Sure, the OpenSSL implementation uses a loop inside the SHA1 update function which > eliminates the constant initialization and state loading/storing for each > block while nettle does that for every block iteration. I see, that can make a difference if the actual compressing is fast enough. > Modifying the message words in-place will change the value used by > 'sha1su0' and 'sha1su1' instructions. According to the ARM® A64 Instruction Set > Architecture: > SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S > where <Vd> is the name of the SIMD source and destination register. > SHA1SU1 <Vd>.4S, <Vn>.4S > where <Vd> is the name of the SIMD source and destination register. > So using the TMP variable is necessary here. I can't think of any replacement, > let me know how the other implementations handle this case. I'm afraid I have no concrete suggestion, I would need to read up on the aarch64 instructions. Implementations that do only a single round at a time (e.g., the C implementation) use a 16-word circular buffer for the message expansion state, and update one of the words per round. If I read the latest patch correctly, you also don't keep any state besides the MSGx registers? > It would be nice to either make the TMP registers more temporary (i.e., >> no round depends on the value in these registers from previous rounds) >> and keep needed state only on the MSG variables. Or rename them to give >> a better hint on how they're used. >> > > Done! Yields a slight performance increase btw. Nice. > We can load all the constants (including duplicate values) from memory with > one instruction. The issue is how to get the data address properly for > every supported abi! > The easiest solution is to define > the data in the .text section to make sure the address is near enough to be > loaded with a certain instruction. Do you want to do that? 
Using .text would probably work, even if it's in some sense more correct to put the constants in rodata segment. But let's leave as is for now. > We have an intensive discussion about that in the GCM patch. The short > story, this patch should work well for both endianness modes. Sounds good. I've pushed the combined patches to a branch arm64-sha1. Would you like to update the fat build setup, before merging to master? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes: > I've added test cases to verify that unwrap fails if the input values > are incorrect [1]. I reuse all the unwrap test cases, changed one > ciphertext byte and expect the unwrap function to return 0. I've merged the latest version of https://git.lysator.liu.se/nettle/nettle/-/merge_requests/19 to the master-updates branch, with some minor changes (moved function typedefs out of nettle-types.h, and indentation fixes to nist-keywrap.h). Thanks for your contribution and patience. >> Or possibly under "7.3 Cipher modes", if it's too different from the >> AEAD constructions. >> > Until we come to a solution on where to put the documentation, I've > started a first draft for the documentation. Can you give me feedback > on it? I think putting it under cipher modes probably makes the most sense. The function reference looks good, it doesn't have to be a lot of text. Please spell "cipher" consistently, not "cypher". In the introduction, you write "Its intention is to provide an algorithm to wrap and unwrap cryptographic keys.". Is it possible to give a bit more detail, some guidance on when it is a good idea to use this key wrapping rather than a more general AEAD algorithm? If there's some interesting background, or examples of protocols that use aes keywrap, that could also go here. I think it would also be nice to clarify that the spec defines the key wrapping as an aes-specific mode, but Nettle's implementation supports any block cipher with a block size of 16 bytes. > Also, I've never used LaTex. What tool do you recommend to write LaTex > documentation? I've tried gummi but it says there are errors in the > nettle.texinfo file... Texinfo is not quite the same as LaTeX, even if it uses the same TeX machinery for the typeset pdf version. Manual is here: https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html, but I think you can mostly go by the examples elsewhere in the Nettle manual, and check the docs only for the markup you need. 
You probably need to grasp the @node thing, though. See https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html#Writing-a-Node (the nettle manual uses the old-fashioned way with explicit node links). I edit it in emacs, like any other file. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [Aarch64] Optimize SHA1 Compress
Maamoun TK writes: Looks pretty good. A few comments and questions below. > This patch optimizes the SHA1 compress function for the arm64 architecture by > taking advantage of the SHA-1 instructions of the Armv8 crypto extension. > The SHA-1 instructions: > SHA1C: SHA1 hash update (choose) > SHA1H: SHA1 fixed rotate > SHA1M: SHA1 hash update (majority) > SHA1P: SHA1 hash update (parity) > SHA1SU0: SHA1 schedule update 0 > SHA1SU1: SHA1 schedule update 1 Can you add this brief summary of instructions as a comment in the asm file?
> Benchmark on gcc117 instance of CFarm before applying the patch:
> Algorithm      mode     Mbyte/s
> sha1           update   214.16
> openssl sha1   update   849.44
> Benchmark on gcc117 instance of CFarm after applying the patch:
> Algorithm      mode     Mbyte/s
> sha1           update   795.57
> openssl sha1   update   849.25
Great speedup! Any idea why openssl is still slightly faster? > +define(`TMP0', `v21') > +define(`TMP1', `v22') Not sure I understand how these are used, but it looks like the TMP variables are used in some way for the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place. It would be nice to either make the TMP registers more temporary (i.e., no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
> +C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
> +
> +PROLOGUE(nettle_sha1_compress)
> +C Initialize constants
> +    mov    w2, #0x7999
> +    movk   w2, #0x5A82, lsl #16
> +    dup    CONST0.4s, w2
> +    mov    w2, #0xEBA1
> +    movk   w2, #0x6ED9, lsl #16
> +    dup    CONST1.4s, w2
> +    mov    w2, #0xBCDC
> +    movk   w2, #0x8F1B, lsl #16
> +    dup    CONST2.4s, w2
> +    mov    w2, #0xC1D6
> +    movk   w2, #0xCA62, lsl #16
> +    dup    CONST3.4s, w2
Maybe it would be clearer or more efficient to load these from memory? 
Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
> +C Load message
> +    ld1    {MSG0.16b, MSG1.16b, MSG2.16b, MSG3.16b}, [INPUT]
> +
> +C Reverse for little endian
> +    rev32  MSG0.16b, MSG0.16b
> +    rev32  MSG1.16b, MSG1.16b
> +    rev32  MSG2.16b, MSG2.16b
> +    rev32  MSG3.16b, MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), that means we always get the wrong order, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4w on big-endian)? It's going to be a pretty small fraction of the per-block processing. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
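[Editorial note: the 16-word circular buffer mentioned above is the standard SHA-1 schedule recurrence, W[t] = ROL1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]), stored back into slot t mod 16. A generic sketch of that technique, not taken from nettle's source:]

```c
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* One step of SHA-1 message expansion with a 16-word circular buffer:
   for round t >= 16, computes W[t] and stores it where W[t-16] was,
   so no state beyond the 16 message words is needed. */
static uint32_t sha1_expand(uint32_t W[16], unsigned t)
{
  uint32_t w = W[(t - 3) & 15] ^ W[(t - 8) & 15]
             ^ W[(t - 14) & 15] ^ W[(t - 16) & 15];
  W[t & 15] = rol32(w, 1);
  return W[t & 15];
}
```

The SIMD sha1su0/sha1su1 pair computes the same recurrence four words at a time, which is why an in-place update is harder there: each instruction overwrites a whole vector of schedule words while later rounds still need some of them.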
Re: [S390x] Optimize AES modes
ni...@lysator.liu.se (Niels Möller) writes: > We could either switch it on by default in configure.ac, or add a > configure flag in .gitlab-ci. Just pushed a change to .gitlab-ci to pass --enable-s390x-msa, and it seems to work, see https://gitlab.com/gnutls/nettle/-/jobs/1284895250#L580 Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [RFC PATCH 0/6] Introduce combined AES-GCM assembly for POWER9+
"Christopher M. Riedl" writes: > So in total, if we assume an ideal (but impossible) zero-cost version > for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector > load/stores we can only account for 11.82 cycles/block; leaving 4.97 > cycles/block as an additional benefit of the combined implementation. One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generating the counter values on the fly in the appropriate registers. >> Another potential overhead is that data is stored to memory when passed >> between these functions. It seems we store a block 3 times, and load a >> block 4 times (the additional accesses should be cache friendly, but >> will still cost some execution resources). Optimizing that seems to need >> some kind of combined function. But maybe it is sufficient to optimize >> something a bit more general than aes gcm, e.g., aes ctr? > > This would basically have to replace the nettle_crypt16 function call > with arch-specific assembly, right? I can code this up and try it out in > the context of AES-GCM. Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is essentially the call _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src); with const void *cipher and nettle_cipher_func *f. It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place. So we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. 
Maybe it's better to do

  void aes128_gcm_encrypt(...)
  {
    _nettle_aes128_gcm_ctr(...);
    _nettle_gcm_hash(...);
  }

At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain. I wonder if there are any reasonable alternatives with similar performance? One idea that occurs to me is to replace the role of the gcm_fill function (and the nettle_fill16_func type) with an arch-specific, assembly-only hook interface that gets inputs in specified registers, and is expected to produce the next cipher input in registers. We could then have an aes128_any_encrypt that takes the same args as aes128_encrypt + a pointer to such a magic assembly function. The aes128_any_encrypt assembly would then put required input in the right registers (address of clear text, current counter block, previous ciphertext block, etc) and have a loop where each iteration calls the hook, and encrypts a block from registers. But I'm afraid it's not going to be so easy, given that, where possible (i.e., all modes but cbc encrypt), we would like to have the option to do multiple blocks in parallel. Perhaps better to have an assembly interface to functions doing ECB on one block, two blocks, three blocks (if there are a sufficient number of registers), etc, in registers, and call that from the other assembly functions. A bit like the recent chacha_Ncore functions, but with input and output in registers rather than stored in memory. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [Aarch64] Optimize SHA1 Compress
Maamoun TK writes: > I've written the patch from scratch while keeping in mind how to use the > SHA-1 instructions of Arm64 crypto extension from sha1-arm.c in Jeffrey's > repository. If that is the case, avoid phrases like "based on" which are easily misread as implying it's a derived work in the copyright sense. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize AES modes
Maamoun TK writes: > Did you get the credentials of the new VM? Yes! I set it up and updated the gitlab config last evening, and I've seen a successful ci run. > I'm thinking after adding the > address and ssh key of new VM, we can't get the optimized cores of AES > tested since enable-msa isn't triggered. We need to push some sort of > hard-coded option in configure.ac to get it tested in the VM during ci job. We could either switch it on by default in configure.ac, or add a configure flag in .gitlab-ci. Once fat build support is added, we will no longer need to enable it explicitly, right? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: S390x other modes and memxor
Maamoun TK writes: > This is great information that I can keep in my memory for next > implementations. s390x arch offers the 'xc' instruction "Storage-to-storage > XOR" at a maximum length of 256 bytes but we can do as many iterations as we > need. I optimized memxor using that instruction as it achieves the optimal > performance for such a case, I'll attach the patch at the end of the > message. Nice! I'd like to merge this as soon as the s390x ci is up and running again. > Unfortunately, I couldn't manage to optimize memxor3 using the 'xc' instruction > because while it supports the overlapped operands it processes them from > left to right, one byte at a time. Hmm, I wonder if there's some way to work around that. > However, I think optimizing just memxor could give a good sense of how much > it would increase the performance of AES modes. CBC mode could come in > handy here since it uses memxor in encrypt and decrypt operations in case > the operands of the decrypt operation don't overlap. Here is the benchmark > result of CBC mode:
>
>                                      | AES-128 Encrypt | AES-128 Decrypt |
> CBC-Accelerator                      | 1.18 cbp        | 0.75 cbp        |
> Basic AES-Accelerator                | 13.50 cbp       | 3.34 cbp        |
> Basic AES-Accelerator with memxor    | 15.50           | 1.57            |
>
This seems to confirm that cbc encrypt is the operation that gains the most from assembly for the combined operation. That aes decrypt can also gain a factor two in performance, does that mean that both aes-cbc and memxor run at speed limited by memory bandwidth? And then the gain is from one less pass loading and storing data from memory? What unit is "cbp"? If it's cycles per byte, 0.77 cycles/byte for memxor (the cost of "Basic AES-Accelerator with memxor" minus cost of CBC-Accelerator) sounds unexpectedly slow, compared to, e.g., x86_64, where I get 0.08 cycles per byte (regardless of alignment), or 0.64 cycles per 64-bit word. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. 
Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
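[Editorial note: for reference, the contract an 'xc'-based memxor has to match is simply dst ^= src over n bytes. A byte-wise sketch, only pinning down the semantics; nettle's real implementations work a word at a time, and the s390x 'xc' instruction handles up to 256 bytes per iteration:]

```c
#include <stddef.h>
#include <stdint.h>

/* Reference semantics of memxor: dst[i] ^= src[i] for n bytes.
   Optimized versions (word-at-a-time C, s390x 'xc', x86_64 SSE, ...)
   must match this byte-for-byte. */
static void *memxor_ref(void *dst, const void *src, size_t n)
{
  uint8_t *d = dst;
  const uint8_t *s = src;
  for (size_t i = 0; i < n; i++)
    d[i] ^= s[i];
  return dst;
}
```

Since XOR is its own inverse, applying the same src twice restores dst, a handy property for quick self-tests of an optimized version.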
S390x other modes and memxor (was: Re: [S390x] Optimize AES modes)
Maamoun TK writes: > On Sat, May 1, 2021 at 6:11 PM Niels Möller wrote: > >> Is https://git.lysator.liu.se/nettle/nettle/-/merge_requests/23 still >> the current code? >> > > I've added the basic AES-192 and AES-256 too since there is no problem to > test them all together. Merged to the s390x branch now. Thanks for your patience. For further improvement, it would be nice to have aesN_set_encrypt_key and aesN_set_decrypt_key be two entrypoints to the same function. But that will make the file replacement logic a bit more complex. And maybe the public aes*_invert_key functions should be marked as deprecated (and deleted, next time we have an abi break)? No other ciphers in Nettle have this feature, and it's not that useful for applications. From codesearch.debian.net, it looks like they are exposed by the haskell and rust bindings, though. > For the other modes, Before doing the other modes, do you think you could investigate if memxor and memxor3 can be sped up? That should benefit many ciphers and modes, and give more relevant speedup numbers for specialized functions like aes cbc and aes ctr. The best strategy depends on whether or not unaligned memory access is possible and efficient. All current implementations do aligned writes to the destination area (and smaller writes if needed at the edges). For the C implementation and several of the asm implementations, they also do aligned reads, and use shifting to get inputs xored together at the right places. While the x86_64 implementation uses unaligned reads, since that seems as efficient, and reduces complexity quite a lot. On all platforms I'm familiar with, assembly implementations can assume that it is safe to read a few bytes outside the edge of the input buffer, as long as those reads don't cross a word boundary (corresponding to valgrind option --partial-loads-ok=yes). Ideally, memxor performance should be limited by memory/cache bandwidth (with data in L1 cache probably being the most important case. 
It looks like nettle-benchmark calls it with a size of 10 KB). Note that memxor3 must process data in descending address order, to support the call from cbc_decrypt, with overlapping operands. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
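[Editorial note: a byte-wise reference for the memxor3 behavior described above, processing in descending address order; an illustrative sketch, not the optimized code:]

```c
#include <stddef.h>
#include <stdint.h>

/* memxor3 reference: dst[i] = a[i] ^ b[i], processed from the highest
   address downward. The descending order is what cbc_decrypt's
   overlapping in-place call depends on, where the destination overlaps
   one of the source operands. */
static void *memxor3_ref(void *dst, const void *a, const void *b, size_t n)
{
  uint8_t *d = dst;
  const uint8_t *ap = a, *bp = b;
  while (n-- > 0)
    d[n] = ap[n] ^ bp[n];
  return dst;
}
```

This descending order is exactly what rules out the s390x 'xc' instruction here: 'xc' defines overlap behavior left-to-right, one byte at a time, the opposite direction.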
Re: [S390x] Optimize AES modes
David Edelsohn writes: > Thanks for setting this up. The default accounts have a limited time > (90 days?). For long-term CI access, I can help request a long-term > account for Nettle. Hi, I set up the s390x vm for Nettle ci tests late March. What information do you need to arrange an extension to long-term access, so it doesn't expire? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize AES modes
Maamoun TK writes: > Hi Niels, hope you are doing well now > Any update on this patch? Thanks, I'm feeling a lot better, although still a bit tired. Is https://git.lysator.liu.se/nettle/nettle/-/merge_requests/23 still the current code? I hope to be back to reviewing pending patches soon, but I also got a fairly serious bug report a few days ago that I need to attend to first. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [RFC PATCH 0/6] Introduce combined AES-GCM assembly for POWER9+
"Christopher M. Riedl" writes: > An implementation combining AES+GCM _can potentially_ yield significant > performance boosts by allowing for increased instruction parallelism, avoiding > C-function call overhead, more flexibility in assembly fine-tuning, etc. This > series provides such an implementation based on the existing optimized Nettle > routines for POWER9 and later processors. Benchmark results on a POWER9 > Blackbird running at 3.5GHz are given at the end of this mail. Benchmark results are impressive. If I get the numbers right, cycles per block (16 bytes) is reduced from 40 to 22.5. You can run nettle-benchmark with the flag -f 3.5e9 (for a 3.5GHz clock frequency) to get cycle numbers in the output. I'm a bit conservative about adding assembly code for combined operations, since it can lead to an explosion in the amount of code to maintain. So I'd like to understand a bit better where the 17.5 saved cycles were spent. For the code on master, gcm_encrypt (with aes) is built from these building blocks:

* gcm_fill: C code, essentially 2 64-bit stores per block. On little endian, it also needs some byte swapping.

* aes_encrypt: Uses power assembly. Performance measured as the "aes128 ECB encrypt" line in nettle-benchmark output.

* memxor3: This is C code on power (and rather hairy C code). Performance can be measured with nettle-benchmark, and it's going to be a bit alignment dependent.

* gcm_hash: This uses power assembly. Performance is measured as the "gcm update" line in nettle-benchmark output. From your numbers, this seems to be 7.3 cycles per block.

So before going all the way with a combined aes_gcm function, I think it's good to try to optimize the building blocks. Please benchmark memxor3, to see if it could benefit from an assembly implementation. If so, that should give a nice speedup to several modes, not just gcm. (If you implement memxor3, beware that it needs to support some overlap, to not break in-place CBC decrypt). 
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
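The overlap constraint mentioned above is the reason memxor3 processes data in descending address order. A byte-wise sketch makes the idea visible (sketch_memxor3 is a made-up name; real implementations work a word at a time):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise sketch, hypothetical name. Going from high addresses to
   low means dst[k] is written only after every a[k'] with k' > k has
   already been read, so dst may sit at higher addresses overlapping
   the tail of a source operand -- the pattern in-place CBC decrypt
   relies on. */
void
sketch_memxor3 (uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  while (n-- > 0)
    dst[n] = a[n] ^ b[n];
}
```

An assembly version has to preserve the same descending order (or otherwise buffer enough data) to keep that overlap case correct.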
Re: [S390x] Optimize AES modes
ni...@lysator.liu.se (Niels Möller) writes: > (iii) I've considered doing it earlier, to make it easier to implement > aes without a round loop (like for all current versions of > aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load > all subkeys into registers and still have registers left to do two > or more blocks in parallel, but then we'd need to override > aes128_encrypt separately from the other aes*_encrypt. I've given this a try, see experimental patch below. It adds a x86_64/aesni/aes128-encrypt.asm, with a 2-way loop. It gives a very modest speedup, 5%, when I benchmark on my laptop (which is now a pretty fast machine, AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm. That gives a more significant speedup, almost 60%. I think the main reason for the speedup is that we avoid reloading subkeys between blocks. If we want to go this way, I wonder how to do it without an explosion of files and functions. For s390x, it seems each function will be very small, but not so for most other archs. There are at least three modes that have to process blocks sequentially, with no parallelism: CBC encrypt, CMAC, and XTS (there may be more). It's not so nice if we need (modes × ciphers) number of assembly files, with lots of duplication. Regards, /Niels

diff --git a/ChangeLog b/ChangeLog
index 3d19b1dd..68b8f632 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
 2021-04-01  Niels Möller
 
+	* cbc-aes128-encrypt.c (nettle_cbc_aes128_encrypt): New file and function.
+	* x86_64/aesni/cbc-aes128-encrypt.asm: New file.
+
+	* configure.ac (asm_replace_list): Add aes128-encrypt.asm
+	aes128-decrypt.asm.
+	* x86_64/aesni/aes128-encrypt.asm: New file, with 2-way loop.
+	* x86_64/aesni/aes128-decrypt.asm: Likewise.
+
 	Move aes128_encrypt and similar functions to their own files. To
 	make it easier for assembly implementations to override specific
 	AES variants. 
diff --git a/Makefile.in b/Makefile.in
index 8d474d1e..b6b983fd 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -101,7 +101,8 @@ nettle_SOURCES = aes-decrypt-internal.c aes-decrypt.c aes-decrypt-table.c \
 		 camellia256-set-encrypt-key.c camellia256-crypt.c \
 		 camellia256-set-decrypt-key.c \
 		 camellia256-meta.c \
-		 cast128.c cast128-meta.c cbc.c \
+		 cast128.c cast128-meta.c \
+		 cbc.c cbc-aes128-encrypt.c \
 		 ccm.c ccm-aes128.c ccm-aes192.c ccm-aes256.c cfb.c \
 		 siv-cmac.c siv-cmac-aes128.c siv-cmac-aes256.c \
 		 cnd-memcpy.c \
diff --git a/cbc-aes128-encrypt.c b/cbc-aes128-encrypt.c
new file mode 100644
index ..5f7d1c8c
--- /dev/null
+++ b/cbc-aes128-encrypt.c
@@ -0,0 +1,42 @@
+/* cbc-aes128-encrypt.c
+
+   Copyright (C) 2013, 2014 Niels Möller
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at your
+       option) any later version.
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at your
+       option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/. 
+*/
+
+#if HAVE_CONFIG_H
+# include "config.h"
+#endif
+
+#include "cbc.h"
+
+void
+nettle_cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src)
+{
+  CBC_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
+}
diff --git a/cbc.h b/cbc.h
index 93b2e739..beece610 100644
--- a/cbc.h
+++ b/cbc.h
@@ -35,6 +35,7 @@
 #define NETTLE_CBC_H_INCLUDED
 
 #include "nettle-types.h"
+#include "aes.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -79,6 +80,10 @@ memcpy((ctx)->iv, (data), sizeof((ctx)->iv))
 			   sizeof((self)->iv), (self)->iv,	\
 			   (length), (dst), (src)))
 
+struct cbc_aes128_ctx CBC_CTX(struct aes128_ctx, AES_BLOCK_SIZE);
+void
+nettle_cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/configure.ac b/configure.ac
index be2916c1..26e41d89 100644
--- a/configure.ac
+++ b/configure.ac
@@ -544,6 +544,7 @@ fi
 # Files whic
Re: [S390x] Optimize AES modes
Maamoun TK writes: >> I've tried out a split, see below patch. It's a rather large change, >> moving pieces to new places, but nothing difficult. I'm considering >> committing this to the s390x branch, what do you think? >> > > I agree, I'll modify the patch of basic AES-128 optimized functions to be > built on top of the splitted aes functions. Ok, pushed to the s390x branch now. > memxor performs the same in C and assembly since s390 architecture offers > memory xor instruction "xc" see xor_len macro in machine.m4 of the original > patch for an implementation example. But the C implementation is somewhat complicated, splitting into several cases depending on alignment, and shifting data around to be able to do word operations. If it can be done simpler with the xc instruction, that would at least cut some overhead. (Note that memxor3 must support the overlap case needed by cbc decrypt). > However, s390x AES accelerators offer considerable speedup against C > implementation with optimized internal AES. The following table > demonstrates the idea more clearly: > > Function S390x accelerator C implementation with optimized > internal AES (Only enable aes128.asm, aes192.asm, aes256.asm) > --- [...] > CBC AES128 Decrypt 0.647008 cpb 3.131405 cpb [...] > CTR AES128 Crypt 0.710237 cpb 4.767290 cpb For these two, the speed difference should essentially be the time for the C implementation of memxor. "cpb" means cycles per byte, right? 2-4 cycles per byte for memxor is quite slow. On my x86_64 laptop (ok, comparing apples to oranges), memxor, for the aligned case, is 0.08 cpb, and memxor3 twice as much. And even the C implementation is not that much slower. > GCM AES128 Encrypt 0.630504 cpb 15.473187 cpb For GCM, are there instructions that combine AES-CTR and GCM HASH? Or are those done separately? It would be nice to have GCM HASH being fast by itself, for performance with other ciphers than aes. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. 
Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
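As an aside on the cpb figures above: cycles per byte is simply clock frequency divided by throughput, which is what nettle-benchmark reports when given a frequency. A trivial conversion helper (hypothetical, for illustration only):

```c
#include <assert.h>

/* Hypothetical helper, not part of nettle-benchmark: cycles per byte
   from a measured throughput in bytes per second and the CPU clock
   frequency in Hz. */
static double
cycles_per_byte (double bytes_per_second, double hz)
{
  return hz / bytes_per_second;
}
```

So, e.g., 0.647 cpb on a machine processing one byte per cycle per GHz corresponds to throughput a bit above 1.5 bytes per cycle.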
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes: >> The new feature also needs documentation, will you look into that once >> code, and in particular the interfaces, are solid? >> > Definitely! > What do you think the documentation should look like? Should it be > near paragraph 7.2.1? Like > > 7.2.1.1 AES Key Wrap That's one possibility, but I think it would also be natural to put it somewhere under or close to "7.4. Authenticated encryption and associated data", even though there's no associated data. That section could perhaps be retitled to "Authenticated encryption" to generalize it? Or possibly under "7.3 Cipher modes", if it's too different from the AEAD constructions. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Compile issue on Solaris 11.3
Jeffrey Walton writes: > --enable-fat turns on cpu identification and runtime switching. I need > that. I need AES. I don't need SHA. It is impossible to get into a > good configuration. I don't think it's worthwhile to add complexity to configure and the fat machinery, and testing thereof, to make it flexible enough for that use case. In your case, you need it because your assembler is missing support for instructions added to the architecture 7.5 years ago. Are there other use cases where more flexibility would be beneficial? I might consider it, if someone else wants to do the work, and it turns out it doesn't get too messy. To get it to work in your setting, I would suggest one of: (i) Stick to --disable-assembler, to get something that works but is slow (and unfortunately a performance regression since nettle-3.6). (ii) Upgrade your assembler to a version that recognizes the sha instructions (not sure which assembler you're using, I did ask when you reported the problem back in January, but I haven't seen an answer). I would be a bit surprised if support for these instructions is still missing in recent releases of Oracle's development tools, if that's what you're using. > Nettle wastes a fair amount of our time trying to work through these problems. To be honest, high performance on the proprietary and somewhat obscure Solaris operating system is not going to be a high priority for me, in particular version 11.3, which is soon officially end of support (January 2024, according to Wikipedia, curiously the same date as for the much older Solaris 10). Correctness, which you'd get with --disable-assembler, is considerably more important. I'm willing to help getting Nettle to work on obscure and obsolete systems, as long as the cost in added complexity is small. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. 
___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Compile issue on Solaris 11.3
Jeffrey Walton writes: >> I added --disable-x86-sha-ni and it still produces the error. How is >> the ASM being used if it is disabled??? You need to choose *either* --enable-fat (now the default), *or* use the explicit config options for particular instructions. Mixing is not supported. Don't do that. And I think this is at least the third time I point this out to you, most recently just a few days ago. If, e.g., you deeply dislike the way Nettle's configure works and would like it to change, your current behavior is not a productive way of improving anything. It is annoying me and wasting my time. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize AES modes
Maamoun TK writes: >> > only: >> > variables: >> > - $S390X_SSH_IP_ADDRESS >> > - $S390X_SSH_PRIVATE_KEY >> > - $S390X_SSH_CI_DIRECTORY >> >> What does this mean? Ah, it excludes the job if these variables aren't >> set? >> > > Yes, this is what it does according to gitlab ci docs > <https://docs.gitlab.com/ee/ci/yaml/#onlyexcept-basic>. otherwise, fresh > forks will have always-unsuccessful job. Hmm, docs aren't quite clear, but it doesn't seem to work as is. I accidentally set the new S390X_ACCOUNT variable to "protected", and then the job was started but with $S390X_ACCOUNT expanding to the empty string, and failing. Perhaps it needs to be written as - $FOO != "" instead? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize AES modes
Maamoun TK writes: > Isn't it better to define S390X_SSH_IP_ADDRESS variable rather than > hard-code the remote server address in .gitlab-ci.yml? fresh forks now need > to update .gitlab-ci.yml to get a S390x job which is a bit unwieldy in my > opinion. Makes sense. I've added it as a variable, and renamed to S390X_ACCOUNT. Value is of the form username@ip-address. > Yes, this is what it does according to gitlab ci docs > <https://docs.gitlab.com/ee/ci/yaml/#onlyexcept-basic>. otherwise, fresh > forks will have always-unsuccessful job. Ok, added a section

only:
  variables:
    - $SSH_PRIVATE_KEY
    - $S390X_ACCOUNT

Still on the master-updates branch, will merge as soon as the run looks green. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: xts.c:59: warning: integer constant is too large for ‘long’ type
Jeffrey Walton writes: > This is building Nettle 3.7.2 on a PowerMac with OS X 10.5: > > /usr/bin/cc -I. -I/usr/local/include -DNDEBUG -DHAVE_CONFIG_H -g2 -O2 > -mlong-double-64 -fno-common -maltivec -fPIC -pthread -ggdb3 > -Wno-pointer-sign -Wall -W -Wmissing-prototypes > -Wmissing-declarations -Wstrict-prototypes -Wpointer-arith > -Wbad-function-cast -Wnested-externs -fPIC -MT xts-aes128.o -MD -MP > -MF xts-aes128.o.d -c xts-aes128.c \ > && true > xts.c: In function ‘xts_shift’: > xts.c:59: warning: integer constant is too large for ‘long’ type > xts.c:59: warning: integer constant is too large for ‘long’ type > xts.c:60: warning: integer constant is too large for ‘long’ type > xts.c:60: warning: integer constant is too large for ‘long’ type > xts.c:60: warning: integer constant is too large for ‘long’ type > > On OS X 10.5, you have to use unsigned long long and the ull suffix. This is confusing. The xts_shift function is not in nettle-3.7.2, as far as I can tell, it was deleted long ago in https://git.lysator.liu.se/nettle/nettle/-/commit/685cc919a37b60d3f81dd569bf6e93ad7be0f89b. > Maybe you should add a configure test to see whether you need the ull suffix. The current related code uses UINT64_C for the 64-bit constants. No configure test needed. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
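For illustration, UINT64_C wraps a constant so it gets type uint64_t with whatever suffix the platform requires, which is why no configure test is needed even where plain long is 32 bits. This snippet is an example, not code from xts.c:

```c
#include <assert.h>
#include <stdint.h>

/* Example only: UINT64_C picks the right integer-constant suffix for
   the platform (e.g. ull where long is 32 bits), so large 64-bit
   constants are portable without any configure check. */
static const uint64_t mask = UINT64_C (0xff00ff00ff00ff00);
```

An unsuffixed 0xff00ff00ff00ff00 is exactly what triggers the "integer constant is too large for 'long' type" warning on an ILP32 compiler.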
Re: bug#47222: Serious bug in Nettle's ecdsa_verify
Ludovic Courtès writes: > Are there plans to make a new 3.5 release including these fixes? No, I don't plan any 3.5.x release. > Alternatively, could you provide guidance as to which commits should be > cherry-picked in 3.5 for downstream distros? Look at the branch release-3.7-fixes (https://git.lysator.liu.se/nettle/nettle/-/commits/release-3.7-fixes/). The commits since 3.7.1 are the ones you need. Changes to gostdsa and ed448 will not apply, since those curves didn't exist in nettle-3.5. Changes to ed25519 might not apply cleanly, due to refactoring when adding ed448. > I’m asking because in Guix, the easiest way for us to deploy the fixes > on the ‘master’ branch would be by “grafting” a new Nettle variant > ABI-compatible with 3.5.1, which is the one packages currently depend on. I still recommend upgrading to the latest version. There was an abi break in 3.6 (so you'd need to recompile lots of guix packages), but no incompatible changes to the (source level) api. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: libhgwwed has gone missing...
Jeffrey Walton writes: > It looks like Nettle is no longer building or installing hogweed on > some Apple platforms. > > This is from a PowerMac G5 running OS X 10.5: Most likely the configure check for libgmp failed. Check config.log for details. I think the most recent change to the gmp dependency was in nettle-3.6, which requires gmp-6.1 or later. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [S390x] Optimize AES modes
Maamoun TK writes: > I managed to get the tarball approach working in gitlab ci with the > following steps: Thanks for the research. I've added a test job based on these ideas. See https://git.lysator.liu.se/nettle/nettle/-/commit/c25774e230985a625fa5112f3f19e03302e49e7f. An almost identical setup was run successfully as https://gitlab.com/gnutls/nettle/-/jobs/1125145345. > - In gitlab go to settings -> CI / CD. Expand Variables and add the > following variables: > >- S390X_SSH_IP_ADDRESS: username@instance_ip >- S390X_SSH_PRIVATE_KEY: private key of ssh connection >- S390X_SSH_CI_DIRECTORY: name of directory in remote server where the >tarball is extracted and tested I made only the private key a variable (and of type "file", which means it's stored in a temporary file, with the file name in $SSH_PRIVATE_KEY). The others are defined in the .gitlab-ci.yml file. > - Update gitlab-ci.yml as follows: > >- Add this line to variables category at the top of file: > > DEBIAN_BUILD: buildenv-debian I used the same fedora image as for the simpler build jobs. > script: > - tar --exclude=.git --exclude=gitlab-ci.yml -cf - . | ssh > $S390X_SSH_IP_ADDRESS "cd $S390X_SSH_CI_DIRECTORY/$CI_PIPELINE_IID && tar > -xf - && I'm using ./configure && make dist instead, then we get a bit of testing of that too. On the remote side, the directory name is based on $CI_PIPELINE_IID, which seems to be a good way to get one directory per job. > only: > variables: > - $S390X_SSH_IP_ADDRESS > - $S390X_SSH_PRIVATE_KEY > - $S390X_SSH_CI_DIRECTORY What does this mean? Ah, it excludes the job if these variables aren't set? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: [AArch64] Fat build support for GCM optimization and syntax improvements
Maamoun TK writes: > I made a merge request #21 > <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/21> that adds > fat build support for GCM implementation on arm64, the patch also updates > the README file to stay on par with the other architectures and use m4 > macros in gcm-hash.asm (patch provided by Niels Möller), in addition to add > documentation comments. Thanks! Merged to master-updates, for testing. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Nettle 3.7.2 and OS X 10.12.6
Jeffrey Walton writes: > And it looks like examples are not quite working either: > > $ make check > ... > > All 110 tests passed > > Making check in examples > TEST_SHLIB_DIR="/Users/jwalton/Build-Scripts/nettle-3.7.2/.lib" \ > srcdir="." EMULATOR="" EXEEXT="" \ > ".."/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test > Opening `testkey' failed: No such file or directory > Invalid key > FAIL: rsa-sign > Opening `testkey' failed: No such file or directory > Invalid key > FAIL: rsa-verify > Opening `testkey.pub' failed: No such file or directory > Invalid key > FAIL: rsa-encrypt > === > 3 of 3 tests failed > === > make[1]: *** [check] Error 1 > make: *** [check] Error 2 > > $ find . -name testkey.pub > $ find . -name testkey My best guess is that your operating system fails to regard the scripts examples/setup-env and teardown-env as executable (similarly to the main run-tests script). The setup-env script is supposed to create those files. The executability bit that is set on certain files in the tarball must be honored for the build to work correctly. Please do whatever it takes to convince your build environment to do that. > Examples have been breaking the build for years. Why are examples even > built during 'make check'? The tests that are failing for you act as a kind of integration-level test for the library. I think that has some value. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Nettle 3.7.2 and OS X 10.5
Jeffrey Walton writes: > I enabled Altivec builds with > --enable-power-altivec and --enable-fat. Don't do that. As I've tried to explain before, that combination makes no sense. --enable-power-altivec means "unconditionally use the altivec code". --enable-fat (now the default) means "let the fat setup code determine at runtime if altivec (and other) features should be used". That said, I haven't done any tests of the altivec code on Mac. I'd have to rely on help from Mac users to fix any problems. > Auditing the dylib it appears Altivec was not engaged: > > $ otool -tV /usr/local/lib/libnettle.dylib | grep perm > 0001f124b _nettle_sha3_permute > _nettle_sha3_permute: > 000204ecbl _nettle_sha3_permute > > I think there's something a bit sideways here. You're a bit too terse, I have no idea what problem this is intended to illustrate. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
ANNOUNCE: Nettle-3.7.2
I've prepared a new bug-fix release of Nettle, a low-level cryptographic library, to fix a serious bug in the function to verify ECDSA signatures. Implications include an assertion failure, which could be used for denial-of-service, when verifying signatures on the secp224r1 and secp521r1 curves. More details in NEWS file below. Upgrading is strongly recommended. The Nettle home page can be found at https://www.lysator.liu.se/~nisse/nettle/, and the manual at https://www.lysator.liu.se/~nisse/nettle/nettle.html. The release can be downloaded from https://ftp.gnu.org/gnu/nettle/nettle-3.7.2.tar.gz ftp://ftp.gnu.org/gnu/nettle/nettle-3.7.2.tar.gz https://www.lysator.liu.se/~nisse/archive/nettle-3.7.2.tar.gz Regards, /Niels NEWS for the Nettle 3.7.2 release This is a bugfix release, fixing a bug in ECDSA signature verification that could lead to a denial of service attack (via an assertion failure) or possibly incorrect results. It also fixes a few related problems where scalars are required to be canonically reduced modulo the ECC group order, but in fact may be slightly larger. Upgrading to the new version is strongly recommended. Even when no assert is triggered in ecdsa_verify, ECC point multiplication may get invalid intermediate values as input, and produce incorrect results. It's trivial to construct alleged signatures that result in invalid intermediate values. It appears difficult to construct an alleged signature that makes the function misbehave in such a way that an invalid signature is accepted as valid, but such attacks can't be ruled out without further analysis. Thanks to Guido Vranken for setting up the fuzzer tests that uncovered this problem. The new version is intended to be fully source and binary compatible with Nettle-3.6. The shared library names are libnettle.so.8.3 and libhogweed.so.6.3, with sonames libnettle.so.8 and libhogweed.so.6. Bug fixes: * Fixed bug in ecdsa_verify, and added a corresponding test case. 
* Similar fixes to ecc_gostdsa_verify and gostdsa_vko. * Similar fixes to eddsa signatures. The problem is less severe for these curves, because (i) the potentially out-of-range value is derived from output of a hash function, making it harder for the attacker to hit the narrow range of problematic values, and (ii) the ecc operations are inherently more robust, and my current understanding is that unless the corresponding assert is hit, the verify operation should complete with a correct result. * Fix to ecdsa_sign, which with a very low probability could return out of range signature values, which would be rejected immediately by a verifier. -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. signature.asc Description: PGP signature ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
ANNOUNCE: Serious bug in Nettle's ecdsa_verify
I've been made aware of a bug in Nettle's code to verify ECDSA signatures. Certain signatures result in the ecc point multiply function being called with out-of-range scalars, which may give incorrect results, or crash in an assertion failure. It's an old bug, probably since Nettle's initial implementation of ECDSA. I've just pushed fixes for ecdsa_verify, as well as a few other cases of potentially out-of-range scalars, to the master-updates branch. I haven't fully analysed the implications, but I'll describe my current understanding. I think an assertion failure, useful for a denial-of-service attack, is easy to trigger on the curves where the bitsize of q, the group order, is not an integral number of words. That's secp224r1, on 64-bit platforms, and secp521r1. Even when it's not possible to trigger an assertion failure, it's easy to produce valid-looking input "signatures" that hit out-of-range intermediate scalar values where point multiplication may misbehave. This applies to all the NIST secp* curves as well as the GOST curves. To me, it looks very difficult to make it misbehave in such a way that ecdsa_verify will think an invalid signature is valid, but it might be possible; further analysis is needed. I will not be able to analyze it properly now, if anyone else would like to look into it, I can provide a bit more background. ed25519 and ed448 may be affected too, but it appears a bit harder to find inputs that hit out of range values. And since point operations are inherently more robust on these curves, I think they will produce correct results as long as they don't hit the assert. Advice on how best to deal with this? My current plan is to prepare a 3.7.2 bugfix release (from a new bugfix-only branch, without the new arm64 code). Maybe as soon as tomorrow (Wednesday, european time), or over the weekend. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. 
___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Status update
@@ IF_LE(`
 	eor	C0.16b,C0.16b,D.16b
 
-	PMUL	C1,H1M,H1L
-	PMUL_SUM	C0,H2M,H2L
+	PMUL(C1,H1M,H1L)
+	PMUL_SUM(C0,H2M,H2L)
 
-	REDUCTION	D
+	REDUCTION(D)
 
 	and	LENGTH,LENGTH,#31
@@ -284,9 +287,9 @@ IF_LE(`
 	eor	C0.16b,C0.16b,D.16b
 
-	PMUL	C0,H1M,H1L
+	PMUL(C0,H1M,H1L)
 
-	REDUCTION	D
+	REDUCTION(D)
 
 Lmod:
 	tst	LENGTH,#15
@@ -325,9 +328,9 @@ Lmod_8_load:
 Lmod_8_done:
 	eor	C0.16b,C0.16b,D.16b
 
-	PMUL	C0,H1M,H1L
+	PMUL(C0,H1M,H1L)
 
-	REDUCTION	D
+	REDUCTION(D)
 
 Ldone:
 IF_LE(`
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes: > I've added 2 macro definitions: MSB_XOR_T_WRAP and MSB_XOR_T_UNWRAP, > I couldn't find how to make just one macro for both cases because of > the direction of the xor. Hmm. Maybe better to define an optional swap operation. Like

#if WORDS_BIGENDIAN
#define bswap_if_le(x) (x)
#elif HAVE_BUILTIN_BSWAP64
#define bswap_if_le(x) (__builtin_bswap64 (x))
#else
static uint64_t
bswap_if_le (uint64_t x)
{
  x = ((x >> 32) & UINT64_C(0xffffffff)) | ((x & UINT64_C(0xffffffff)) << 32);
  x = ((x >> 16) & UINT64_C(0x0000ffff0000ffff)) | ((x & UINT64_C(0x0000ffff0000ffff)) << 16);
  x = ((x >> 8) & UINT64_C(0x00ff00ff00ff00ff)) | ((x & UINT64_C(0x00ff00ff00ff00ff)) << 8);
  return x;
}
#endif

and then use as

B.u64[0] = A.u64 ^ bswap_if_le((n * j) + (i + 1));

Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
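The fallback branch of the swap above is the standard byte-swap ladder; a standalone version with the masks written out (shown here only so the constants can be checked independently, not code from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Standard bswap64 ladder: swap 32-bit halves, then 16-bit units
   within each half, then bytes within each 16-bit unit. */
static uint64_t
bswap64 (uint64_t x)
{
  x = ((x >> 32) & UINT64_C (0x00000000ffffffff))
    | ((x & UINT64_C (0x00000000ffffffff)) << 32);
  x = ((x >> 16) & UINT64_C (0x0000ffff0000ffff))
    | ((x & UINT64_C (0x0000ffff0000ffff)) << 16);
  x = ((x >> 8) & UINT64_C (0x00ff00ff00ff00ff))
    | ((x & UINT64_C (0x00ff00ff00ff00ff)) << 8);
  return x;
}
```

The function is its own inverse, which is what makes a single bswap_if_le usable for both the wrap and unwrap directions.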
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes:
> memcpy (I.b + 8, R + (i * 8), 8); // This one works
> I.u64[1] = *(R + (i * 8)); // This one doesn't work
>
> Is there something I'm missing?

The reason it doesn't work is the type of R. R is now an unaligned uint8_t *. *(R + (i * 8)) (the same as R[i*8]) is a uint8_t, not a uint64_t.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes:
> I still have one unresolved comment about byte swapping but the rest
> are resolved.

Thanks. I'll do this round of comments on email, since it might be of interest to other contributors.

* About the byteswapping comment, the code

    // A = MSB(64, B) ^ t where t = (n*j)+i
    A64 = READ_UINT64(B.b);
    A64 ^= (n*j)+(i+1);
    WRITE_UINT64(A.b, A64);

  could be replaced by something like

    #if WORDS_BIGENDIAN
    A.u64 = B.u64 ^ (n*j)+(i+1);
    #elif HAVE_BUILTIN_BSWAP64
    A.u64 = B.u64 ^ __builtin_bswap64((n*j)+(i+1));
    #else
    ... READ_UINT64 / WRITE_UINT64 or some other workaround ...
    #endif

  preferably encapsulated into a single macro, so it doesn't have to be duplicated in both the wrap and the unwrap function. There's another example of using __builtin_bswap64 in ctr.c.

* Initialization: If you don't intend to use the initial values, omit initialization in declarations like

    union nettle_block16 I = {0}, B = {0};
    union nettle_block8 A = {0};

  That helps tools like valgrind detect accidental use of uninitialized data. (And then I'm not even sure exactly how initializers are interpreted for a union type.)

* Some or all memcpys in the main loop can be replaced by uint64_t operations, e.g.,

    I.u64 = A.u64;

  instead of

    memcpy(I.b, A.b, 8);

  (memcpy is needed when either the left or right hand side is an unaligned byte buffer). If it turns out that you never use .b on some variable, you can drop the use of the union type for that variable and use uint64_t directly.

> Therefore I removed 'uint8_t R[64]' to use TMP_GMP_DECL(R, uint8_t);
> instead.

Unfortunately, that doesn't work: this code should go into libnettle (not libhogweed), and then it can't depend on GMP. You could do plain malloc + free, but according to the README file, Nettle doesn't do memory allocation, so that's not ideal. I think it should be doable to reuse the output buffer as temporary storage (R = ciphertext for wrap, R = cleartext for unwrap).
In-place operation (ciphertext == cleartext) should be supported (but no partial overlap), so it's important to test that case. Using the output area directly has the drawback that it isn't aligned, so you'll need to keep some memcpys in the main loop. One could consider using an aligned pointer into the output buffer and separate handling of the first and/or last block, but if that's a lot of extra complexity, I wouldn't do it unless either (i) it gives a significant performance improvement, or (ii) it turns out to actually be reasonably nice and clean.

* And one more nit: indentation. It's fine to use TAB characters, but they must be assumed to be traditional TAB to 8 positions; changing the appearance of TAB to anything else in one's editor is wrong, because it makes the code look weird for everyone else (e.g., in gitlab's ui). And the visual appearance should follow GNU standards: braces on their own lines, indent steps of two spaces, which means usually SPC characters, with TAB only for large indentation.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: Add AES Key Wrap (RFC 3394) in Nettle
Nicolas Mora writes:
> I've updated the MR with the new functions definitions and added test
> cases based on the test vectors from the RFC.
>
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/19

I've added a couple of comments on the MR. One question: do you intentionally limit the message size to 64 bytes? Is that according to spec?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: Status update
Maamoun TK writes:
>> 1. New Arm64 code (don't recall current status off the top of my head).
>
> I almost forget about fat build, do you want fat support before merging the
> code to the master branch or it's ok to be made afterward?

I've merged the arm64 branch now, thanks! Fat build would be nice. And I'd like to change to m4 macros.

Do you plan to work on arm64 implementations of more algorithms? If I've got it right, there are extensions with AES and SHA instructions? Chacha/salsa20 could benefit from general SIMD instructions.

>> 2. s390x testing. I'd prefer to not run a git checkout on the s390x test
>> machine, but have the ci job make a tarball, ssh it over to the test
>> machine, unpack in a fresh directory for build and test. This needs
>> to be in place before adding s390x specific code. When done, could
>> likely be reused for remote testing on any other platforms of
>> interest, which aren't directly available in the ci system.
>
> Done!

Thanks! Sorry I'm a bit slow, but I hope to be able to set up an account and try this out reasonably soon.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: HPKE implementation
Norbert Pocs writes:
> My current project is the implementation of HPKE draft [0]. The first goal
> is to implement mode_base.

Hi, I was not aware of this work. It could make sense to support in Nettle, in particular if GnuTLS wants to use it.

Which combinations of public key mechanism, key derivation/expansion, and aead are of main interest? Do you expect the specification to be finalized soon?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.