Latency in polynomial evaluation

2022-01-29 Thread Niels Möller

I think this is a promising alternative, if one would otherwise need to
interleave a large number of blocks to get full utilization of the
multipliers.

** How to choose **

When implementing one of those schemes, different processor resources
may be the bottleneck. I'd expect it to be one of

 o  Multiply latency, i.e., the latency of the dependency chain from one
block to the next (including also a few additions, but multiply latency
will be the main part). If this is the bottleneck, it means all
other instructions can be scheduled in parallel, and the processor
will sit idle for some cycles, waiting for a multiply to complete.
Typical latency for a multiply is about 5 times longer than for an
addition (but the ratio differs quite a bit between processors, of course).

 o  Multiply throughput, i.e., the maximum number of (independent) multiply
instructions that can be run per cycle. Typical number is 0.5 -- 2.
If this is the bottleneck, the processor will spend some cycles idle,
waiting for a multiplier to be ready to accept a new input.

 o  Instruction issue width. A superscalar processor can issue several
instructions in the same cycle, but there's a fixed, small limit.
Typical number is 2 -- 6. So, e.g., if the processor can issue at most
4 instructions per cycle, the evaluation loop consists of 40
instructions, and the loop actually runs in close to 10 cycles per
iteration, then instruction issue is the bottleneck.

The tricks discussed in this note are useful for finding an evaluation
scheme where multiply latency isn't a bottleneck. But once a loop hits
the limit on multiply throughput or instructions per cycle, other tricks
are needed to optimize further. In particular, the postponed reduction
has a cost in multiply throughput, since it needs some additional
multiply instructions.

I think one should aim to hit the limit on multiply throughput; that one
is hard to negotiate (it's possible to reduce the number of multiply
instructions somewhat, via the Karatsuba trick, but due to the additional
overhead, that is likely to be useful only on processors with
particularly low multiply throughput).
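
As a rough illustration of how the three limits interact, here is a toy
back-of-the-envelope calculation in C; all the per-processor numbers are
made-up placeholders, not measurements of any particular core.

#include <stdio.h>

/* Sketch: lower bounds on cycles per block implied by each of the three
   resources discussed above. Whichever bound is largest is the
   bottleneck; the aim is for the multiply throughput bound to dominate. */
int
main (void)
{
  double mul_latency = 5.0;       /* cycles, block-to-block dependency chain */
  double chain_adds = 2.0;        /* additions on that chain, 1 cycle each */
  double muls_per_block = 8.0;    /* multiply instructions per block */
  double mul_throughput = 1.0;    /* multiplies started per cycle */
  double insns_per_block = 40.0;  /* total instructions per block */
  double issue_width = 4.0;       /* instructions issued per cycle */

  printf ("latency bound:    %.1f cycles/block\n", mul_latency + chain_adds);
  printf ("throughput bound: %.1f cycles/block\n", muls_per_block / mul_throughput);
  printf ("issue bound:      %.1f cycles/block\n", insns_per_block / issue_width);
  return 0;
}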

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.



Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
Maamoun TK  writes:

> Great! I believe this is the best we can get for processing one block.

One may be able to squeeze out one or two cycles more using the mulx
extension, which should make it possible to eliminate some of the move
instructions (I don't think moves cost any execution unit resources, but
they do consume decoding resources).

> I'm trying to implement two-way interleaving using AVX extension and
> the main instruction of interest here is 'vpmuludq' that does double
> multiply operation

My manual seems a bit confused about whether it's called pmuludq or
vpmuludq. But you're thinking of the instruction that does two 32x32 --> 64
multiplies? It will be interesting to see how that works out! It does
half the work compared to a 64x64 --> 128 multiply instruction, but
accumulation/folding may get more efficient by using vector registers.
(There also seems to be an AVX variant doing four 32x32 --> 64
multiplies, using 256-bit registers.)
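
For reference, a minimal intrinsics sketch of the two widths being
discussed; just an illustration of what the instruction computes, not the
interleaved Poly1305 code itself (and the 256-bit form needs AVX2):

#include <immintrin.h>

/* pmuludq / vpmuludq: multiply the low 32 bits of each 64-bit lane,
   producing full 64-bit products. */
static __m128i
two_products (__m128i a, __m128i b)
{
  return _mm_mul_epu32 (a, b);      /* two 32x32 --> 64 multiplies */
}

#ifdef __AVX2__
static __m256i
four_products (__m256i a, __m256i b)
{
  return _mm256_mul_epu32 (a, b);   /* four 32x32 --> 64 multiplies */
}
#endif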

> the main concern here is there's a shortage of XMM registers, as
> there are only 16 of them. I'm working on addressing this issue by using
> memory operands for the key values in 'vpmuludq', and hope the processor
> cache does its thing here. 

Reading cached values from memory is usually cheap. So that's probably
fine, as long as the values being modified are kept in registers.

> I'm expecting to complete the assembly implementation tomorrow.

If my analysis of the single-block code is right, I'd expect it to be
rather important to trim the number of instructions per block.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

>> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
>
> And I've now tried the same method for the x86_64 implementation. See
> attached file + needed patch to asm.m4. This gives 2.9 GByte/s. 
>
> I'm not entirely sure the cycle numbers are accurate, with the clock
> frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz,
> and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
> instructions per cycle, 0.5 multiply instructions per cycle.
>
> This laptop has an AMD zen2 processor, which should be capable of
> issuing four instructions per cycle and completing one multiply
> instruction per cycle (according to
> https://gmplib.org/~tege/x86-timing.pdf). 
>
> This seems to indicate that on this hardware, speed is not limited by
> multiplier throughput; instead, the bottleneck is instruction
> decoding/issuing, with a maximum of four instructions per cycle.

Benchmarked also on my other nearby x86_64 machine (Intel Broadwell
processor). It's faster there too (from 1.4 GByte/s to 1.75 GByte/s). I'd
expect it to be generally faster, and have pushed it to the master-updates
branch.

I haven't looked that carefully at what the old code was doing, but I
think the final folding for each block used a multiply instruction that
depended on the previous ones for that block, increasing the per-block
latency. With the new code, all multiplies done for a block are
independent of each other.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-26 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

>> This is the speed I get for C implementations of poly1305_update on my
>> x86_64 laptop:
>>
>> * Radix 26: 1.2 GByte/s (old code)
>>
>> * Radix 32: 1.3 GByte/s
>>
>> * Radix 64: 2.2 GByte/s
[...]
>> For comparison, the current x86_64 asm version: 2.5 GByte/s.
[...]
> I've tried reworking folding, to reduce latency [...] With this trick I get on
> the same machine
>
> Radix 32: 1.65 GByte/s
>
> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.

And I've now tried the same method for the x86_64 implementation. See
attached file + needed patch to asm.m4. This gives 2.9 GByte/s. 

I'm not entirely sure the cycle numbers are accurate, with the clock
frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz,
and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
instructions per cycle, 0.5 multiply instructions per cycle.
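
(The arithmetic behind those numbers, assuming the clock really is pinned
at 2.1 GHz and 16-byte blocks; just a sanity check, not new data:)

#include <stdio.h>

int
main (void)
{
  double hz = 2.1e9;      /* assumed clock frequency */
  double rate = 2.9e9;    /* measured bytes per second */
  double block = 16.0;    /* poly1305 block size in bytes */

  printf ("%.1f cycles/block\n", hz * block / rate);  /* ~11.6 */
  printf ("%.2f cycles/byte\n", hz / rate);           /* ~0.72 */
  return 0;
}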

This laptop has an AMD zen2 processor, which should be capable of
issuing four instructions per cycle and completing one multiply
instruction per cycle (according to
https://gmplib.org/~tege/x86-timing.pdf). 

This seems to indicate that on this hardware, speed is not limited by
multiplier throughput; instead, the bottleneck is instruction
decoding/issuing, with a maximum of four instructions per cycle.

Regards,
/Niels

diff --git a/asm.m4 b/asm.m4
index 4ac21c20..60c66c25 100644
--- a/asm.m4
+++ b/asm.m4
@@ -94,10 +94,10 @@ C For 64-bit implementation
 STRUCTURE(P1305)
   STRUCT(R0, 8)
   STRUCT(R1, 8)
+  STRUCT(S0, 8)
   STRUCT(S1, 8)
-  STRUCT(PAD, 12)
-  STRUCT(H2, 4)
   STRUCT(H0, 8)
   STRUCT(H1, 8)
+  STRUCT(H2, 8)
 
 divert
C x86_64/poly1305-internal.asm

ifelse(`
   Copyright (C) 2013 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

.file "poly1305-internal.asm"

C Registers mainly used by poly1305_block
define(`CTX', `%rdi') C First argument to all functions

define(`KEY', `%rsi')
define(`MASK', `%r8')
C _poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
.text
ALIGN(16)
PROLOGUE(_nettle_poly1305_set_key)
W64_ENTRY(2,0)
mov $0x0ffffffc0fffffff, MASK
mov (KEY), %rax
and MASK, %rax
and $-4, MASK
mov %rax, P1305_R0 (CTX)
imul    $5, %rax
mov %rax, P1305_S0 (CTX)
mov 8(KEY), %rax
and MASK, %rax
mov %rax, P1305_R1 (CTX)
shr $2, %rax
imul    $5, %rax
mov %rax, P1305_S1 (CTX)
xor XREG(%rax), XREG(%rax)
mov %rax, P1305_H0 (CTX)
mov %rax, P1305_H1 (CTX)
mov %rax, P1305_H2 (CTX)

W64_EXIT(2,0)
ret

undefine(`KEY')
undefine(`MASK')

EPILOGUE(_nettle_poly1305_set_key)

define(`T0', `%rcx')
define(`T1', `%rsi')  C Overlaps message input pointer.
define(`T2', `%r8')
define(`H0', `%r9')
define(`H1', `%r10')
define(`F0', `%r11')
define(`F1', `%r12')

C Compute in parallel
C
C {H1,H0} = R0 T0 + S1 T1 + S0 (T2 >> 2)
C {F1,F0} = R1 T0 + R0 T1 + S1 T2
C T = R0 * (T2 & 3)
C
C Then accumulate as
C
C +--+--+--+
C |T |H1|H0|
C +--+--+--+
C   + |F1|F0|
C   --+--+--+--+
C |H2|H1|H0|
C +--+--+--+

C _poly1305_block (struct poly1305_ctx *ctx, const uint8_t m[16], unsigned hi)

PROLOGUE(_nettle_poly1305_block)
W64_ENTRY(3, 0)
push    %r12
mov (%rsi), T0
mov 8(%rsi), T1
mov XREG(%rdx), XREG(T2)  C Also zero extends

add P1305_H0 (CTX), T0
adc P1305_H1 (CTX), T1
adc P1305_H2 (CTX), T2

mov P1305_R1 (CTX), %rax
mul T0  C R1 T0
mov %rax, F0
mov %rdx, F1

mov T0, %rax  C Last use of T0 input
mov P1305_R0 (CTX), T0
mul T0  C R0*T0
mov %rax, H0
mov %rdx, H1

mov T

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-25 Thread Niels Möller
thout adding, to avoid
the explicit clearing of F0 and F1? It may also be doable with one
instruction less; the 5 instructions do 10 multiplies, but I think we
use only 7, so the rest must somehow be zeroed or ignored.

>   xxmrgld VSR(TMP), VSR(TMP), VSR(ZERO)
>   li  IDX, 32
>   xxswapd VSR(F0), VSR(F0)
>   vadduqm F1, F1, TMP
>   stxsdx  VSR(F0), IDX, CTX
>
>   li  IDX, 40
>   xxmrgld VSR(F0), VSR(ZERO), VSR(F0)
>   vadduqm F1, F1, F0
>   xxswapd VSR(F1), VSR(F1)
>   stxvd2x VSR(F1), IDX, CTX

This looks a bit verbose, if what we need to do is just to add the high
part of F0 to the low part of F1 (with carry into the high part of F1),
and store the result?

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> This is the speed I get for C implementations of poly1305_update on my
> x86_64 laptop:
>
> * Radix 26: 1.2 GByte/s (old code)
>
> * Radix 32: 1.3 GByte/s
>
> * Radix 64: 2.2 GByte/s
>
> It would be interesting with benchmarks on actual 32-bit hardware,
> 32-bit ARM likely being the most relevant arch.
>
> For comparison, the current x86_64 asm version: 2.5 GByte/s.

I've tried reworking the folding, to reduce latency. The idea is to let
the most significant state word be close to a full word, rather than
limited to <= 4 as in the previous version. When multiplying by r, split
one of the multiplies to take out the low 2 bits. For the radix 64
version, that term is

  B^2 t_2 * r0

Split t_2 as 4*hi + lo, then this can be reduced to

  B^2 lo * r0 + hi * 5*r0

(Using the same old B^2 = 5/4 (mod p) in a slightly different way).
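
To spell that out: with t_2 = 4*hi + lo,

  B^2 t_2 * r0 = (4 B^2) hi * r0 + B^2 lo * r0 = hi * 5*r0 + B^2 lo * r0  (mod p)

since 4 B^2 = 5 (mod p).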

The 5*r0 fits in one word and can be precomputed, and then this
multiplication goes in parallel with the other multiplies, with no
multiply left in the final per-block folding. With this trick I get, on
the same machine:

Radix 32: 1.65 GByte/s

Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.

I haven't yet done a strict analysis of bounds on the state and
temporaries, but I would expect that it works out with no possibility of
overflow.

See attached file. To fit the precomputed 5*r0 in a nice way I had to
rearrange the unions in struct poly1305_ctx a bit; I also attach the
patch to do this. The size of the struct should be the same, so I think
it can be done without any ABI bump.

Regards,
/Niels

diff --git a/poly1305.h b/poly1305.h
index 99c63c8a..6c13a590 100644
--- a/poly1305.h
+++ b/poly1305.h
@@ -55,18 +55,15 @@ struct poly1305_ctx {
   /* Key, 128-bit value and some cached multiples. */
   union
   {
-uint32_t r32[6];
-uint64_t r64[3];
+uint32_t r32[8];
+uint64_t r64[4];
   } r;
-  uint32_t s32[3];
   /* State, represented as words of 26, 32 or 64 bits, depending on
  implementation. */
-  /* High bits first, to maintain alignment. */
-  uint32_t hh;
   union
   {
-uint32_t h32[4];
-uint64_t h64[2];
+uint32_t h32[6];
+uint64_t h64[3];
   } h;
 };
 
/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
#include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;

#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s0 r.r64[2]
#define s1 r.r64[3]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 h.h64[2]

void
_nettle_poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64 (key);
  t1 = LE_READ_UINT64 (key + 8);

  ctx->r0 = t0 & UINT64_C (0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C (0x0ffffffc0ffffffc);
  ctx->s0 = 5*ctx->r0;
  ctx->s1 = 5*(ctx->r1 >> 2);

  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned m128)
{
  uint64_t t0, t1, t2;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) ctx->h1 + (t0 < ctx->h0) + LE_READ_UINT64(m+8);
  t1 = s;
  t2 = ctx->h2 + (s >> 64) + m128;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
 all the fk sums fit in 128 bits without overflow, with at least
 one bit margin. */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1) + M64(t2 >> 2, ctx->s0);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + M64(t2, ctx->s1)
+ ((nettle_uint128_t)((t2 & 3) * ctx->r0) << 64);

  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
}

/* Adds digest to the

Re: [PATCH v2 0/6] Add powerpc64 assembly for elliptic curves

2022-01-24 Thread Niels Möller
Amitay Isaacs  writes:

> I posted the modified codes in the earlier email thread, but I think
> posting them as a separate series will make them easier to cherry-pick.

Thanks!

> V2 changes:
>   - Use actual register names when storing/restoring from stack
>   - Drop m4 definitions which are not in use
>   - Simplify C2 folding for P192 curve
>
> Amitay Isaacs (2):
>   ecc: Add powerpc64 assembly for ecc_192_modp
>   ecc: Add powerpc64 assembly for ecc_224_modp
>
> Martin Schwenke (4):
>   ecc: Add powerpc64 assembly for ecc_384_modp
>   ecc: Add powerpc64 assembly for ecc_521_modp
>   ecc: Add powerpc64 assembly for ecc_25519_modp
>   ecc: Add powerpc64 assembly for ecc_448_modp

I merged secp192, secp384, secp521 a few days ago. The other three,
secp224, curve25519, curve448 look good too (with one very minor comment
fix which I can take care of). I'll do some local testing, then merge to
master-updates for a run of the ci system, including tests on ppc
big-endian.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [PATCH v2 5/6] ecc: Add powerpc64 assembly for ecc_25519_modp

2022-01-24 Thread Niels Möller
Amitay Isaacs  writes:

> --- /dev/null
> +++ b/powerpc64/ecc-curve25519-modp.asm
> @@ -0,0 +1,101 @@
> +C powerpc64/ecc-25519-modp.asm
> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`U0', `r6')   C Overlaps unused modulo input
> +define(`U1', `r7')
> +define(`U2', `r8')
> +define(`U3', `r9')
> +define(`T0', `r10')
> +define(`T1', `r11')
> +define(`M', `r12')
> +
> +define(`UN', r3)

Comment seems misplaced, it's UN / r3 that overlaps the unused input,
right?

> + C void ecc_curve25519_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp)
> + .text
> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_curve25519_modp)
> +
> + C First fold the limbs affecting bit 255
> + ld  UN, 56(XP)
> + li  M, 38
> + mulhdu  T1, M, UN
> + mulld   UN, M, UN
> + ld  U3, 24(XP)
> + li  T0, 0
> + addc    U3, UN, U3
> + adde    T0, T1, T0
> +
> + ld  UN, 40(XP)
> + mulhdu  U2, M, UN
> + mulld   UN, M, UN
> +
> + addc    U3, U3, U3
> + adde    T0, T0, T0
> + srdi    U3, U3, 1   C Undo shift, clear high bit
> +
> + C Fold the high limb again, together with RP[5]
> + li  T1, 19
> + mulld   T0, T1, T0
> + ld  U0, 0(XP)
> + ld  U1, 8(XP)
> + ld  T1, 16(XP)
> + addc    U0, T0, U0
> + adde    U1, UN, U1
> + ld  T0, 32(XP)
> + adde    U2, U2, T1
> + addze   U3, U3
> +
> + mulhdu  T1, M, T0
> + mulld   T0, M, T0
> + addc    U0, T0, U0
> + adde    U1, T1, U1
> + std U0, 0(RP)
> + std U1, 8(RP)
> +
> + ld  T0, 48(XP)
> + mulhdu  T1, M, T0
> + mulld   UN, M, T0
> + adde    U2, UN, U2
> + adde    U3, T1, U3
> + std U2, 16(RP)
> +     std     U3, 24(RP)
> +
> + blr
> +EPILOGUE(_nettle_ecc_curve25519_modp)

Looks good. I must admit that the x86_64 version this is based on is not
so easy to follow.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-24 Thread Niels Möller
Maamoun TK  writes:

> I made a performance test of this patch on the available architectures I
> have access to.
>
> Arm64 (gcc117 gfarm):
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 0.92 GByte/s
> * Radix 32: 0.55 GByte/s
> * Radix 64: 0.58 GByte/s
> POWER9:
> * Radix 26: 0.47 GByte/s
> * Radix 26 (2-way interleaved): 1.15 GByte/s
> * Radix 32: 0.52 GByte/s
> * Radix 64: 0.58 GByte/s
> Z15:
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 3.17 GByte/s
> * Radix 32: 0.82 GByte/s
> * Radix 64: 1.22 GByte/s

Interesting. I'm a bit surprised the radix-64 doesn't perform better, in
particular on arm64. (But I'm not yet familiar with arm64 multiply
instructions).

The numbers for 2-way interleaving are impressive; I'd like to understand
how that works. It might be useful to derive the corresponding multiply
throughput, i.e., the number of multiply operations (and with which
multiply instruction) completed per cycle, as well as the total cycles
per block.

It looks like the folding done per-block in the radix-64 code costs at
least 5 or so cycles per block (since these operations are all
dependent, and we also have the multiply by 5 in there, probably adding
a few cycles more). Maybe at least the multiply can be postponed.

> I tried to compile the new code with -m32 flag on x86_64 but I got
> "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
> target".

That's expected, in two ways: I don't expect radix-64 to give any
performance gain over radix-32 on any 32-bit archs. And I think __int128
is supported only on archs where it fits in two registers. If we start
using __int128 we need a configure test for it, and then it actually
makes things simpler, at least in this use case, if it stays
unsupported on 32-bit archs where it shouldn't be used.

So to compile with -m32, the radix-64 code must be #if:ed out.
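
A sketch of what that guard could look like; NETTLE_USE_INT128 is a
hypothetical configure-provided macro, used here only for illustration:

#if NETTLE_USE_INT128
typedef unsigned __int128 nettle_uint128_t;
/* ... radix-64 code, using 64x64 -> 128 products ... */
#else
/* ... portable radix-32 code, using 32x32 -> 64 products ... */
#endif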

> Also, I've disassembled the update function of Radix 64 and none of the
> architectures has made use of SIMD support (including x86_64 that hasn't
> used XMM registers which is standard for this arch, I don't know if gcc
> supports such behavior for C compiling but I'm aware that MSVC takes
> advantage of that standardization for further optimization on compiled C
> code).

The radix-64 code really wants multiply instruction(s) for 64x64 -->
128, and I think that's not so common in SIMD instruction sets (but
powerpc64 vmsumudm looks potentially useful?). Either as a single
instruction, or as a pair of mulhigh/mullow instructions. And some not
too complicated way to do a 128-bit add with proper carry propagation in
the middle.

Arm32 Neon does have 32x32 --> 64, which looks like a good fit for the
radix-32 variant.
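
To illustrate what such a mulhigh/mullow pair has to deliver, here is a
portable C sketch that builds the 128-bit product from 32x32 --> 64
pieces (exposition only; this is the work a single wide multiply, or a
mul/umulh-style pair, replaces):

#include <stdint.h>

/* 64x64 -> 128: *hi:*lo gets the full product of a and b. */
static void
umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;
  uint64_t p00 = a0 * b0;
  uint64_t p01 = a0 * b1;
  uint64_t p10 = a1 * b0;
  uint64_t p11 = a1 * b1;
  /* The middle sum cannot overflow 64 bits. */
  uint64_t mid = p01 + (p00 >> 32) + (uint32_t) p10;
  *lo = (mid << 32) | (uint32_t) p00;
  *hi = p11 + (mid >> 32) + (p10 >> 32);
}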

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> The current C implementation uses radix 26, and 25 multiplies (32x32
> --> 64) per block. And quite a lot of shifts. A radix 32 variant
> analogous to the above would need 16 long multiplies and 4 short. I'd
> expect that to be faster on most machines, but I'd have to try that out.

I've tried this out, see attached file. It has an #if 0/1 to choose
between radix 64 (depending on the non-standard __int128 type for
accumulated products) and radix 32 (portable C).

This is the speed I get for C implementations of poly1305_update on my
x86_64 laptop:

* Radix 26: 1.2 GByte/s (old code)

* Radix 32: 1.3 GByte/s

* Radix 64: 2.2 GByte/s

It would be interesting with benchmarks on actual 32-bit hardware,
32-bit ARM likely being the most relevant arch.

For comparison, the current x86_64 asm version: 2.5 GByte/s.

If I understood correctly, the suggestion to use radix 26 in djb's
original paper was motivated by a high-speed implementation using
floating point arithmetic (possibly in combination with SIMD), where the
product of two 26-bit integers can be represented exactly in an IEEE
double (but it gets a bit subtle if we want to accumulate several
products). I haven't really looked into implementing poly1305 with
either floating point or SIMD.

To improve test coverage, I've also extended poly1305 tests with tests
on random inputs, with results compared to a reference implementation
based on gmp/mini-gmp. I intend to merge those testing changes soon.
See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755ace309.

Unfortunately, the http interface of the main git repo at Lysator is
inaccessible at the moment due to an expired certificate; should be
fixed in a day or two.

Regards,
/Niels

/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
#include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;

#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s1 r.r64[2]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 hh

void
_nettle_poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64(key);
  t1 = LE_READ_UINT64(key + 8);

  ctx->r0 = t0 & UINT64_C(0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C(0x0ffffffc0ffffffc);
  ctx->s1 = 5*(ctx->r1 >> 2);

  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned t2)
{
  uint64_t t0, t1;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) (t0 < ctx->h0) + ctx->h1 + LE_READ_UINT64(m+8);
  t1 = s;
  t2 += (s >> 64) + ctx->h2;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
 all the fk sums fit in 128 bits without overflow, with at least
 one bit margin. */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + t2 * ctx->s1
+ ((nettle_uint128_t)(t2 * ctx->r0) << 64);

  /* Fold high part of f1. */
  f0 += 5*(f1 >> 66);
  f1 &= ((nettle_uint128_t) 1 << 66) - 1;
  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
  assert (ctx->h2 <= 4);
}

/* Adds digest to the nonce */
void
_nettle_poly1305_digest (struct poly1305_ctx *ctx, union nettle_block16 *s)
{
  uint64_t t0, t1, t2, c1, mask, s0;

  t0 = ctx->h0;
  t1 = ctx->h1;
  t2 = ctx->h2;

  /* Compute resulting carries when adding 5. */
  c1 = t0 > -(UINT64_C(5));
  t2 += (t1 + c1 < c1);

  /* Set if H >= 2^130 - 5 */
  mask = - (t2 >> 2);

  t0 += mask & 5;
  t1 += mask & c1;

  /* FIXME: Take advan

Re: [Arm64, S390x] Optimize Chacha20

2022-01-20 Thread Niels Möller
Maamoun TK  writes:

> As far as I understand, SIMD is called Advanced SIMD on AArch64 and it's
> standard for this architecture. simd is enabled by default in GCC but it
> can be disabled with nosimd option as I can see in here
> https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html which is why I made
> a specific config option for it.

If it's present on all known aarch64 systems (and HWCAP_ASIMD flag
always set), I think we can keep things simpler and use the code
unconditionally, with no extra subdir, no fat build function pointers or
configure flag.

I've pushed the merge button for the s390x merge request.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-20 Thread Niels Möller
Maamoun TK  writes:

> Wider multiplication would improve the performance for 64-bit general
> registers but as the case for the current SIMD implementation, the radix
> 2^26 fits well there.

If multiply throughput is the bottleneck, it makes sense to do as much
work as possible per multiply. So I don't think I understand the
benefits of interleaving, can you explain?

Let's consider the 64-bit case, since that's less writing. B = 2^64 as
usual. Then the state is

  H = h_2 B^2 + h_1 B + h_0 

(with h_2 rather small, depending on how far we normalize for each
block; let's assume at most 3 bits, or maybe even h_2 <= 4).

  R = r_1 B + r_0

By the spec, the high 4 bits of both r_0 and r_1, and the low 2 bits of
r_1, are zero, which makes the multiplication R H (mod p) particularly nice.

We get 

  R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) 
  + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2

But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where
the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5
r_1 / 4,

  R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)

These are 4 long multiplications (64x64 --> 128) and two short ones
(64x64 --> 64), for the products involving h_2. (The 32-bit version
would be 16 long multiplications and 4 short.)

From the zero high bits, we also get bounds on these terms,

 f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122

 f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2
< 2^125 + 5*2^61 + 2^127
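
(To check those bounds: the clamping gives r_0, r_1 < 2^60, so
r_1' = 5 r_1 / 4 < 5*2^58, and with h_0, h_1 < 2^64 and h_2 < 2^3:

  r_0 h_0 < 2^124 = 4*2^122,   r_1' h_1 < 5*2^122,
  r_1 h_0 + r_0 h_1 < 2^125,   r_1' h_2 < 5*2^61,   B r_0 h_2 < 2^127.)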

So these two chains can be added together as 128-bit quantities with no
overflow, in any order; there's plenty of parallelism. E.g., the power
vmsumudm instruction might be useful.

For the final folding, we need to split f_1 into its top 62 and low 66
bits, multiply the top part by 5 (the product fits in 64 bits), and add
it into f_0, which still fits in 128 bits.

And then take the top 64 bits of f_0 and add them into f_1 (the result
is roughly 2^66, so the top state word stays small).

The current C implementation uses radix 26, and 25 multiplies (32x32
--> 64) per block. And quite a lot of shifts. A radix 32 variant
analogous to the above would need 16 long multiplies and 4 short. I'd
expect that to be faster on most machines, but I'd have to try that out.


In contrast, trying to use a similar scheme for multiplying by (r^2 (mod
p)), as needed for an interleaved version, seems more expensive. There
are several contributions to the cost:

* First, the accumulation of products by power of B needs to take
  carries into account, as the result can exceed 2^128, so one would
  need something closer to general schoolbook multiplication.

* Second, since r^2 (mod p) may exceed 2^128, we need three words rather
  than two, so three more short multiplications to add in.

* Third, we can't pre-divide key words by 4, since low bits are no longer
  guaranteed to be zero. This gives more expensive reduction, with more
  multiplies by 5.

The first two points make a smaller radix more attractive; if we need
three words for both factors, we can distribute the bits to ensure some
of the most significant bits are zero.

> Since the loop of block iteration is moved to inside the assembly
> implementation, computing one multiple of key at the function prologue
> should be ok.

For large messages, that's fine, but may add a significant cost for
messages of just two blocks.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-19 Thread Niels Möller
Maamoun TK  writes:

> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.
>
> OpenSSL is still ahead in terms of performance speed since it uses 4-way
> interleaving or maybe more!!
> Increasing the interleaving ways more than two has nothing to do with
> parallelism since the execution units are already saturated by using 2-ways
> for the three architectures. The reason behind the performance improvement
> is the number of execution times of reduction procedure is cutted by half
> for 4-way interleaving since the products of multiplying state parts by key
> can be combined before the reduction phase. Let me know if you are
> interested in doing that on nettle!

Interesting. I haven't paid much attention to the poly1305
implementation since it was added back in 2013. The C implementation
doesn't try to use wider multiplication than 32x32 --> 64, which is poor
for 64-bit platforms. Maybe we could use unsigned __int128 if we can
write a configure test to check if it is available and likely to be
efficient?
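
Such a configure test would presumably just try to compile (and maybe
run) a tiny program along these lines; this is a sketch, not an existing
Nettle check:

/* Succeeds only where unsigned __int128 exists and is 128 bits wide. */
int
main (void)
{
  unsigned __int128 x = (unsigned __int128) 1 << 100;
  return (x >> 100) == 1 ? 0 : 1;
}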

For most efficient interleaving, I take it one should precompute some
powers of the key, similar to how it's done in the recent gcm code?

> It would be nice if the arm64 patch will be tested on big-endian mode since
> I don't have access to any big-endian variant for testing.

Merged this one too on a branch for ci testing.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, S390x] Optimize Chacha20

2022-01-19 Thread Niels Möller
Maamoun TK  writes:

> I created merge requests that have improvements of Chacha20 for arm64 and
> s390x architectures by following the approach used in powerpc
> implementation.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/37
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/40
> The patches have 80.85% speedup for arm64 arch and 284.79% speedup for
> s390x arch.

Nice, I've had a quick first look.

> It would be nice if the arm64 patch will be tested on big-endian mode since
> I don't have access to any big-endian variant for testing.

I've merged the arm64 code to a branch, for CI testing.

For the ARM code, which instructions are provided by the asimd
extension? Basic simd is always available, if I've understood correctly.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)

2022-01-10 Thread Niels Möller
Amitay Isaacs  writes:

> Compared to the current version in master branch, this version
> definitely improves the performance of the reduction code.
>
> On POWER9, the reduction code shows 7% speed up when tested separately.
>
> The improvement in P256 sign/verify is marginal.  Here are the numbers
> from hogweed-benchmark on POWER9.
>
>  
> name    size    sign/ms  verify/ms
> ecdsa    256    11.1013     3.5713  (master)
> ecdsa    256    11.1527     3.6011  (this patch)

Thanks for testing. Committed to the master branch now.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [PATCH 4/7] ecc: Add powerpc64 assembly for ecc_384_modp

2022-01-04 Thread Niels Möller
Amitay Isaacs  writes:

> diff --git a/powerpc64/ecc-secp384r1-modp.asm 
> b/powerpc64/ecc-secp384r1-modp.asm
> new file mode 100644
> index ..67791f09
> --- /dev/null
> +++ b/powerpc64/ecc-secp384r1-modp.asm
> @@ -0,0 +1,227 @@
> +C powerpc64/ecc-secp384r1-modp.asm

This looks nice (and it seems the folding scheme is the same as for
the x86_64 version). Just one minor thing,

> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_secp384r1_modp)
> +
> + std H0, -48(SP)
> + std H1, -40(SP)
> + std H2, -32(SP)
> + std H3, -24(SP)
> + std H4, -16(SP)
> + std H5, -8(SP)

I find it clearer to use register names rather than the m4 defines for
saving and restoring callee-save registers.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)

2022-01-04 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> I think it should be possible to reduce the number of needed registers,
>> and completely avoid using callee-save registers (load the values now in
>> U4-U7 one at a time, a bit closer to the places where they are needed),
>> and replace F3 with $1 in the FOLD and FOLDC macros.
>
> Attaching a variant to do this. Passes tests with qemu, but I haven't
> benchmarked it on any real hardware.

Would you like to test and benchmark this on relevant real hardware,
before I merge this version?

Code still below, and committed to the branch ppc-secp256-tweaks.

Regards,
/Niels

C powerpc64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation

   Based on x86_64/ecc-secp256r1-redc.asm

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

C Register usage:

define(`RP', `r4')
define(`XP', `r5')

define(`F0', `r3')
define(`F1', `r6')
define(`F2', `r7')
define(`T', `r8')

define(`U0', `r9')
define(`U1', `r10')
define(`U2', `r11')
define(`U3', `r12')

.file "ecc-secp256r1-redc.asm"

C FOLD(x), sets (x,F2,F1,F0)  <-- [(x << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLD', `
sldi    F0, $1, 32
srdi    F1, $1, 32
subfc   F2, F0, $1
subfe   $1, F1, $1
')

C FOLDC(x), sets (x,F2,F1,F0)  <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLDC', `
sldi    F0, $1, 32
srdi    F1, $1, 32
addze   T, $1
subfc   F2, F0, $1
subfe   $1, F1, T
')

C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp)
.text
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_ecc_secp256r1_redc)

ld  U0, 0(XP)
ld  U1, 8(XP)
ld  U2, 16(XP)
ld  U3, 24(XP)

FOLD(U0)
ld  T, 32(XP)
addc    U1, F0, U1
adde    U2, F1, U2
adde    U3, F2, U3
adde    U0, U0, T

FOLDC(U1)
ld  T, 40(XP)
addc    U2, F0, U2
adde    U3, F1, U3
adde    U0, F2, U0
adde    U1, U1, T

FOLDC(U2)
ld  T, 48(XP)
addc    U3, F0, U3
adde    U0, F1, U0
adde    U1, F2, U1
adde    U2, U2, T

FOLDC(U3)
ld  T, 56(XP)
addc    U0, F0, U0
adde    U1, F1, U1
adde    U2, F2, U2
adde    U3, U3, T

C If carry, we need to add in
C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
li  F0, 0
addze   F0, F0
neg F2, F0
sldi    F1, F2, 32
srdi    T, F2, 32
li  XP, -2
and T, T, XP

addc    U0, F0, U0
adde    U1, F1, U1
adde    U2, F2, U2
adde    U3, T, U3

std U0, 0(RP)
std U1, 8(RP)
    std U2, 16(RP)
std U3, 24(RP)

blr
EPILOGUE(_nettle_ecc_secp256r1_redc)


-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Build problem on ppc64be + musl

2022-01-04 Thread Niels Möller
Going through some old mail... From a discussion in September:

ni...@lysator.liu.se (Niels Möller) writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> I've tried a different approach on branch
>> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
>> below. (It makes sense to me to have the new check together with the ABI
>> check, but on second thought, probably a mistake to overload the ABI
>> variable. It would be better to have a separate configure variable, more
>> similar to the W64_ABI).
>
> Another iteration, on that branch (sorry for the typo in the branch
> name), or see patch below.
>
> Stijn, can you try it out and see if it works for you?

I haven't seen any response to this, but I've nevertheless just added
these changes on the master-updates branch. It would be nice if you can
confirm that it solves the problem with musl.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Status update

2021-12-17 Thread Niels Möller
Hi, just a heads up that I'll likely not be very responsive next few
weeks. I may or may not get some hacking time during Christmas holidays.

What I'd like to do when I get time: Review recent patches for powerpc
ecc and sm4. Complete support for ANSI x9.62 (I'm not really up-to-date
on the details, but the needed square root code is already in, and the
code to do the rest was posted by Wim Lewis long ago). Prepare a new
release. Maybe write salsa20 and chacha assembly for more platforms.

But not necessarily in that order. Feel free to reply with suggested
priorities, and remind me if there's something important that I've
missed.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.



Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)

2021-12-09 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Thanks! Merged to master-updates for ci testing.

And now merged to the master branch.

> I think it should be possible to reduce the number of needed registers,
> and completely avoid using callee-save registers (load the values now in
> U4-U7 one at a time, a bit closer to the places where they are needed),
> and replace F3 with $1 in the FOLD and FOLDC macros.

Attaching a variant to do this. Passes tests with qemu, but I haven't
benchmarked it on any real hardware.

C powerpc64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation

   Based on x86_64/ecc-secp256r1-redc.asm

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

C Register usage:

define(`RP', `r4')
define(`XP', `r5')

define(`F0', `r3')
define(`F1', `r6')
define(`F2', `r7')
define(`T', `r8')

define(`U0', `r9')
define(`U1', `r10')
define(`U2', `r11')
define(`U3', `r12')

.file "ecc-secp256r1-redc.asm"

C FOLD(x), sets (x,F2,F1,F0)  <-- [(x << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLD', `
sldi    F0, $1, 32
srdi    F1, $1, 32
subfc   F2, F0, $1
subfe   $1, F1, $1
')

C FOLDC(x), sets (x,F2,F1,F0)  <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)]
define(`FOLDC', `
sldi    F0, $1, 32
srdi    F1, $1, 32
addze   T, $1
subfc   F2, F0, $1
subfe   $1, F1, T
')

C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp)
.text
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_ecc_secp256r1_redc)

ld  U0, 0(XP)
ld  U1, 8(XP)
ld  U2, 16(XP)
ld  U3, 24(XP)

FOLD(U0)
ld  T, 32(XP)
addc    U1, F0, U1
adde    U2, F1, U2
adde    U3, F2, U3
adde    U0, U0, T

FOLDC(U1)
ld  T, 40(XP)
addc    U2, F0, U2
adde    U3, F1, U3
adde    U0, F2, U0
adde    U1, U1, T

FOLDC(U2)
ld  T, 48(XP)
addc    U3, F0, U3
adde    U0, F1, U0
adde    U1, F2, U1
adde    U2, U2, T

FOLDC(U3)
ld  T, 56(XP)
addc    U0, F0, U0
adde    U1, F1, U1
adde    U2, F2, U2
adde    U3, U3, T

C If carry, we need to add in
C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
li  F0, 0
addze   F0, F0
neg F2, F0
sldi    F1, F2, 32
srdi    T, F2, 32
li  XP, -2
and T, T, XP

addc    U0, F0, U0
adde    U1, F1, U1
adde    U2, F2, U2
adde    U3, T, U3

std U0, 0(RP)
std U1, 8(RP)
std U2, 16(RP)
std     U3, 24(RP)

blr
EPILOGUE(_nettle_ecc_secp256r1_redc)

> Regards,
> /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [PATCH] doc: documentation fot SM3 hash

2021-12-07 Thread Niels Möller
Tianjia Zhang  writes:

> Signed-off-by: Tianjia Zhang 
> ---
>  nettle.texinfo | 74 --
>  1 file changed, 72 insertions(+), 2 deletions(-)

Thanks! Merged now.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: powerpc ecc 256 redc (was Re: x86_64 ecc_256_redc)

2021-12-07 Thread Niels Möller
Amitay Isaacs  writes:

> On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in
> isolation, and ~1% speedup for ecdsa sign and verify over the earlier
> assembly version.

Thanks! Merged to master-updates for ci testing.

I think it should be possible to reduce the number of needed registers,
and completely avoid using callee-save registers (load the values now in
U4-U7 one at a time, a bit closer to the places where they are needed),
and replace F3 with $1 in the FOLD and FOLDC macros.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


x86_64 ecc_256_redc (was: Re: ARM64 ecc_256_redc)

2021-12-06 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I think the approach should apply to other 64-bit archs (should probably
> work also on x86_64, where it's sometimes tricky to avoid x86_64
> instructions clobbering the carry flag when it should be preserved, but
> probably not so difficult in this case).

x86_64 version below. I also trimmed register usage, so it no longer
needs to save and restore any registers. On my machine, this gives a
speedup of 17% for ecc_secp256r1_redc in isolation, a 3% speedup for
ecdsa sign, and a 7% speedup for ecdsa verify.

Regards,
/Niels

C x86_64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2013, 2021 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

.file "ecc-secp256r1-redc.asm"

define(`RP', `%rsi')
define(`XP', `%rdx')

define(`U0', `%rdi') C Overlaps unused modulo input
define(`U1', `%rcx')
define(`U2', `%rax')
define(`U3', `%r8')
define(`F0', `%r9')
define(`F1', `%r10')
define(`F2', `%r11')
define(`F3', `%rdx') C Overlap XP, used only in final carry folding

C FOLD(x), sets (x,F2,F1,F0)  <--  (x << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLD', `
mov $1, F0
mov $1, F1
mov $1, F2
shl `$'32, F0
shr `$'32, F1
sub F0, F2
sbb F1, $1
')
C FOLDC(x), sets (x,F2,F1,F0)  <--  ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLDC', `
mov $1, F0
mov $1, F1
mov $1, F2
adc `$'0, $1
shl `$'32, F0
shr `$'32, F1
sub F0, F2
sbb F1, $1
')
PROLOGUE(_nettle_ecc_secp256r1_redc)
W64_ENTRY(3, 0)

mov (XP), U0
FOLD(U0)
mov 8(XP), U1
mov 16(XP), U2
mov 24(XP), U3
add F0, U1
adc F1, U2
adc F2, U3
adc 32(XP), U0

FOLDC(U1)
add F0, U2
adc F1, U3
adc F2, U0
adc 40(XP), U1

FOLDC(U2)
add F0, U3
adc F1, U0
adc F2, U1
adc 48(XP), U2

FOLDC(U3)
add F0, U0
adc F1, U1
adc F2, U2
adc 56(XP), U3

C Sum, including carry, is < 2^{256} + p.
C If carry, we need to add in 2^{256} mod p = 2^{256} - p
C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
C and this addition can not overflow.
sbb F2, F2
mov F2, F0
mov F2, F1
mov XREG(F2), XREG(F3)
neg F0
shl $32, F1
and $-2, XREG(F3)

add F0, U0
mov U0, (RP)
adc F1, U1
mov U1, 8(RP)
adc F2, U2
mov U2, 16(RP)
adc F3, U3

mov U3, 24(RP)

W64_EXIT(3, 0)
ret
EPILOGUE(_nettle_ecc_secp256r1_redc)

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: ANNOUNCE: Serious bug in Nettle's ecdsa_verify - Critical Confirmation

2021-12-06 Thread Niels Möller
"Jayakumar, Jaikanth"  writes:

> There is a small confusion, I believe the bug reported here
> (https://lists.lysator.liu.se/pipermail/nettle-bugs/2021/009457.html)
> is related to CVE-2021-20305, right ? and this (CVE-2021-20305) is
> fixed in version 3.7.2.

Which *two* problems are you asking about? The problem referred to as 
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-20305
was fixed in nettle-3.7.2. 

Then there was a different problem, in RSA decryption,
https://cve.mitre.org/cgi-bin/cvename.cgi?name=2021-3580, fixed in
nettle-3.7.3.

> In the case it is the same, it would help big time if the CVE was
> mentioned somewhere in the bug announcement thread.

I'll try to remember to mention relevant CVE ids in future release
announcements. Would it also help to document them in the NEWS file?

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


ARM64 ecc_256_redc (was: Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc)

2021-12-05 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I'm looking at a different approach (experimenting on ARM64, which is
> quite similar to powerpc, but I don't yet have working code). To
> understand what the redc code is doing we need to keep in mind that what
> one folding step does is to compute
>
> + U0*p 
>
> which cancels the low limb, since p = -1 (mod 2^64). So since the low
> limb always cancel, what we need is
>
> + U0*((p+1)/2^64) 
>  
> The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} -
> p) * U0, subtracting in the folding step, and adding in the high part
> later. But one doesn't have to do it that way. One could instead use a
> FOLD macro that computes
>
>   (2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
>
> I also wonder if there's some way to use the carry out from one fold step
> and apply it at the right place while preparing the F0,F1,F2,F3 for the next 
> step.

I've got this working now, attaching the version with early carry
folding. Also checked in on the branch arm64-ecc. The preceding commit
(5ee0839bb28c092044fce09534651b78640518c4) collects carries and adds
them in as a separate pass over the data.

I've tested it only with qemu-aarch64; help with benchmarking on real
arm64 hardware is appreciated (just add the file in the arm64/ directory
and run ./config.status --recheck && ./config.status to have the build
pick it up).

I think the approach should apply to other 64-bit archs (should probably
work also on x86_64, where it's sometimes tricky to avoid x86_64
instructions clobbering the carry flag when it should be preserved, but
probably not so difficult in this case).

C arm64/ecc-secp256r1-redc.asm

ifelse(`
   Copyright (C) 2013, 2021 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

.file "ecc-secp256r1-redc.asm"

define(`RP', `x1')
define(`XP', `x2')

define(`U0', `x0') C Overlaps unused modulo input
define(`U1', `x3')
define(`U2', `x4')
define(`U3', `x5')
define(`U4', `x6')
define(`U5', `x7')
define(`U6', `x8')
define(`U7', `x9')
define(`F0', `x10')
define(`F1', `x11')
define(`F2', `x12')
define(`F3', `x13')
define(`ZERO', `x14')

C FOLD(x), sets (F3,F2,F1,F0)  <--  (x << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLD', `
lsl F0, $1, #32
lsr F1, $1, #32
subs    F2, $1, F0
sbc F3, $1, F1
')

C FOLDC(x), sets (F3,F2,F1,F0)  <--  ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32)
define(`FOLDC', `
lsl F0, $1, #32
lsr F1, $1, #32
adc F3, $1, ZERO  C May overflow, but final result will not.
subs    F2, $1, F0
sbc F3, F3, F1
')

PROLOGUE(_nettle_ecc_secp256r1_redc)
ldr U0, [XP]
ldr U1, [XP, #8]
ldr U2, [XP, #16]
ldr U3, [XP, #24]
ldr U4, [XP, #32]
ldr U5, [XP, #40]
ldr U6, [XP, #48]
ldr U7, [XP, #56]
mov ZERO, #0

FOLD(U0)
adds    U1, U1, F0
adcs    U2, U2, F1
adcs    U3, U3, F2
adcs    U4, U4, F3

FOLDC(U1)
adds    U2, U2, F0
adcs    U3, U3, F1
adcs    U4, U4, F2
adcs    U5, U5, F3

FOLDC(U2)
adds    U3, U3, F0
adcs    U4, U4, F1
adcs    U5, U5, F2
adcs    U6, U6, F3

FOLDC(U3)
adds    U4, U4, F0
adcs    U5, U5, F1
adcs    U6, U6, F2
adcs    U7, U7, F3

C Sum, including carry, is < 2^{256} + p.
C If carry, we need to add in 2^{256} mod p = 2^{256} - p
C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
C and this addition can not overflow.
adc F0, ZERO, ZERO
neg F2, F0
lsl F1, F2, #32
lsr F3, F2, #32
and F3, F3, #-2

adds    U0, F0, U4
adcs    U1, F1, U5
adcs    U2, F2, U6
adc 

Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc

2021-12-03 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> If this works,
> FOLD would turn into something like
>
>   sldiF0, $1, 32
>   srdiF1, $1, 32
>   subfc   F2, $1, F0
>   addme   F3, F1

I'm looking at a different approach (experimenting on ARM64, which is
quite similar to powerpc, but I don't yet have working code). To
understand what the redc code is doing we need to keep in mind that what
one folding step does is to compute

  <..., U2, U1, U0> + U0*p

which cancels the low limb, since p = -1 (mod 2^64). So since the low
limb always cancels, what we need is

  <..., U2, U1> + U0*((p+1)/2^64)
 
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} -
p) * U0, subtracting in the folding step, and adding in the high part
later. But one doesn't have to do it that way. One could instead use a
FOLD macro that computes

  (2^{192} - 2^{160} + 2^{128} + 2^{32}) U0

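To spell out the arithmetic of such a folding step, here is a minimal C
sketch (illustrative only, not part of any patch; it uses the gcc/clang
unsigned __int128 extension, and the carry out of the top limb is simply
ignored here):

  #include <stdint.h>

  /* One folding step for p = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1:
     after the low limb u[0] is cancelled, the remaining limbs get
     u[0] * (p+1)/2^{64} = u[0] * (2^{192} - 2^{160} + 2^{128} + 2^{32})
     added at their low end, i.e., into u[1..4]. */
  static void
  fold_once (uint64_t u[5])
  {
    uint64_t x = u[0];
    uint64_t f0 = x << 32;
    uint64_t f1 = x >> 32;
    uint64_t borrow = x < f0;
    uint64_t f2 = x - f0;
    uint64_t f3 = x - f1 - borrow;  /* no borrow out: x*(2^64+1) >= x*2^32 */

    unsigned __int128 t = (unsigned __int128) u[1] + f0;
    u[1] = (uint64_t) t;
    t = (unsigned __int128) u[2] + f1 + (uint64_t) (t >> 64);
    u[2] = (uint64_t) t;
    t = (unsigned __int128) u[3] + f2 + (uint64_t) (t >> 64);
    u[3] = (uint64_t) t;
    t = (unsigned __int128) u[4] + f3 + (uint64_t) (t >> 64);
    u[4] = (uint64_t) t;
  }
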
I also wonder if there's some way to use carry out from one fold step
and apply it at the right place while preparing the F0,F1,F2,F3 for the next 
step.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH v2 1/4] Add OSCCA SM3 hash algorithm

2021-12-02 Thread Niels Möller
Tianjia Zhang  writes:

> Hi Niels,
>
>> Would you mind writing a short description of the algorithm for the
>> manual? I think it should go under "Miscellaneous hash functions". Would
>> be nice with some brief background on this hash function (origin,
>> intended applications, when and where it's useful) plus reference docs
>> for the defined constants and functions.
>>
>
> SM3 is a cryptographic hash function standard adopted by the
> government of the People's Republic of China, which was issued by the
> Cryptography Standardization Technical Committee of China on December
> 17, 2010. The corresponding standard is "GM/T 0004-2012 "SM3
> Cryptographic Hash Algorithm"".
>
> SM3 algorithm is a hash algorithm in ShangMi cryptosystems. SM3 is
> mainly used for digital signature and verification, message
> authentication code generation and verification, random number
> generation, etc.

Thanks for the background.

>  Its algorithm is public. Combined with the public key
> algorithm SM2 and the symmetric encryption algorithm SM4, it can be
> used in various data security and network security scenarios such as
> the TLS 1.3 protocol, disk encryption, standard digital certificates,
> and digital signatures. 

I think the above two sentences could be removed or shortened. I think
the mention of TLS, with reference to RFC 8998, is the part most
relevant for the Nettle manual. Besides that, I think your text provides
the right level of detail.

> According to the State Cryptography
> Administration of China, its security and efficiency are equivalent to
> SHA-256.

This is relevant too.

> Thanks for your reminder, the above is the information I provided. Do
> I need to submit it to the document through PATCH?

If you can prepare a patch for nettle.texinfo, that would be ideal.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc

2021-12-01 Thread Niels Möller
Amitay Isaacs  writes:

> --- /dev/null
> +++ b/powerpc64/ecc-secp256r1-redc.asm
> @@ -0,0 +1,144 @@
> +C powerpc64/ecc-secp256r1-redc.asm
> +ifelse(`
> +   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
> +
> +   Based on x86_64/ecc-secp256r1-redc.asm

Looks good, and it seems the method follows the x86_64 version closely. I
just checked in a correction and a clarification to the comments to the
x86_64 version.

A few comments below.

> +C Register usage:
> +
> +define(`SP', `r1')
> +
> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`F0', `r3')
> +define(`F1', `r6')
> +define(`F2', `r7')
> +define(`F3', `r8')
> +
> +define(`U0', `r9')
> +define(`U1', `r10')
> +define(`U2', `r11')
> +define(`U3', `r12')
> +define(`U4', `r14')
> +define(`U5', `r15')
> +define(`U6', `r16')
> +define(`U7', `r17')

One could save one register by letting U7 and XP overlap, since XP isn't
used after loading U7.

> + .file "ecc-secp256r1-redc.asm"
> +
> +C FOLD(x), sets (F3,F2,F1,F0)  <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
> +define(`FOLD', `
> + sldi    F2, $1, 32
> + srdi    F3, $1, 32
> + li  F0, 0
> + li  F1, 0
> + subfc   F0, F2, F0
> + subfe   F1, F3, F1

I think the 

li  F0, 0
li  F1, 0
subfc   F0, F2, F0
subfe   F1, F3, F1

could be replaced with 

subfic  F0, F2, 0       C "negate with borrow"
subfze  F1, F3 

If that is measurably faster, I can't say. 

Another option: Since powerpc, like arm, seems to use the proper two's
complement convention that "borrow" is not carry, maybe we don't need to
negate to F0 and F1 at all, and instead change the later subtraction, replacing

subfc   U1, F0, U1
subfe   U2, F1, U2
subfe   U3, F2, U3
subfe   U0, F3, U0

with

addc    U1, F0, U1
adde    U2, F1, U2
subfe   U3, F2, U3
subfe   U0, F3, U0

I haven't thought that through, but it does make some sense to me. I
think the arm code propagates carry through a mix of add and sub
instructions in some places. Maybe F2 needs to be incremented
somewhere for this to work, but probably still cheaper. If this works,
FOLD would turn into something like

sldi    F0, $1, 32
srdi    F1, $1, 32
subfc   F2, $1, F0
addme   F3, F1

(If you want to investigate this later on, that's fine too, I could merge
the code with the current folding logic).

> + C If carry, we need to add in
> + C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
> + li  F0, 0
> + addze   F0, F0
> + neg F2, F0
> + sldi    F1, F2, 32
> + srdi    F3, F2, 32
> + li  U7, -2
> + and     F3, F3, U7

I think the three instructions to set F3 could be replaced with

srdi    F3, F2, 31
sldi    F3, F3, 1

Or maybe the and operation is faster than shift?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH v2 1/4] Add OSCCA SM3 hash algorithm

2021-12-01 Thread Niels Möller
Tianjia Zhang  writes:

> Add OSCCA SM3 secure hash (OSCCA GM/T 0004-2012 SM3) generic
> hash transformation.

Thanks, merged the patch series onto a branch "sm3" for testing, with
only minor changes.

> --- /dev/null
> +++ b/sm3.h
[...]
> +#define SM3_DIGEST_SIZE 32
> +#define SM3_BLOCK_SIZE 64
> +/* For backwards compatibility */
> +#define SM3_DATA_SIZE SM3_BLOCK_SIZE

I dropped the definition of SM3_DATA_SIZE, since this is a new feature
in Nettle, there's no old version to be compatible with.

Would you mind writing a short description of the algorithm for the
manual? I think it should go under "Miscellaneous hash functions". Would
be nice with some brief background on this hash function (origin,
intended applications, when and where it's useful) plus reference docs
for the defined constants and functions.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 1/7] ecc: Add powerpc64 assembly for ecc_192_modp

2021-11-30 Thread Niels Möller
Amitay Isaacs  writes:

> + .file "ecc-secp192r1-modp.asm"

Thanks, I'm looking at this file first (being the simplest, even though
the security level of this curve is a bit low for current usage, so
performance is not of such great importance).

I'm quite new to powerpc, so I'm refering to the instruction reference,
and trying to learn as we go along. It seems addc is addition with carry
output (but no carry input), adde is addition with carry input and
output, and addze is addition of zero with carry input and output.

> +define(`RP', `r4')
> +define(`XP', `r5')
> +
> +define(`T0', `r6')
> +define(`T1', `r7')
> +define(`T2', `r8')
> +define(`T3', `r9')
> +define(`C1', `r10')
> +define(`C2', `r11')

As I understand it, we could also use register r3 (unused input
argument), but we don't need to, since we have enough free scratch
registers.

> + C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
> + .text
> +define(`FUNC_ALIGN', `5')
> +PROLOGUE(_nettle_ecc_secp192r1_modp)
> + ld  T0, 0(XP)
> + ld  T1, 8(XP)
> + ld  T2, 16(XP)
> +
> + li  C1, 0
> + li  C2, 0
> +
> + ld  T3, 24(XP)
> + addc    T0, T3, T0
> + adde    T1, T3, T1
> + addze   T2, T2
> + addze   C1, C1
> +
> + ld  T3, 32(XP)
> + addc    T1, T3, T1
> + adde    T2, T3, T2
> + addze   C1, C1
> +
> + ld  T3, 40(XP)
> + addc    T0, T3, T0
> + adde    T1, T3, T1
> + adde    T2, T3, T2
> + addze   C1, C1

To analyze what we are doing, I'm using the Nettle and GMP convention
that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod
p). Denote the six input words as

  <a_5, a_4, a_3, a_2, a_1, a_0>

representing the number 

  B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0

The accumulation above, as I understand it, computes

  <c_1, t_2, t_1, t_0> = <a_2, a_1, a_0> + a_3 (B+1) + a_4 (B^2 + B)
                         + a_5 (B^2 + B + 1)

or more graphically,

      a_2 a_1 a_0
          a_3 a_3
      a_4 a_4
    + a_5 a_5 a_5
  ---------------
  c_1 t_2 t_1 t_0

This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the
addze instructions can increment it).

This looks nice, and I think it is pretty efficient too. It looks a bit
different from what the x86_64 code is doing; maybe the latter could be
improved.
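
To make the column picture above concrete, here is the same accumulation
in plain C (illustrative only, not the proposed assembly; it uses the
gcc/clang unsigned __int128 extension, and the later carry folds are
left out):

  #include <stdint.h>

  /* <c1, t2, t1, t0> = <a2, a1, a0> + a3 (B+1) + a4 (B^2+B)
     + a5 (B^2+B+1), with B = 2^64 and p = B^3 - B - 1. */
  static void
  secp192r1_accumulate (uint64_t t[3], uint64_t *c1, const uint64_t a[6])
  {
    unsigned __int128 s;

    s = (unsigned __int128) a[0] + a[3] + a[5];
    t[0] = (uint64_t) s;
    s = (s >> 64) + a[1] + a[3] + a[4] + a[5];
    t[1] = (uint64_t) s;
    s = (s >> 64) + a[2] + a[4] + a[5];
    t[2] = (uint64_t) s;
    *c1 = (uint64_t) (s >> 64);  /* small; folded back in as c1 (B+1) */
  }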

> + addc    T0, C1, T0
> + adde    T1, C1, T1
> + addze   T2, T2
> + addze   C2, C2

Above, c_1 is folded in at the right places, 

  <c_2, t_2, t_1, t_0>  <--  <t_2, t_1, t_0> + c_1 (B + 1)

This number is < B^3 + 3 (B+1). This implies that in the (quite
unlikely) case we get carry out, i.e., c_2 = 1, then the value of the
low three words is < 3 (B+1). That means that there can be no new carry
out when folding c_2.

> + li  C1, 0
> + addc    T0, C2, T0
> + adde    T1, C2, T1
> + addze   T2, T2
> + addze   C1, C1
> +
> + addc    T0, C1, T0
> + adde    T1, C1, T1
> + addze   T2, T2

So I think this final folding could be reduced to just

addc    T0, C2, T0
adde    T1, C2, T1
addze   T2, T2

There's no carry out from this, because either C2 was zero, or T2 was
small, <= 3. Does that make sense?

> + std T0, 0(RP)
> + std T1, 8(RP)
> + std T2, 16(RP)
> +
> + blr
> +EPILOGUE(_nettle_ecc_secp192r1_modp)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 0/7] Add powerpc64 assembly for elliptic curves

2021-11-28 Thread Niels Möller
Amitay Isaacs  writes:

> This series of patches add the powerpc64 assembly for modp/redc functions
> for elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results
> in 15-30% performance improvements as measured on POWER9 system using
> hogweed-benchmark.

Nice. For testing these functions, I recommend running

  while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done

and

  while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done

for a few hours.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 0/4] Introduce OSCCA SM3 hash algorithm

2021-11-28 Thread Niels Möller
Tianjia Zhang  writes:

> You can refer to the ISO specification here:
> https://www.iso.org/standard/67116.html
> Or PDF version:
> https://github.com/alipay/tls13-sm-spec/blob/master/sm-en-pdfs/sm3/GBT.32905-2016.SM3-en.pdf

I see that RFC 8998 refers to
http://www.gmbz.org.cn/upload/2018-07-24/1532401392982079739.pdf, which
looks like the same pdf file. I find it a bit odd that the document
carries no information on author or organization.

> The specification does not define the reference implementation of the
> algorithm. This series of patches mainly refers to the SM3
> implementation in libgcrypt and gnulib.

It looks like the gcrypt implementation is licensed under LGPLv2.1 or
later (see https://github.com/gpg/libgcrypt/blob/master/cipher/sm3.c),
so should be fine to copy into nettle (in contrast to gnulib code, which
appears to be GPLv3, and would need explicit permission from copyright
holder before relicensing). But if it is a derived work of libgcrypt, in
the sense of copyright law, the copyright header needs to acknowledge
that, ie,

   Copyright (C) 2017 Jia Zhang
   Copyright (C) 2021 Tianjia Zhang 

Or did you write both versions, with Jia being an alternate form of
your name?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH 0/4] Introduce OSCCA SM3 hash algorithm

2021-11-25 Thread Niels Möller
Tianjia Zhang  writes:

> Add OSCCA SM3 secure hash generic hash algorithm, described
> in OSCCA GM/T 0004-2012 SM3. 

Thanks, I've had a first quick look, and it looks nice. I don't know
much about this hash function, though. A few questions:

* Is there some reasonably authoritative English reference for the
  algorithm? I checked wikipedia, and it only links to an old internet
  draft, https://tools.ietf.org/id/draft-oscca-cfrg-sm3-02.html

* The name "sm3" is a bit short, would it make sense to add some
  family-prefix, maybe "oscca_sm3"? 
 
* Do you have some examples of protocols or applications that specify
  the use of sm3?

* The implementation, it's written from scratch, or is it based on some
  reference implementation?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [PATCH] Curve point decompression

2021-11-10 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Wim Lewis  writes:
>
>> Now that 3.5.1 is out, is there a chance this could be looked at?
> Not sure in which order to do things. Maybe it will be best to first add
> the square root routines, with tests, and then add functions for
> converting between points and octet strings (and related utilities, if
> needed).

I have added sqrt functions on the branch ecc-sqrt (sorry for a forced
update since previous attempt). So this is now on top of the changes to
the inversion improvements from last year. All the secpxxxr1 curves are
supported, but not the gost curves.

Tests pass (I have additional changes to enable randomized tests that
I'd like to commit in a few days), except that sqrt(0) fails for the
secp224 curve, where the implementation uses the full Tonelli-Shanks
algorithm. I'm looking at the algorithm description in Cohen's book (A
course in computational algebraic number theory), and it seems to not
work for this case.

If we need sqrt(0), it must be handled as a special case. Also, unlike
the other square root functions, it seems tricky to make the secp224r1
square root function side-channel silent. But I expect the main use case
of point decompression is for public input (secrets in elliptic curve
crypto tend to be scalars, not points), right?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


secp256r1 mod functions

2021-10-22 Thread Niels Möller
Hi,

a while ago I was asked to explain the 64-bit C versions of
ecc_secp256r1_modp and ecc_secp256r1_modq (in ecc-secp256r1.c), and I
found that a bit difficult.

I've rewritten them, on branch
https://git.lysator.liu.se/nettle/nettle/-/blob/secp256r1-mod/ecc-secp256r1.c.
Main difference is handling of the case that next quotient is close to
2^{64}: Old code allowed the quotient to overflow 64 bits, using an
additional carry variable q2. New code ensures that next quotient is
always at most 2^{64} - 1.

For the new implementation, the modp function is a special case of the
2/1 division in https://gmplib.org/~tege/division-paper.pdf (would
usually need 3/2 division to get sufficient accuracy, but reduces to 2/1
since the next most significant word of p is 0), and the modq function
is a special case of divappr2, described in
https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf.

I've not been able to measure any significant difference in speed (I get
somewhat noisy measurements from the examples/ecc-benchmark tool),
although I would expect the new code to be very slightly faster. These
functions are not that performance critical, since the bulk of the
reductions for this curve is done using redc, not mod.

Any additional testing, benchmarking, or code staring, is appreciated. I
will likely merge the new code to the master branch in a few days.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize SHA3 permute using vector facility

2021-10-22 Thread Niels Möller
Maamoun TK  writes:

> I've added a new patch that optimizes SHA3 permute function for S390x
> architecture https://git.lysator.liu.se/nettle/nettle/-/merge_requests/36
> More about the patch in merge request description.

Really nice speedup, and interesting that it's significantly faster than
your previous version using the special sha3 instructions.

I'm sorry the existing implementations are quite hard to follow, with
irregular data movements and rather unstructured comments. It must have
been a bit challenging to decipher the x86_64 version. Do you have any
ideas on how to improve documentation and comments?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Structural fixes to the manual

2021-09-22 Thread Niels Möller
I've spent some time to improve structure (mostly non-text changes) of
the manual.

1. Deleted all explicit node pointers in nettle.texinfo, instead letting
   makeinfo infer the node structure. This is the recommended way these
   days, according to texinfo documentation.

2. Changed the make rules producing nettle.pdf to use texi2pdf, instead
   of the chain texi2dvi + dvips + pstopdf. Most obvious result is that
   hyperlinks work better, and output file is slightly smaller. It's
   done in whatever way is default in texi2pdf, I haven't tried to check
   the details (e.g., what kind of fonts are used, and if they're all
   embedded in the file).

3. Split the huge Cipher functions node into one node per cipher.

4. Fixed a few places where urls or example code was too wide for the
   page.

According to the docs
(https://www.gnu.org/software/texinfo/manual/texinfo/html_node/URL-Line-Breaking.html),
line breaks should be automatically added in urls when needed (and
that's true also according to the docs for texinfo-6.7, which is what I
have installed), but that didn't work at all when I tried it, so I've
added a few explicit hints on how to break long urls. Also the
@urefbreakstyle command wasn't recognized at all. Anyone here more
familiar with texinfo that can explain?

Regards,
/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize SHA1 with fat build support

2021-09-20 Thread Niels Möller
Maamoun TK  writes:

> I got almost 12% speedup of optimizing the sha3_permute() function using
> the SHA hardware accelerator of s390x, is it worth adding that assembly
> implementation?

For such a small assembly function, I think it's worth the effort (more
questionable if it was worth adding the special instructions for it...).

If you have the time, you could also try out doing it with vector
registers, like on x86_64 and arm/neon. Some difficulties in the x86_64
implementation were (i) xmm register shortage, (ii) moving 64-bit pieces
between the 128-bit xmm registers, and (iii) rotating the 64-bit pieces
of an xmm register by different shift counts.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Feature request: OCB mode

2021-09-18 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> If someone wants to work on it, please post to the list. I might look
> into it myself, but as you have noticed, I have rather limited hacking
> time.

I've given it a try, see branch ocb-mode. Based on RFC 7253. Passes
tests, but not particularly optimized. Some comments and questions:

1. Most of the operations use only the encrypt function of the underlying
   block cipher. Except ocb decrypt, which needs *both* the decrypt
   function and the encrypt function. For ciphers that use different key
   setup for encrypt and decrypt, e.g., AES, that means that to decrypt
   OCB one needs to initialize two separate aes128_ctx (see the usage
   sketch after this list), and call the somewhat unwieldy

  void
  ocb_decrypt (struct ocb_ctx *ctx, const struct ocb_key *key,
               const void *encrypt_ctx, nettle_cipher_func *encrypt,
               const void *decrypt_ctx, nettle_cipher_func *decrypt,
               size_t length, uint8_t *dst, const uint8_t *src);

2. It's not obvious how to best manage the different L_i values. Can be
   computed upfront, on demand, or cached in some way. Current code
   computes only L_*, L_$ and L_0 up front (part of ocb_set_key), and
   the others recomputed each time they're needed.

3. The processing of the authenticated data doesn't depend on the nonce
   in any way. That means that if one processes several messages with
   the same key and associated data, the associated data can be
   processed once, with the same sum reused for all messages.

   Is that something that is useful in practice, and which nettle
   interfaces should support?

4. The way the nonce is used seems designed to allow cheap incrementing
   of the nonce. The nonce is used to determine

 Offset_0 = Stretch[1+bottom..128+bottom]

   where "bottom" is the least significant 6 bits of the nonce, acting as
   a shift, and "Stretch" is independent of those nonce bits, so
   unchanged on all but one out of 64 nonce increments.

   Should nettle support some kind of auto-incrementing nonce that takes
   advantage of this? Nettle does something similar for UMAC (not sure
   if there are others).
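
A hedged usage sketch for point 1 above (illustrative only; the ocb
declarations are assumed to come from the branch, while the aes128 calls
and the function-pointer casts are the existing Nettle API):

  #include <nettle/aes.h>
  /* plus the ocb declarations from the branch */

  static void
  decrypt_with_ocb (struct ocb_ctx *ctx, const struct ocb_key *key,
                    const uint8_t *aes_key,
                    size_t length, uint8_t *dst, const uint8_t *src)
  {
    struct aes128_ctx enc, dec;
    aes128_set_encrypt_key (&enc, aes_key);
    aes128_set_decrypt_key (&dec, aes_key);

    ocb_decrypt (ctx, key,
                 &enc, (nettle_cipher_func *) aes128_encrypt,
                 &dec, (nettle_cipher_func *) aes128_decrypt,
                 length, dst, src);
  }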

As I said, current code is not particularly optimized, but OCB has
potential to be quite fast. The per-block processing for authentication
of the message (not associated data) is just an XOR. And
encryption/decryption can be done several blocks in parallel, like CTR
mode. If we do, e.g., 4 or 8 blocks at a time, there will be a fairly
regular structure of the needed Offset_i values, possibly making them
cheaper to setup, but I haven't yet looked into those details.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


CBC-AES (was: Re: [S390x] Optimize AES modes)

2021-09-13 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I've also added a cbc-aes128-encrypt.asm.
> That gives more significant speedup, almost 60%. I think main reason for
> the speedup is that we avoid reloading subkeys between blocks.

I've continued this path, see branch aes-cbc. The aes128 variant is at 

https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes128-encrypt.asm

Benchmark results are positive but a bit puzzling. On my laptop (AMD
Ryzen 5) I get

aes128  ECB encrypt 5450.18

This is the latest version, doing two blocks per iteration.

aes128  CBC encrypt  547.34

The general CBC mode written in C, with one call to aes128_encrypt per
block. 10(!) times slower than ECB.

cbc_aes128  encrypt  865.11

The new assembly function. Almost 60% speedup over the old code, which
is nice, and large enough that it seems motivated to have the new
functin. But still 6 times slower than ECB. I'm not sure why. Let's look
a bit closer at cycle numbers.

Not sure I get accurate cycle numbers (it's a bit tricky with variable
features and turbo modes and whatnot), but it looks like ECB mode is 6
cycles per block, which would be consistent with issue of two aesenc
instructions per block. While the CBC mode is 37 cycles per block,
almost 4 cycles per aesenc. 

This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii)
the processor's out-of-order machinery results in as many as 7-8 blocks
processed in parallel when executing the ECB loop, i.e., instruction
issue for 3-4 iterations through the loop before the results of the
first iteration are ready.

The interface for the new function is 

  struct cbc_aes128_ctx CBC_CTX(struct aes128_ctx, AES_BLOCK_SIZE);
  void
  cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, 
                     uint8_t *dst, const uint8_t *src);

I'm not that fond of the struct cbc_aes128_ctx though, which includes
both (constant) subkeys and iv. So I'm considering changing that to

  void
  cbc_aes128_encrypt(const struct aes128_ctx *ctx, uint8_t *iv,
                     size_t length, uint8_t *dst, const uint8_t *src);

I.e., similar to cbc_encrypt, but without the arguments
nettle_cipher_func *f, size_t block_size.
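
A sketch of how the two variants would be called side by side (the
specialized name is just the proposal above, not a final interface):

  #include <nettle/aes.h>
  #include <nettle/cbc.h>

  static void
  encrypt_blocks (const uint8_t *key, uint8_t *iv,
                  size_t length, uint8_t *dst, const uint8_t *src)
  {
    struct aes128_ctx aes;
    aes128_set_encrypt_key (&aes, key);

    /* Current generic interface, one indirect cipher call per block: */
    cbc_encrypt (&aes, (nettle_cipher_func *) aes128_encrypt,
                 AES_BLOCK_SIZE, iv, length, dst, src);

    /* Proposed specialized variant, same semantics:
       cbc_aes128_encrypt (&aes, iv, length, dst, src); */
  }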

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Build problem on ppc64be + musl

2021-09-02 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I've tried a different approach on branch
> https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
> below. (It makes sense to me to have the new check together with the ABI
> check, but on second thought, probably a mistake to overload the ABI
> variable. It would be better to have a separate configure variable, more
> similar to the W64_ABI).

Another iteration, on that branch (sorry for the typo in the branch
name), or see patch below.

Stijn, can you try it out and see if it works for you?

Regards,
/Niels

diff --git a/config.m4.in b/config.m4.in
index d89325b8..b98a5817 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -5,6 +5,7 @@ define(`COFF_STYLE', `@ASM_COFF_STYLE@')dnl
 define(`TYPE_FUNCTION', `@ASM_TYPE_FUNCTION@')dnl
 define(`TYPE_PROGBITS', `@ASM_TYPE_PROGBITS@')dnl
 define(`ALIGN_LOG', `@ASM_ALIGN_LOG@')dnl
+define(`ELFV2_ABI', `@ELFV2_ABI@')dnl
 define(`W64_ABI', `@W64_ABI@')dnl
 define(`RODATA', `@ASM_RODATA@')dnl
 define(`WORDS_BIGENDIAN', `@ASM_WORDS_BIGENDIAN@')dnl
diff --git a/configure.ac b/configure.ac
index ebec8759..2ed4ab4e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -311,6 +311,9 @@ AC_SUBST([GMP_NUMB_BITS])
 # Figure out ABI. Currently, configurable only by setting CFLAGS.
 ABI=standard
 
+ELFV2_ABI=no # For powerpc64
+W64_ABI=no   # For x86_64 windows
+
 case "$host_cpu" in
   [x86_64 | amd64])
 AC_TRY_COMPILE([
@@ -355,6 +358,15 @@ case "$host_cpu" in
 ], [
   ABI=64
 ])
+if test "$ABI" = 64 ; then
+  AC_TRY_COMPILE([
+#if _CALL_ELF == 2
+#error ELFv2 ABI
+#endif
+  ], [], [], [
+   ELFV2_ABI=yes
+  ])
+fi
 ;;
   aarch64*)
 AC_TRY_COMPILE([
@@ -750,7 +762,6 @@ IF_DLL='#'
 LIBNETTLE_FILE_SRC='$(LIBNETTLE_FORLINK)'
 LIBHOGWEED_FILE_SRC='$(LIBHOGWEED_FORLINK)'
 EMULATOR=''
-W64_ABI=no
 
 case "$host_os" in
   mingw32*|cygwin*)
@@ -1031,6 +1042,7 @@ AC_SUBST(ASM_TYPE_FUNCTION)
 AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
+AC_SUBST(ELFV2_ABI)
 AC_SUBST(W64_ABI)
 AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
diff --git a/powerpc64/machine.m4 b/powerpc64/machine.m4
index 187a49b8..b59f0863 100644
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -1,7 +1,7 @@
 define(`PROLOGUE',
 `.globl C_NAME($1)
 DECLARE_FUNC(C_NAME($1))
-ifelse(WORDS_BIGENDIAN,no,
+ifelse(ELFV2_ABI,yes,
 `ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 C_NAME($1):
 addis 2,12,(.TOC.-C_NAME($1))@ha
@@ -17,7 +17,7 @@ ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 undefine(`FUNC_ALIGN')')
 
 define(`EPILOGUE',
-`ifelse(WORDS_BIGENDIAN,no,
+`ifelse(ELFV2_ABI,yes,
 `.size C_NAME($1), . - C_NAME($1)',
 `.size .C_NAME($1), . - .C_NAME($1)
 .size C_NAME($1), . - .C_NAME($1)')')

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Reorganization of x86_64 aesni code

2021-09-02 Thread Niels Möller
I've merged a reorganization of the x86_64 aesni code to the
master-updates branch for testing. This replaces the
x86_64/aesni/aes-*crypt-internal.asm files with separate files for the
different key sizes, as has been discussed earlier.

And I've implemented 2-way interleaving, i.e., doing 2 blocks at a time,
which gave a nice speedup on the order of 15% in my tests. It may be
worthwhile to go to 3-way or 4-way, but I don't plan to try that soon.

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Feature request: OCB mode

2021-09-01 Thread Niels Möller
Justus Winter  writes:

> we (Sequoia PGP) would love to see OCB being implemented in Nettle.  The
> OpenPGP working group is working on a revision of RFC4880, which will
> mostly be a cryptographic refresh, and will bring AEAD to OpenPGP.
>
> The previous -now abandoned- draft called for EAX being mandatory, and
> OCB being optional [0].  This was motivated by OCB being encumbered by
> patents.  However, said patents were waived by the holder [1].
>
> 0: 
> https://datatracker.ietf.org/doc/html/draft-ietf-openpgp-rfc4880bis-10#section-9.6
> 1: https://mailarchive.ietf.org/arch/msg/cfrg/qLTveWOdTJcLn4HP3ev-vrj05Vg/

That's good news, I hadn't seen that. Then OCB gets a lot more
interesting. And https://datatracker.ietf.org/doc/html/rfc7253 is a
proper reference (there seems to be a couple of different versions of
OCB)?

> Unfortunately, we don't have the expertise in our team to contribute a
> patch, and we currently aren't in a position to offer funding for the
> implementation.

If someone wants to work on it, please post to the list. I might look
into it myself, but as you have noticed, I have rather limited hacking
time.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Big endian tests (no mips)

2021-08-30 Thread Niels Möller
Maamoun TK  writes:

> On Mon, Aug 23, 2021 at 8:59 PM Niels Möller  wrote:
>
>> I would like to keep testing on big-endian. s390x is big-endian, right?
>> And so is powerpc64 (non -el). So it would be nice to configure cross
>> tests on one of those platforms configured with --disable-assembler, to
>> test portability of the C code. Are s390x cross tools and qemu-user in
>> good enough shape (it's an official debian release arch), or is
>> powerpc64 a better option?
>>
>
> Yes, s390x is big-endian and it's good for such purposes. Along being
> officially supported in debian releases, it runs natively on remote
> instance in gitlab CI.

I've just added an s390x cross-build to the gitlab ci, with
--disable-assembler to exercise all #if WORDS_BIGENDIAN. 

I noticed that for some of the archs (powerpc64, powerpc64el, s390x,
i.e., the ones not used in gnutls tests) we don't have any cross
libgmp-dev packages preinstalled in the image, and since we don't
explicitly install them either, there's no test coverage of public key
functions in these builds. I'll see if I can fix that.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Big endian tests (no mips) (was: Re: Build problem on ppc64be + musl)

2021-08-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Unfortunately, the CI cross builds aren't working at the moment (the
> buildenv images are based on Debian Buster ("stable" at the time images
> were built), and nettle's ci scripts do apt-get update and apt-get
> install, which now attempts to get Bullseye packages (new "stable" since
> a week ago)).

Images now updated to debian stable (thanks, Daiki!). But we'll have to
drop mips tests for now, since current setup assumes archs under tests
are available in debian, and mips has been discontinued as a debian
release architecture. Other cross builds now work (change to drop mips
is on the master-updates branch). If you have ideas on how to revive mips
tests, that's welcome, but for now we'll have to do without.

I would like to keep testing on big-endian. s390x is big-endian, right?
And so is powerpc64 (non -el). So it would be nice to configure cross
tests on one of those platforms configured with --disable-assembler, to
test portability of the C code. Are s390x cross tools and qemu-user in
good enough shape (it's an official debian release arch), or is
powerpc64 a better option?
 
Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Build problem on ppc64be + musl

2021-08-22 Thread Niels Möller
Maamoun TK  writes:

> That's right, in little-endian systems I got "#define _CALL_ELF 2" while in
> big-endian ones that value is 1 except when using musl.

That's good.

> I've updated the
> patch in the branch
> https://git.lysator.liu.se/mamonet/nettle/-/tree/ppc64_musl_fix to exploit
> this distinction.

I've tried a different approach on branch
https://git.lysator.liu.se/nettle/nettle/-/tree/ppc64-efv2-check. Patch
below. (It makes sense to me to have the new check together with the ABI
check, but on second thought, probably a mistake to overload the ABI
variable. It would be better to have a separate configure variable, more
similar to the W64_ABI).

Unfortunately, the CI cross builds aren't working at the moment (the
buildenv images are based on Debian Buster ("stable" at the time images
were built), and nettle's ci scripts do apt-get update and apt-get
install, which now attempts to get Bullseye packages (new "stable" since
a week ago)).

Regards,
/Niels

diff --git a/config.m4.in b/config.m4.in
index d89325b8..2ac19a84 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -5,6 +5,7 @@ define(`COFF_STYLE', `@ASM_COFF_STYLE@')dnl
 define(`TYPE_FUNCTION', `@ASM_TYPE_FUNCTION@')dnl
 define(`TYPE_PROGBITS', `@ASM_TYPE_PROGBITS@')dnl
 define(`ALIGN_LOG', `@ASM_ALIGN_LOG@')dnl
+define(`ABI', `@ABI@')dnl
 define(`W64_ABI', `@W64_ABI@')dnl
 define(`RODATA', `@ASM_RODATA@')dnl
 define(`WORDS_BIGENDIAN', `@ASM_WORDS_BIGENDIAN@')dnl
diff --git a/configure.ac b/configure.ac
index ebec8759..0efa5795 100644
--- a/configure.ac
+++ b/configure.ac
@@ -353,8 +353,15 @@ case "$host_cpu" in
 ], [], [
   ABI=32
 ], [
-  ABI=64
-])
+  AC_TRY_COMPILE([
+#if _CALL_ELF == 2
+#error ELFv2 ABI
+#endif
+  ], [], [
+   ABI=64v1
+  ], [
+   ABI=64v2
+  ])])
 ;;
   aarch64*)
 AC_TRY_COMPILE([
@@ -514,7 +521,7 @@ if test "x$enable_assembler" = xyes ; then
   fi
   ;;
 *powerpc64*)
-  if test "$ABI" = 64 ; then
+  if test "$ABI" != 32 ; then
GMP_ASM_POWERPC_R_REGISTERS
asm_path="powerpc64"
if test "x$enable_fat" = xyes ; then
@@ -1032,6 +1039,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ABI)
 AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 AC_SUBST(ASM_X86_ENDBR)
diff --git a/powerpc64/machine.m4 b/powerpc64/machine.m4
index 187a49b8..60c7465d 100644
--- a/powerpc64/machine.m4
+++ b/powerpc64/machine.m4
@@ -1,7 +1,7 @@
 define(`PROLOGUE',
 `.globl C_NAME($1)
 DECLARE_FUNC(C_NAME($1))
-ifelse(WORDS_BIGENDIAN,no,
+ifelse(ABI,64v2,
 `ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 C_NAME($1):
 addis 2,12,(.TOC.-C_NAME($1))@ha
@@ -17,7 +17,7 @@ ifdef(`FUNC_ALIGN',`.align FUNC_ALIGN')
 undefine(`FUNC_ALIGN')')
 
 define(`EPILOGUE',
-`ifelse(WORDS_BIGENDIAN,no,
+`ifelse(ABI,64v2,
 `.size C_NAME($1), . - C_NAME($1)',
 `.size .C_NAME($1), . - .C_NAME($1)
 .size C_NAME($1), . - .C_NAME($1)')')


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Build problem on ppc64be + musl

2021-08-19 Thread Niels Möller
Maamoun TK  writes:

> config.guess detects the C standard library based on a result from the
> compiler defined in "CC_FOR_BUILD" variable, for some reason OpenWrt build
> system failed to set that variable properly, from your config.log I can see
> CC_FOR_BUILD='gcc -O -g' but when I use bare musl tools I get
> CC_FOR_BUILD='musl-gcc'

In Nettle's Makefiles, CC_FOR_BUILD is intended to be a compiler
targetting the *build* system, used to compile things like eccdata.c
that are run on the build system as part of the build. It's intended to
be different from CC when cross compiling.

Not entirely sure how CC_FOR_BUILD is used in config.guess, but I think
it is used to detect the system type of the build system.

> There is nothing specific in the output of powerpc64-openwrt-linux-musl-gcc
> -E -dM log as I can see. In musl libc FAQ, they stated that there is no
> __MUSL__ in the preprocessor macros https://wiki.musl-libc.org/faq.html

The interesting thing I see is 

#define _CALL_ELF 2

I hope this can be used to distinguish from other big-endian systems,
that use ELFv1 abi?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize SHA1 with fat build support

2021-08-18 Thread Niels Möller
Maamoun TK  writes:

> What is x86/sha1-compress.nlms? How can I implement nettle_copmress_n
> function for that particular type?

That's an input file for an obscure "loop mixer" tool, IIRC, it was
written mainly by David Harvey for use with GMP loops. This tool tries
permuting the instructions of an assembly loop, taking dependencies into
account, benchmarks each variant, and tries to find the fastest
instruction sequence. It seems I tried this toool on x86 sha1_compress
back in 2009, on an AMD K7, and it gave a 17% speedup at the time,
according to commit message for 1e757582ac7f8465b213d9761e17c33bd21ca686.

So you can just ignore this file. And you may want to look at the more
readable version of x86/sha1_compress.asm, just before that commit.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Build problem on ppc64be + musl

2021-08-17 Thread Niels Möller
David Edelsohn  writes:

> Musl Libc does not support ELFv1, so I don't understand how this
> configuration is possible.

If I understood the original report, musl always uses ELFv2 abi, for
both little and big endian configurations. Which for big endian is
incompatible with the way powerpc64 assembly is configured in nettle.

Nettle assembly files currently use ELFv2 on little endian, but always
uses ELFv1 on big endian.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Build problem on ppc64be + musl

2021-08-17 Thread Niels Möller
Maamoun TK  writes:

> Forcing ELFv2 abi doesn't work for big-endian mode as this mode has no
> support for ELFv2. ppc64 linux big-endian is deprecated, it's not unexpected
> to get such issues. Dropping big-endian support for powerpc could be an
> option to solve this issue but that will be a drawback for AIX (BE) systems.

The configuration where it didn't work was
powerpc64-openwrt-linux-musl. I'd like Nettle to work on embedded
systems whenever practical. But support depends on assistance from users
of those systems.

As I understood it, this system needs to use the v2 ABI. I would hope
it's easy to detect the abi used by the configured C compiler, and then
select the same prologue sequence as is currently used for
little-endian. I.e., one more configure test, and changing the
"ifelse(WORDS_BIGENDIAN,no," condition in powerpc64/machine.m4 to check
a different configure variable.

I don't know how the linker detected abi incompatibility (ld error message
like "gcm-hash.o: ABI version 1 is not compatible with ABI version 2
output"), if that's based just on the presence of the special ".opd"
section, or if there are other attributes in the ELF file, and if so,
how the assembler decides which attributes to attach.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize SHA1 with fat build support

2021-08-10 Thread Niels Möller
Maamoun TK  writes:

> I made a merge request in the main repository that optimizes SHA1 for s390x
> architecture with fat build support !33
> <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33>.

Regarding the discussion on
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33#note_10005:
It seems the sha1 instructions on s390x are fast enough that the
overhead of loading constants, and loading and storing the state, all
per block, is a significant cost.

I think it makes sense to change the internal convention for
_sha1_compress so that it can do multiple blocks. There are currently 5
assembly implementations that would need updating: arm/v6, arm64/crypto, x86,
x86_64 and x86_64/sha_ni. And the C implementation, of course.

If it turns out to be too large a change to do them all at once, one
could introduce some new _sha1_compress_n function or the like, and use
when available. Actually, we probably need to do that anyway, since for
historical reasons, _nettle_sha1_compress is a public function, and needs
to be kept (as just a simple C wrapper) for backwards compatibility.
Changing it incrementally should be doable but a bit hairy.
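
A rough sketch of what such a pair could look like (the _n signature and
argument order here are purely illustrative, not a decided interface):

  /* Hypothetical internal multi-block entry point. */
  const uint8_t *
  _nettle_sha1_compress_n (uint32_t *state, size_t blocks,
                           const uint8_t *input);

  /* Existing public single-block function kept as a thin wrapper. */
  void
  _nettle_sha1_compress (uint32_t *state, const uint8_t *data)
  {
    _nettle_sha1_compress_n (state, 1, data);
  }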

There are some other similar compression functions with
assembly implementation, for md5, sha256 and sha512. But there's no need
to change them all at the same time, or at all.

Regarding the MD_UPDATE macro, that one is defined in the public header
file macros.h (which in retrospect was a mistake). So it's probably best
to leave it unchanged. New macros for the new convention should be put
into some internal header, e.g., md-internal.h.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Is there an equivalent to curve25519_mul for ECC keys?

2021-08-10 Thread Niels Möller
Nicolas Mora  writes:

> I'm wondering if there is a function or a combination of functions to
> perform a DH computation using ECC keys and their parameters "struct
> ecc_point *pub1, struct ecc_scalar *key2"?

ecc_point_mul (declared in ecc.h) is intended to do that. There's also
a variant ecc_point_mul_g.

But it seems they're not properly documented in the manual.
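
A small sketch of the intended use (I believe this matches the
declarations in ecc.h, but double-check against the header):

  #include <nettle/ecc.h>

  /* shared = key2 * pub1, the DH-style computation asked about. */
  static void
  ecdh_shared_point (struct ecc_point *shared,
                     const struct ecc_point *pub1,
                     const struct ecc_scalar *key2)
  {
    ecc_point_mul (shared, key2, pub1);
  }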

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Fat build support for AES and GHASH

2021-07-24 Thread Niels Möller
Maamoun TK  writes:

> I've applied for your change requests. I think we're ready to merge the
> s390x branch at this point, let me know if there are conflicts with the
> master branch tho.

Merged to master branch now! Had to commit some minor fixes to make
"make dist" and the s390x ci build work, and added a brief ChangeLog
entry for latest additions.

For the memxor merge requests, it would be good to retarget to the
master branch (but I'm not sure how to do that in gitlab).

Regards,
/Niels

> regards,
> Mamone

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Fat build support for AES and GHASH

2021-07-21 Thread Niels Möller
Maamoun TK  writes:

> I've applied for your change requests. I think we're ready to merge the
> s390x branch at this point, let me know if there are conflicts with the
> master branch tho.

Fixes merged, thanks! I'll try out merging the s390x branch into
master(-updates), to see if there are any difficulties.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize GHASH

2021-07-17 Thread Niels Möller
Maamoun TK  writes:

> You are right, modern operating systems are supposed to have this
> functionality but accessing some program's memory is pretty easy nowadays,
> I think it's a good practice to clean behind the cipher functions for what
> it makes sense and whenever possible.

I think it's futile to try to do that thoroughly, e.g., code generated
by the compiler will not clear each stack frame on return (and I'm not
even aware of any compiler option to generate code like that). We have to
trust the operating system (where as usual, "trust" can also be read as
"depend on").

For the specific case of key material, it might make sense to go to a
little extra effort to not leave copies in memory, but other neetle code
doesn't do that.

> In another topic, I've optimized the SHA-512 algorithm for arm64
> architecture but it turned out all CFarm variants don't support SHA-512
> crypto extension so I can't do any performance or correctness testing for
> now. Do you know any CFarm alternative that supports SHA-512 and SHA3
> extensions for arm64 architectures?

Can you do correctness tests on qemu? (I've been using a crosscompiler
and qemu-user to test other ARM code, and that's also what the ci tests
do).

I have access to the systems listed on
https://gmplib.org/devel/testsystems, is any of those applicable? The
arm64 machines available includes one Cortex-A73 and one Apple M1.

Regards,
/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Fat build support for AES and GHASH

2021-07-17 Thread Niels Möller
Maamoun TK  writes:

> I created a MR !31
> <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/31> that adds
> fat build support of AES and GHASH for S390x architecture. The MR's
> description has a brief overview of the modifications done to add the fat
> build support.

Merged, thanks! I wrote some comments asking for two followup changes
(avoid inline asm, and setting of FAT_TEST_LIST).

Do you think we're getting ready to merge the s390x branch to master?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [Aarch64] Optimize AES

2021-07-17 Thread Niels Möller
Maamoun TK  writes:

> I made this patch operate AES ciphering with fixed key sizes of 128-bit,
> 192-bit, and 256-bit, in this case I eliminated the loading process of key
> expansion for every round. Since this technique produces performance
> benefits, I'm planning to keep the implementation as is and in case
> handling uncommon key size is mandatory, I can append additional branch to
> process message blocks with any key size. What do you think?

There's no need to support non-standard key sizes. _nettle_aes_encrypt
should only ever be called with one of the constants _AES128_ROUNDS,
_AES192_ROUNDS, _AES256_ROUNDS as the first argument.

I think it's becoming clearer that we should make assembly for
_nettle_aes_encypt optional, in favor of separate entry points for
aes{128,192,256}_{en,de}crypt. I think you or I had an experimental
branch to do that.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize GHASH

2021-07-09 Thread Niels Möller
Maamoun TK  writes:

> My concern is if the program
> terminates then the operation system will deallocate the program's stack
> without clearing its content so that leftover data will remain somewhere at
> the RAM which could be a subject for a memory allocation or dumbing by
> other programs.

I think the kernel is responsible for clearing that memory before
handing it out to a new process. If it didn't, that would be a huge
security problem. I'm fairly sure operating systems do this correctly.
(And I would be a bit curious to know of any exceptions, maybe some
embedded or ancient systems don't do it?)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize GHASH

2021-07-08 Thread Niels Möller
Maamoun TK  writes:

> Any update on this patch? I think we have reached the merging stage of this
> patch if there are no further queries.

Merged, thanks!

>> I'm thinking it's also worth it to wipe the authentication tag and the
>> leftover bytes of input data from the stack. Leaving out the output
>> authentication tag in the stack is never a good idea and in case of
>> processing AAD the input data is left in the clear so leaving leftover
>> bytes in the stack may reveal potential secret data. I've pushed another
>> commit to wipe the whole parameter block content (authentication tag and
>> hash subkey) and the leftover bytes of input data.

Other nettle functions don't do that; it's generally assumed that the
running program is trustworthy, and that the operating system protects
the data from non-trustworthy processes. I think using encrypted swap
(using an ephemeral key destroyed on shutdown) is a good idea.

To me, it makes some sense for nettle to wipe the copy of the key (since
the application might wipe the context struct and expect no copies to
remain), but probably overkill for the other data. But it shouldn't hurt
either.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [AArch64] Fat build support for SHA-256 compress

2021-07-05 Thread Niels Möller
Maamoun TK  writes:

> I made a merge request that adds fat build support for SHA-256 compress
> function !29 <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/29>

Thanks, merged!

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [AArch64] Optimize SHA-256 compress

2021-07-01 Thread Niels Möller
Maamoun TK  writes:

> I made a merge request !28
> <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/28> to the
> 'arm64-sha1' branch that optimizes SHA-256 compress function, I've added a
> brief description of the patch in addition to benchmark numbers in the MR
> description. A patch for fat build support will be followed in another
> merge request.

Thanks, merged now!

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [S390x] Optimize GHASH

2021-06-30 Thread Niels Möller
Maamoun TK  writes:

> I made a merge request !26
> <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/26> that
> optimizes the GHASH algorithm for S390x architecture.

Nice! I've added a few comments in the mr.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [Aarch64] Fat build support for SHA1 compress

2021-06-30 Thread Niels Möller
Maamoun TK  writes:

> This patch added fat build support SHA1 compress function using the regular
> HWCAP features.

Thanks, merged to the arm64-sha1 branch for testing. 

The patch in the email didn't apply cleanly; there was some breakage
with added newline characters etc. Maybe try as an attachment next time
(or create a merge request).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: ANNOUNCE: Nettle-3.7.3

2021-06-08 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I've prepared a new bug-fix release of Nettle, a low-level
> cryptographic library, to fix bugs in the RSA decryption functions. The
> bugs cause crashes on certain invalid inputs, which could be used
> for denial of service attacks on applications using these functions.

I forgot to reference the CVE id allocated for this problem:
CVE-2021-3580 (at the moment still in the "reserved" state). Thanks to
Simo Sorce and Redhat for that registration.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


ANNOUNCE: Nettle-3.7.3

2021-06-07 Thread Niels Möller
I've prepared a new bug-fix release of Nettle, a low-level
cryptographic library, to fix bugs in the RSA decryption functions. The
bugs cause crashes on certain invalid inputs, which could be used
for denial of service attacks on applications using these functions.
More details in NEWS file below.

Upgrading is strongly recommended.

The Nettle home page can be found at
https://www.lysator.liu.se/~nisse/nettle/, and the manual at
https://www.lysator.liu.se/~nisse/nettle/nettle.html.

The release can be downloaded from

  https://ftp.gnu.org/gnu/nettle/nettle-3.7.3.tar.gz
  ftp://ftp.gnu.org/gnu/nettle/nettle-3.7.3.tar.gz
  https://www.lysator.liu.se/~nisse/archive/nettle-3.7.3.tar.gz

Regards,
/Niels

NEWS for the Nettle 3.7.3 release

This is a bugfix release, fixing bugs that could make the RSA
decryption functions crash on invalid inputs.

Upgrading to the new version is strongly recommended. For
applications that want to support older versions of Nettle,
the bug can be worked around by adding a check that the RSA
ciphertext is in the range 0 < ciphertext < n, before
attempting to decrypt it.

Thanks to Paul Schaub and Justus Winter for reporting these
problems.

The new version is intended to be fully source and binary
compatible with Nettle-3.6. The shared library names are
libnettle.so.8.4 and libhogweed.so.6.4, with sonames
libnettle.so.8 and libhogweed.so.6.

Bug fixes:

* Fix crash for zero input to rsa_sec_decrypt and
  rsa_decrypt_tr. Potential denial of service vector.

* Ensure that all of rsa_decrypt_tr and rsa_sec_decrypt return
  failure for out of range inputs, instead of either crashing,
  or silently reducing input modulo n. Potential denial of
  service vector.

* Ensure that rsa_decrypt returns failure for out of range
  inputs, instead of silently reducing input modulo n.

* Ensure that rsa_sec_decrypt returns failure if the message
  size is too large for the given key. Unlike the other bugs,
  this would typically be triggered by invalid local
  configuration, rather than by processing untrusted remote
  data.

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.





Re: [Aarch64] Optimize SHA1 Compress

2021-06-01 Thread Niels Möller
Maamoun TK  writes:

>> Great speedup! Any idea why openssl is still slightly faster?
>>
>
> Sure, OpenSSL implementation uses a loop inside the SHA1 update function which
> eliminates the constant initialization and state loading/storing for each
> block while nettle does that for every block iteration.

I see, that can make a difference if the actual compressing is fast
enough.

> Modifying the message words in-place will change the value used by
> 'sha1su0' and 'sha1su1' instructions. According to ARM® A64 Instruction Set
> Architecture:
> SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
> <Vd> Is the name of the SIMD source and destination register
> .
> .
>
> SHA1SU1 <Vd>.4S, <Vn>.4S
> <Vd> Is the name of the SIMD source and destination register
> .
> .
>
> So using TMP variable is necessary here. I can't think of any replacement,
> let me know how the other implementations handle this case.

I'm afraid I have no concrete suggestion, I would need to read up on the
aarch64 instructions. Implementations that do only a single round at a
time (e.g., the C implementation) use a 16-word circular buffer for the
message expansion state, and update one of the words per round. If I
read the latest patch correctly, you also don't keep any state besides
the MSGx registers?
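
For reference, the circular-buffer expansion amounts to something like
this sketch (illustrative C, not Nettle's actual code; i is the round
number, i >= 16):

  #include <stdint.h>

  #define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

  /* One step of SHA-1 message expansion, keeping only a 16-word
     circular buffer and updating a single word per round. Note that
     w[(i-16) % 16] is the same slot as w[i % 16]. */
  static uint32_t
  sha1_expand (uint32_t w[16], unsigned i)
  {
    uint32_t x = w[(i - 3) & 15] ^ w[(i - 8) & 15]
               ^ w[(i - 14) & 15] ^ w[i & 15];
    return w[i & 15] = ROTL32 (1, x);
  }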

> It would be nice to either make the TMP registers more temporary (i.e.,
>> no round depends on the value in these registers from previous rounds)
>> and keep needed state only on the MSG variables. Or rename them to give
>> a better hint on how they're used.
>>
>
> Done! Yield a slight performance increase btw.

Nice.

> We can load all the constants (including duplicate values) from memory with
> one instruction. The issue is how to get the data address properly for
> every supported abi!

> the easiest solution is to define
> the data in the .text section to make sure the address is near enough to be
> loaded with certain instruction. Do you want to do that?

Using .text would probably work, even if it's in some sense more correct
to put the constants in the rodata segment. But let's leave it as is for now.

>  We have an intensive discussion about that in the GCM patch. The short
> story, this patch should work well for both endianness modes.

Sounds good.

I've pushed the combined patches to a branch arm64-sha1. Would you like
to update the fat build setup, before merging to master?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-05-23 Thread Niels Möller
Nicolas Mora  writes:

> I've added test cases to verify that unwrap fail if the input values
> are incorrect [1]. I reuse all the unwrap test cases, changed one
> ciphertext byte and expect the unwrap function to return 0.

I've merged the latest version of
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/19 to the
master-updates branch, with some minor changes (moved function typedefs
out of nettle-types.h, and indentation fixes to nist-keywrap.h).

Thanks for your contribution and patience.

>> Or possibly under "7.3 Cipher modes", if it's too different from the
>> AEAD constructions.
>>
> Until we come to a solution on where to put the documentation, I've
> started a first draft for the documentation. Can you give me feedback
> on it?

I think putting it under cipher modes probably makes the most sense.

The function reference looks good, it doesn't have to be a lot of text.
Please spell "cipher" consistently, not "cypher".

In the introduction, you write "Its intention is to provide an algorithm
to wrap and unwrap cryptographic keys.". Is it possible to give a bit
more details, some guidance on when it is a good idea to use this key
wrapping rather than a more general AEAD algorithm? If there's some
interesting background, or examples of protocols that use aes keywrap,
that could also go here.

I think it would also be nice to clarify that the spec defines the key
wrapping as an aes-specific mode, but Nettle's implementation supports
any block cipher with a block size of 16 bytes.

> Also, I've never used LaTex. What tool do you recommend to write LaTex
> documentation? I've tried gummi but it says there are errors in the
> nettle.texinfo file...

Texinfo is not quite the same as LaTeX, even if it uses the same TeX
machinery for the typeset pdf version.

Manual is here:
https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html, but I
think you can mostly go by the examples elsewhere in the Nettle manual,
and check the docs only for the markup you need. You probably need to
grasp the @node thing, though. See
https://www.gnu.org/software/texinfo/manual/texinfo/texinfo.html#Writing-a-Node
(the nettle manual uses the old-fashioned way with explicit node links).

I edit it in emacs, like any other file.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Aarch64] Optimize SHA1 Compress

2021-05-23 Thread Niels Möller
Maamoun TK  writes:

Looks pretty good. A few comments and questions below.

> This patch optimizes SHA1 compress function for arm64 architecture by
> taking advantage of SHA-1 instructions of Armv8 crypto extension.
> The SHA-1 instructions:
> SHA1C: SHA1 hash update (choose)
> SHA1H: SHA1 fixed rotate
> SHA1M: SHA1 hash update (majority)
> SHA1P: SHA1 hash update (parity)
> SHA1SU0: SHA1 schedule update 0
> SHA1SU1: SHA1 schedule update 1

Can you add this brief summary of instructions as a comment in the asm
file?

> Benchmark on gcc117 instance of CFarm before applying the patch:
>  Algorithm       mode      Mbyte/s
>  sha1            update    214.16
>  openssl sha1    update    849.44

> Benchmark on gcc117 instance of CFarm after applying the patch:
>  Algorithm       mode      Mbyte/s
>  sha1            update    795.57
>  openssl sha1    update    849.25

Great speedup! Any idea why openssl is still slightly faster?

> +define(`TMP0', `v21')
> +define(`TMP1', `v22')

Not sure I understand how these are used, but it looks like the TMP
variables are used in some way for the message expansion state? E.g.,
TMP0 assigned in the code for rounds 0-3, and this value used in the
code for rounds 8-11. Other implementations don't need extra state for
this, but just modify the 16 message words in-place.

It would be nice to either make the TMP registers more temporary (i.e.,
no round depends on the value in these registers from previous rounds)
and keep needed state only on the MSG variables. Or rename them to give
a better hint on how they're used.

> +C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
> +
> +PROLOGUE(nettle_sha1_compress)
> +C Initialize constants
> +mov     w2,#0x7999
> +movk    w2,#0x5A82,lsl #16
> +dup     CONST0.4s,w2
> +mov     w2,#0xEBA1
> +movk    w2,#0x6ED9,lsl #16
> +dup     CONST1.4s,w2
> +mov     w2,#0xBCDC
> +movk    w2,#0x8F1B,lsl #16
> +dup     CONST2.4s,w2
> +mov     w2,#0xC1D6
> +movk    w2,#0xCA62,lsl #16
> +dup     CONST3.4s,w2

Maybe it would be clearer or more efficient to load these from memory? Not
sure if there's a nice and concise way to load the four 32-bit values
into a 128-bit register, and then copy/duplicate them into the four const
registers.

> +C Load message
> +ld1     {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
> +
> +C Reverse for little endian
> +rev32  MSG0.16b,MSG0.16b
> +rev32  MSG1.16b,MSG1.16b
> +rev32  MSG2.16b,MSG2.16b
> +rev32  MSG3.16b,MSG3.16b

How does this work on big-endian? The ld1 with .16b is endian-neutral
(according to the README), that means we always get the wrong order, and
then we do unconditional byteswapping? Maybe add a comment. Not sure if
it's worth the effort to make it work differently (ld1 .4w on
big-endian)? It's going to be a pretty small fraction of the per-block
processing.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-05-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> We could either switch it on by default in configure.ac, or add a
> configure flag in .gitlab-ci.

Just pushed a change to .gitlab-ci to pass --enable-s390x-msa, and it
seems to work, see

https://gitlab.com/gnutls/nettle/-/jobs/1284895250#L580

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [RFC PATCH 0/6] Introduce combined AES-GCM assembly for POWER9+

2021-05-20 Thread Niels Möller
"Christopher M. Riedl"  writes:

> So in total, if we assume an ideal (but impossible) zero-cost version
> for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector
> load/stores we can only account for 11.82 cycles/block; leaving 4.97
> cycles/block as an additional benefit of the combined implementation.

One hypothesis for that gain is that we can avoid storing the aes input
in memory at all, and instead generate the counter values on the fly in
the appropriate registers.

>> Another potential overhead is that data is stored to memory when passed
>> between these functions. It seems we store a block 3 times, and load a
>> block 4 times (the additional accesses should be cache friendly, but
>> will still cost some execution resources). Optimizing that seems to need
>> some kind of combined function. But maybe it is sufficient to optimize
>> something a bit more general than aes gcm, e.g., aes ctr?
>
> This would basically have to replace the nettle_crypt16 function call
> with arch-specific assembly, right? I can code this up and try it out in
> the context of AES-GCM.

Yes, something like that. If we leave the _nettle_gcm_hash unchanged
(with its own independent assembly implementation), and look at
gcm_encrypt, what we have is

  const void *cipher, nettle_cipher_func *f,

  _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);

It would be nice if we could replace that with a call to aes_ctr_crypt,
and then optimizing that would benefit both gcm and plain ctr. But it's
not quite that easy, because gcm unfortunately uses its own variant of
ctr mode, which is why we need to pass the gcm_fill function in the
first place.

So suppose we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they
*might* still share some code, but they would be distinct entry points),
and say we call the gcm-specific ctr function from some variant of
gcm_encrypt via a different function pointer. Then that gcm_encrypt
variant is getting a bit pointless. Maybe it's better to do

  void aes128_gcm_encrypt(...)
  {
_nettle_aes128_gcm_ctr(...);
_nettle_gcm_hash(...);
  }

At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256
(and any other algorithms we might want to optimize in a similar way).
And each of the aes assembly routines should be fairly small and easy to
maintain. 

I wonder if there are any reasonable alternatives with similar
performance? One idea that occurs to me is to replace the role of the
gcm_fill function (and the nettle_fill16_func type) with an
arch-specific assembly-only hook interface that gets inputs in specified
registers, and is expected to produce the next cipher input in
registers.

We could then have a aes128_any_encrypt that takes the same args as
aes128_encrypt + a pointer to such a magic assembly function.

The aes128_any_encrypt assembly would then put required input in the
right registers (address of clear text, current counter block, previous
ciphertext block, etc) and have a loop where each iteration calls the
hook, and encrypts a block from registers.

But I'm afraid it's not going to be so easy, given that where possible
(i.e., all modes but cbc encrypt) we'd like to have the option to do
multiple blocks in parallel. Perhaps better to have an assembly
interface to functions doing ECB on one block, two blocks, three blocks
(if there is a sufficient number of registers), etc, in registers, and
call that from the other assembly functions. A bit like the recent
chacha_Ncore functions, but with input and output in registers
rather than stored in memory.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Aarch64] Optimize SHA1 Compress

2021-05-20 Thread Niels Möller
Maamoun TK  writes:

> I've written the patch from scratch while keeping in mind how to use the
> SHA-1 instructions of Arm64 crypto extension from sha1-arm.c in Jeffrey's
> repository.

If that is the case, avoid phrases like "based on" which are easily
misread as implying it's a derived work in the copyright sense.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-05-19 Thread Niels Möller
Maamoun TK  writes:

> Did you get the credentials of the new VM? 

Yes! I set it up and updated the gitlab config last evening, and I've
seen a successful ci run.

> I'm thinking after adding the
> address and ssh key of new VM, we can't get the optimized cores of AES
> tested since enable-msa isn't triggered. We need to push some sort of
> hard-coded option in configure.ac to get it tested in the VM during ci job.

We could either switch it on by default in configure.ac, or add a
configure flag in .gitlab-ci.

Once fat build support is added, we will no longer need to enable it
explicitly, right?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: S390x other modes and memxor

2021-05-09 Thread Niels Möller
Maamoun TK  writes:

> This is great information that I can keep in my memory for next
> implementations. s390x arch offers 'xc' instruction "Storage-to-storage
> XOR" at maximum length of 256 bytes but we can do as many iterations as we
> need. I optimized memxor using that instruction as it achieves the optimal
> performance for such case, I'll attach the patch at the end of
> message.

Nice! I'd like to merge this as soon as the s390x ci is up and running
again.

> Unfortunately, I couldn't manage to optimize memxor3 using 'xc' instruction
> because while it supports the overlapped operands it processes them from
> left to right, one byte at a time.

Hmm, I wonder if there's some way to work around that.

> However, I think optimizing just memxor could make a good sense of how much
> it would increase the performance of AES modes. CBC mode could come in
> handy here since it uses memxor in encrypt and decrypt operations in case
> the operands of decrypt operation don't overlap. Here is the benchmark
> result of CBC mode:
>
> *----------------------------------------------------------------------*
> |                                    | AES-128 Encrypt | AES-128 Decrypt |
> |------------------------------------|-----------------|-----------------|
> | CBC-Accelerator                    | 1.18 cbp        | 0.75 cbp        |
> | Basic AES-Accelerator              | 13.50 cbp       | 3.34 cbp        |
> | Basic AES-Accelerator with memxor  | 15.50 cbp       | 1.57 cbp        |
> *----------------------------------------------------------------------*

This seems to confirm that cbc encrypt is the operation that gains the
most from assembly for the combined operation. Given that aes decrypt can
also gain a factor of two in performance, does that mean that both aes-cbc
and memxor run at a speed limited by memory bandwidth? And then the gain is
from one less pass loading and storing data from memory?

What unit is "cbp"? If it's cycles per byte, 0.77 cycles/byte for memxor
(the cost of "Basic AES-Accelerator with memxor" minus cost of
CBC-Accellerator) sounds unexpectedly slow, compared to, e.g, x86_64,
where I get 0.08 cycles per byte (regardless of alignment), or 0.64
cycles per 64-bit word.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


S390x other modes and memxor (was: Re: [S390x] Optimize AES modes)

2021-05-09 Thread Niels Möller
Maamoun TK  writes:

> On Sat, May 1, 2021 at 6:11 PM Niels Möller  wrote:
>
>> Is https://git.lysator.liu.se/nettle/nettle/-/merge_requests/23 still
>> the current code?
>>
>
> I've added the basic AES-192 and AES-256 too since there is no problem to
> test them all together.

Merged to the s390x branch now. Thanks for your patience.

For further improvement, it would be nice to have aesN_set_encrypt_key
and aesN_set_decrypt_key be two entrypoints to the same function. But that
will make the file replacement logic a bit more complex.

And maybe the public aes*_invert_key functions should be marked
as deprecated (and deleted, next time we have an abi break)? No other
ciphers in Nettle have this feature, and it's not that useful for
applications. From codesearch.debian.net, it looks like they are exposed
by the haskell and rust bindings, though.

> For the other the modes, 

Before doing the other modes, do you think you could investigate if
memxor and memxor3 can be sped up? That should benefit many ciphers
and modes, and give more relevant speedup numbers for specialized
functions like aes cbc and aes ctr.

The best strategy depends on whether or not unaligned memory access is
possible and efficient. All current implementations do aligned writes to
the destination area (and smaller writes if needed at the edges). The
C implementation and several of the asm implementations also do
aligned reads, and use shifting to get the inputs xored together at the right
places. 

While the x86_64 implementation uses unaligned reads, since that seems
as efficient, and reduces complexity quite a lot.
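
As a rough illustration of that simpler strategy (a sketch only, assuming
unaligned 64-bit access is cheap; not the actual Nettle code):

  #include <stdint.h>
  #include <string.h>

  /* Xor src into dst, one 64-bit word at a time, with a byte-wise tail.
     The memcpy calls express unaligned loads/stores; compilers typically
     turn them into plain word accesses where that is safe. */
  static void
  memxor_sketch (uint8_t *dst, const uint8_t *src, size_t n)
  {
    while (n >= 8)
      {
        uint64_t a, b;
        memcpy (&a, dst, 8);
        memcpy (&b, src, 8);
        a ^= b;
        memcpy (dst, &a, 8);
        dst += 8; src += 8; n -= 8;
      }
    while (n-- > 0)
      *dst++ ^= *src++;
  }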

On all platforms I'm familiar with, assembly implementations can assume
that it is safe to read a few bytes outside the edge of the input
buffer, as long as those reads don't cross a word boundary
(corresponding to valgrind option --partial-loads-ok=yes).

Ideally, memxor performance should be limited by memory/cache bandwidth
(with data in L1 cache probably being the most important case; it looks
like nettle-benchmark calls it with a size of 10 KB).

Note that memxor3 must process data in descending address order, to
support the call from cbc_decrypt, with overlapping operands.
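
To illustrate the overlap requirement (a trivial, non-optimized sketch):

  #include <stdint.h>
  #include <stddef.h>

  /* dst[i] = a[i] ^ b[i], walking downwards. If one input points a
     little below dst (as in in-place CBC decrypt), its bytes are read
     before the corresponding dst positions are overwritten. */
  static void
  memxor3_sketch (uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
  {
    while (n-- > 0)
      dst[n] = a[n] ^ b[n];
  }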

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-05-08 Thread Niels Möller
David Edelsohn  writes:

> Thanks for setting this up.  The default accounts have a limited time
> (90 days?).  For long-term CI access, I can help request a long-term
> account for Nettle.

Hi, I set up the s390x vm for Nettle ci tests late March. What
information do you need to arrange an extension to long-term access, so
it doesn't expire?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-05-01 Thread Niels Möller
Maamoun TK  writes:

> Hi Niels, hope you are doing well now
> Any update on this patch?

Thanks, I'm feeling a lot better, although still a bit tired.

Is https://git.lysator.liu.se/nettle/nettle/-/merge_requests/23 still
the current code?

I hope to be back to reviewing pending patches soon, but I also got a
fairly serious bug report a few days ago that I need to attend to first.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [RFC PATCH 0/6] Introduce combined AES-GCM assembly for POWER9+

2021-04-05 Thread Niels Möller
"Christopher M. Riedl"  writes:

> An implementation combining AES+GCM _can potentially_ yield significant
> performance boosts by allowing for increased instruction parallelism, avoiding
> C-function call overhead, more flexibility in assembly fine-tuning, etc. This
> series provides such an implementation based on the existing optimized Nettle
> routines for POWER9 and later processors. Benchmark results on a POWER9
> Blackbird running at 3.5GHz are given at the end of this mail.

Benchmark results are impressive. If I get the numbers right, cycles per
block (16 bytes) is reduced from 40 to 22.5. You can run
nettle-benchmark with the flag -f 3.5e9 (for 3.5GHz clock frequency) to
get cycle numbers in the output.

I'm a bit conservative about adding assembly code for combined
operations, since it can lead to an explosion in the amount of code to
maintain. So I'd like to understand a bit better where the 17.5 saved
cycles were spent. For the code on master, gcm_encrypt (with aes) is built from
these building blocks:

  * gcm_fill

C code, essentially 2 64-bit stores per block. On little endian, it
also needs some byte swapping (a rough sketch of what this step does
follows after this list).

  * aes_encrypt

Using power assembly. Performance measured as the "aes128  ECB
encrypt" line in nettle-benchmark output.

  * memxor3

This is C code on power (and rather hairy C code). Performance can
be measured with nettle-benchmark, and it's going to be a bit
alignment dependent.

  * gcm_hash

This uses power assembly. Performance is measured as the "gcm
update" line in nettle-benchmark output. From your numbers, this
seems to be 7.3 cycles per block.
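
Regarding the gcm_fill step above, what it does per block is roughly the
following (an illustrative sketch, not Nettle's exact code; GCM's CTR
variant increments a 32-bit big-endian counter in the last four bytes of
the block):

  #include <stdint.h>
  #include <string.h>

  /* Write `blocks` consecutive 16-byte counter blocks into buffer,
     then store the updated counter back into ctr. */
  static void
  gcm_fill_sketch (uint8_t *ctr, size_t blocks, uint8_t *buffer)
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8) | ctr[15];
    size_t i;
    for (i = 0; i < blocks; i++, c++)
      {
        memcpy (buffer + 16*i, ctr, 12);
        buffer[16*i + 12] = c >> 24;
        buffer[16*i + 13] = c >> 16;
        buffer[16*i + 14] = c >> 8;
        buffer[16*i + 15] = c;
      }
    ctr[12] = c >> 24; ctr[13] = c >> 16; ctr[14] = c >> 8; ctr[15] = c;
  }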

So before going all the way with a combined aes_gcm function, I think
it's good to try to optimize the building blocks. Please benchmark
memxor3, to see if it could benefit from an assembly implementation. If so,
that should give a nice speedup to several modes, not just gcm. (If you
implement memxor3, beware that it needs to support some overlap, to not
break in-place CBC decrypt).

Another potential overhead is that data is stored to memory when passed
between these functions. It seems we store a block 3 times, and load a
block 4 times (the additional accesses should be cache friendly, but
will still cost some execution resources). Optimizing that seems to need
some kind of combined function. But maybe it is sufficient to optimize
something a bit more general than aes gcm, e.g., aes ctr?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-04-01 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> (iii) I've considered doing it earlier, to make it easier to implement
>   aes without a round loop (like for all current versions of
>   aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load
>   all subkeys into registers and still have registers left to do two
>   or more blocks in parallel, but then we'd need to override
>   aes128_encrypt separately from the other aes*_encrypt.

I've given this a try, see experimental patch below. It adds a
x86_64/aesni/aes128-encrypt.asm, with a 2-way loop. It gives a very
modest speedup, 5%, when I benchmark on my laptop (which is now a pretty
fast machine, AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm.
That gives a more significant speedup, almost 60%. I think the main reason for
the speedup is that we avoid reloading subkeys between blocks.
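
For context, the new interface added by the patch below could be used
roughly like this (a sketch; it assumes the usual CBC_CTX layout from
cbc.h, with the cipher context in member "ctx" and the IV in member "iv";
length must be a multiple of AES_BLOCK_SIZE):

  #include "aes.h"
  #include "cbc.h"

  static void
  cbc_aes128_encrypt_example (uint8_t *dst, const uint8_t *src, size_t length,
                              const uint8_t *key, const uint8_t *iv)
  {
    struct cbc_aes128_ctx ctx;
    aes128_set_encrypt_key (&ctx.ctx, key);  /* key is AES128_KEY_SIZE bytes */
    CBC_SET_IV (&ctx, iv);                   /* iv is AES_BLOCK_SIZE bytes */
    nettle_cbc_aes128_encrypt (&ctx, length, dst, src);
  }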

If we want to go this way, I wonder how to do it without an explosion of
files and functions. For s390x, it seems each function will be very
small, but not so for most other archs. There are at least three modes
that are similar to cbc encrypt in that they have to process blocks
sequentially, with no parallelism: CBC encrypt, CMAC, and XTS (there may
be more). It's not so nice if we need (modes × ciphers) number of assembly
files, with lots of duplication.

Regards,
/Niels

diff --git a/ChangeLog b/ChangeLog
index 3d19b1dd..68b8f632 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
 2021-04-01  Niels Möller  
 
+   * cbc-aes128-encrypt.c (nettle_cbc_aes128_encrypt): New file and function.
+   * x86_64/aesni/cbc-aes128-encrypt.asm: New file.
+
+   * configure.ac (asm_replace_list): Add aes128-encrypt.asm
+   aes128-decrypt.asm.
+   * x86_64/aesni/aes128-encrypt.asm: New file, with 2-way loop.
+   * x86_64/aesni/aes128-decrypt.asm: Likewise.
+
Move aes128_encrypt and similar functions to their own files. To
make it easier for assembly implementations to override specific
AES variants.
diff --git a/Makefile.in b/Makefile.in
index 8d474d1e..b6b983fd 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -101,7 +101,8 @@ nettle_SOURCES = aes-decrypt-internal.c aes-decrypt.c aes-decrypt-table.c \
 camellia256-set-encrypt-key.c camellia256-crypt.c \
 camellia256-set-decrypt-key.c \
 camellia256-meta.c \
-cast128.c cast128-meta.c cbc.c \
+cast128.c cast128-meta.c \
+cbc.c cbc-aes128-encrypt.c \
 ccm.c ccm-aes128.c ccm-aes192.c ccm-aes256.c cfb.c \
 siv-cmac.c siv-cmac-aes128.c siv-cmac-aes256.c \
 cnd-memcpy.c \
diff --git a/cbc-aes128-encrypt.c b/cbc-aes128-encrypt.c
new file mode 100644
index ..5f7d1c8c
--- /dev/null
+++ b/cbc-aes128-encrypt.c
@@ -0,0 +1,42 @@
+/* cbc-aes128-encrypt.c
+
+   Copyright (C) 2013, 2014 Niels Möller
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+ * the GNU Lesser General Public License as published by the Free
+   Software Foundation; either version 3 of the License, or (at your
+   option) any later version.
+
+   or
+
+ * the GNU General Public License as published by the Free
+   Software Foundation; either version 2 of the License, or (at your
+   option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+*/
+
+#if HAVE_CONFIG_H
+# include "config.h"
+#endif
+
+#include "cbc.h"
+
+void
+nettle_cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src)
+{
+  CBC_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
+}
diff --git a/cbc.h b/cbc.h
index 93b2e739..beece610 100644
--- a/cbc.h
+++ b/cbc.h
@@ -35,6 +35,7 @@
 #define NETTLE_CBC_H_INCLUDED
 
 #include "nettle-types.h"
+#include "aes.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -79,6 +80,10 @@ memcpy((ctx)->iv, (data), sizeof((ctx)->iv))
 sizeof((self)->iv), (self)->iv,\
 (length), (dst), (src)))
 
+struct cbc_aes128_ctx CBC_CTX(struct aes128_ctx, AES_BLOCK_SIZE);
+void
+nettle_cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/configure.ac b/configure.ac
index be2916c1..26e41d89 100644
--- a/configure.ac
+++ b/configure.ac
@@ -544,6 +544,7 @@ fi
 # Files whic

Re: [S390x] Optimize AES modes

2021-03-31 Thread Niels Möller
Maamoun TK  writes:

>> I've tried out a split, see below patch. It's a rather large change,
>> moving pieces to new places, but nothing difficult. I'm considering
>> committing this to the s390x branch, what do you think?
>>
>
> I agree, I'll modify the patch of basic AES-128 optimized functions to be
> built on top of the splitted aes functions.

Ok, pushed to the s390x branch now.

> memxor performs the same in C and assembly since s390 architecture offers
> memory xor instruction "xc" see xor_len macro in machine.m4 of the original
> patch for an implementation example.

But the C implementation is somewhat complicated, splitting into
several cases depending on alignment, and shifting data around to be able
to do word operations. If it can be done more simply with the xc
instruction, that would at least cut some overhead. (Note that memxor3
must support the overlap case needed by cbc decrypt).

> However, s390x AES accelerators offer considerable speedup against C
> implementation with optimized internal AES. The following table
> demonstrates the idea more clearly:
>
> Function                S390x accelerator    C implementation with optimized
>                                              internal AES (Only enable
>                                              aes128.asm, aes192.asm, aes256.asm)
> ---
[...]
> CBC AES128 Decrypt      0.647008 cpb         3.131405 cpb
[...]
> CTR AES128 Crypt        0.710237 cpb         4.767290 cpb

For these two, the speed difference should essentially be the time for
the C implementation of memxor. "cpb" means cycles per byte, right? 2-4
cycles per byte for memxor is quite slow. On my x86_64 laptop (ok,
comparing apples to oranges), memxor, for the aligned case, is 0.08 cpb,
and memxor3 twice as much. And even the C implementation is not that much
slower.

> GCM AES128 Encrypt  0.630504 cpb  15.473187 cpb

For GCM, are there instructions that combine AES-CTR and GCM HASH? Or
are those done separately? It would be nice to have GCM HASH be fast
by itself, for performance with other ciphers than aes.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-03-29 Thread Niels Möller
Nicolas Mora  writes:

>> The new feature also needs documentation, will you look into that once
>> code, and in particular the interfaces, are solid?
>>
> Definitely!
> What do you think the documentation should look like? Should it be
> near paragraph 7.2.1? Like
>
> 7.2.1.1 AES Key Wrap

That's one possibility, but I think it would also be natural to put it
somewhere under or close to "7.4. Authenticated encryption and
associated data", even though there's no associated data. That section
could perhaps be retitled to "Authenticated encryption" to generalize
it?

Or possibly under "7.3 Cipher modes", if it's too different from the
AEAD constructions.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Compile issue on Solaris 11.3

2021-03-28 Thread Niels Möller
Jeffrey Walton  writes:

> --enable-fat turns on cpu identification and runtime switching. I need
> that. I need AES. I don't need SHA. It is impossible to get into a
> good configuration.

I don't think it's worthwhile to add complexity to configure and the fat
machinery, and testing thereof, to make it flexible enough for that
usecase. In your case, you need it to be able to use an assembler
missing support for instructions added to the architecture 7.5 years
ago. Are there other usecases where more flexibility would be
beneficial?

I might consider it, if someone else wants to do the work, and it turns
out it doesn't get too messy.

To get it to work in your setting, I would suggest one of:

(i) Stick to --disable-assembler, to get something that works but is
slow (and unfortunately a performance regression since nettle-3.6).

(ii) Upgrade your assembler to a version that recognizes the sha
 instructions (not sure which assembler you're using, I did ask,
 when you reported the problem back in January, but I haven't seen
 an answer). I would be a bit surprised if support for these
 instructions is still missing in recent releases of Oracle's
 development tools, if that's what you're using.

> Nettle wastes a fair amount of our time trying to work through these problems.

To be honest, high performance on the proprietary and somewhat obscure
Solaris operating system is not going to be a high priority to me, in
particular version 11.3, which will soon officially reach end of support (January
2024, according to wikipedia, curiously the same date as for the much
older Solaris 10). Correctness, which you'd get with
--disable-assembler, is considerably more important. I'm willing to help
getting Nettle to work on obscure and obsolete systems, as long as the
cost in added complexity is small.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Compile issue on Solaris 11.3

2021-03-28 Thread Niels Möller
Jeffrey Walton  writes:

>> I added --disable-x86-sha-ni and it still produces the error. How is
>> the ASM being used if it is disabled???

You need to choose *either* --enable-fat (now the default), *or* use the
explicit config options for particular instructions. Mixing is not
supported. Don't do that.

And I think this is at least the third time I point this out to you,
most recently just a few days ago. If, e.g., you deeply dislike the way
Nettle's configure works and would like it to change, your current
behavior is not a productive way of improving anything. It is annoying
me and wasting my time.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-03-27 Thread Niels Möller
Maamoun TK  writes:

>> >   only:
>> > variables:
>> > - $S390X_SSH_IP_ADDRESS
>> > - $S390X_SSH_PRIVATE_KEY
>> > - $S390X_SSH_CI_DIRECTORY
>>
>> What does this mean? Ah, it excludes the job if these variables aren't
>> set?
>>
>
> Yes, this is what it does according to gitlab ci docs
> <https://docs.gitlab.com/ee/ci/yaml/#onlyexcept-basic>. otherwise, fresh
> forks will have always-unsuccessful job.

Hmm, the docs aren't quite clear, but it doesn't seem to work as is. I
accidentally set the new S390X_ACCOUNT variable to "protected", and then
the job was started but with $S390X_ACCOUNT expanding to the empty
string, and failing. Perhaps it needs to be written as

  - $FOO != ""

instead?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-03-27 Thread Niels Möller
Maamoun TK  writes:

> Isn't it better to define S390X_SSH_IP_ADDRESS variable rather than
> hard-code the remote server address in .gitlab-ci.yml? fresh forks now need
> to update .gitlab-ci.yml to get a S390x job which is a bit unwieldy in my
> opinion.

Makes sense. I've added it as a variable, and renamed to S390X_ACCOUNT.
Value is of the form username@ip-address.

> Yes, this is what it does according to gitlab ci docs
> <https://docs.gitlab.com/ee/ci/yaml/#onlyexcept-basic>. otherwise, fresh
> forks will have always-unsuccessful job.

Ok, added a section

  only:
variables:
- $SSH_PRIVATE_KEY
- $S390X_ACCOUNT

Still on the master-updates branch, will merge as soon as the run looks
green.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: xts.c:59: warning: integer constant is too large for ‘long’ type

2021-03-25 Thread Niels Möller
Jeffrey Walton  writes:

> This is building Nettle 3.7.2 on a PowerMac with OS X 10.5:
>
> /usr/bin/cc -I. -I/usr/local/include -DNDEBUG -DHAVE_CONFIG_H -g2 -O2
> -mlong-double-64 -fno-common -maltivec -fPIC -pthread -ggdb3
> -Wno-pointer-sign -Wall -W   -Wmissing-prototypes
> -Wmissing-declarations -Wstrict-prototypes   -Wpointer-arith
> -Wbad-function-cast -Wnested-externs -fPIC -MT xts-aes128.o -MD -MP
> -MF xts-aes128.o.d -c xts-aes128.c \
> && true
> xts.c: In function ‘xts_shift’:
> xts.c:59: warning: integer constant is too large for ‘long’ type
> xts.c:59: warning: integer constant is too large for ‘long’ type
> xts.c:60: warning: integer constant is too large for ‘long’ type
> xts.c:60: warning: integer constant is too large for ‘long’ type
> xts.c:60: warning: integer constant is too large for ‘long’ type
>
> On OS X 10.5, you have to use unsigned long long and the ull suffix.

This is confusing. The xts_shift function is not in nettle-3.7.2 as far
as I can tell; it was deleted long ago in
https://git.lysator.liu.se/nettle/nettle/-/commit/685cc919a37b60d3f81dd569bf6e93ad7be0f89b.

> Maybe you should add a configure test to see whether you need the ull suffix.

The current related code uses UINT64_C for the 64-bit constants. No
configure test needed.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: bug#47222: Serious bug in Nettle's ecdsa_verify

2021-03-25 Thread Niels Möller
Ludovic Courtès  writes:

> Are there plans to make a new 3.5 release including these fixes?

No, I don't plan any 3.5.x release.

> Alternatively, could you provide guidance as to which commits should be
> cherry-picked in 3.5 for downstream distros?

Look at the branch release-3.7-fixes
(https://git.lysator.liu.se/nettle/nettle/-/commits/release-3.7-fixes/).
The commits since 3.7.1 are the ones you need.

Changes to gostdsa and ed448 will not apply, since those curves didn't
exist in nettle-3.5. Changes to ed25519 might not apply cleanly, due to
refactoring when adding ed448.

> I’m asking because in Guix, the easiest way for us to deploy the fixes
> on the ‘master’ branch would be by “grafting” a new Nettle variant
> ABI-compatible with 3.5.1, which is the one packages currently depend on.

I still recommend upgrading to the latest version. There was an abi
break in 3.6 (so you'd need to recompile lots of guix packages), but no
incompatible changes to the (source level) api.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: libhgwwed has gone missing...

2021-03-25 Thread Niels Möller
Jeffrey Walton  writes:

> It looks like Nettle is no longer building or installing hogweed on
> some Apple platforms.
>
> This is from a PowerMac G5 running OS X 10.5:

Most likely the configure check for libgmp failed. Check config.log for
details. I think the most recent change to the gmp dependency was in
nettle-3.6, which requires gmp-6.1 or later.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [S390x] Optimize AES modes

2021-03-24 Thread Niels Möller
Maamoun TK  writes:

> I managed to get the tarball approach working in gitlab ci with the
> following steps:

Thanks for the research. I've added a test job based on these ideas. See
https://git.lysator.liu.se/nettle/nettle/-/commit/c25774e230985a625fa5112f3f19e03302e49e7f.
An almost identical setup was run successfully as
https://gitlab.com/gnutls/nettle/-/jobs/1125145345.

> - In gitlab go to settings -> CI / CD. Expand Variables and add the
> following variables:
>
>- S390X_SSH_IP_ADDRESS: username@instance_ip
>- S390X_SSH_PRIVATE_KEY: private key of ssh connection
>- S390X_SSH_CI_DIRECTORY: name of directory in remote server where the
>tarball is extracted and tested

I made only the private key a variable (and of type "file", which means
it's stored in a temporary file, with file name in $SSH_PRIVATE_KEY).
The others are defined in the .gitlab-ci.yml file.

> - Update gitlab-ci.yml as follows:
>
>- Add this line to variables category at the top of file:
>
>   DEBIAN_BUILD: buildenv-debian

I used the same fedora image as for the simpler build jobs.

>   script:
>   - tar --exclude=.git --exclude=gitlab-ci.yml -cf - . | ssh
> $S390X_SSH_IP_ADDRESS "cd $S390X_SSH_CI_DIRECTORY/$CI_PIPELINE_IID && tar
> -xf - &&

I'm using ./configure && make dist instead, so we get a bit of testing of
that too. On the remote side, the directory name is based on
$CI_PIPELINE_IID, which seems to be a good way to get one directory per job.

>   only:
> variables:
> - $S390X_SSH_IP_ADDRESS
> - $S390X_SSH_PRIVATE_KEY
> - $S390X_SSH_CI_DIRECTORY

What does this mean? Ah, it excludes the job if these variables aren't
set?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: [AArch64] Fat build support for GCM optimization and syntax improvements

2021-03-22 Thread Niels Möller
Maamoun TK  writes:

> I made a merge request #21
> <https://git.lysator.liu.se/nettle/nettle/-/merge_requests/21> that adds
> fat build support for GCM implementation on arm64, the patch also updates
> the README file to stay on par with the other architectures and use m4
> macros in gcm-hash.asm (patch provided by Niels Möller), in addition to add
> documentation comments.

Thanks! Merged to master-updates, for testing.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Nettle 3.7.2 and OS X 10.12.6

2021-03-22 Thread Niels Möller
Jeffrey Walton  writes:

> And it looks like examples are not quite working either:
>
> $ make check
> ...
> 
> All 110 tests passed
> 
> Making check in examples
> TEST_SHLIB_DIR="/Users/jwalton/Build-Scripts/nettle-3.7.2/.lib" \
>   srcdir="." EMULATOR="" EXEEXT="" \
>   ".."/run-tests rsa-sign-test rsa-verify-test rsa-encrypt-test
> Opening `testkey' failed: No such file or directory
> Invalid key
> FAIL: rsa-sign
> Opening `testkey' failed: No such file or directory
> Invalid key
> FAIL: rsa-verify
> Opening `testkey.pub' failed: No such file or directory
> Invalid key
> FAIL: rsa-encrypt
> ===
> 3 of 3 tests failed
> ===
> make[1]: *** [check] Error 1
> make: *** [check] Error 2
>
> $ find . -name testkey.pub
> $ find . -name testkey

My best guess is that your operating system fails to regard the scripts
examples/setup-env and teardown-env as executable (similarly to the
main run-tests script). The setup-env script is supposed to create those
files.

The executability-bit that is set on certain files in the tarball must
be honored for the build to work correctly. Please do whatever it takes
to convince your build environment to do that.

> Examples have been breaking the build for years. Why are examples even
> built during 'make check'?

The tests that are failing for you act as a kind of integration-level
test for the library. I think that has some value.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Nettle 3.7.2 and OS X 10.5

2021-03-22 Thread Niels Möller
Jeffrey Walton  writes:

> I enabled Altivec builds with
> --enable-power-altivec and --enable-fat.

Don't do that. As I've tried to explain before, that combination makes
no sense. --enable-power-altivec means "unconditionally use the altivec
code". --enable-fat (now the default) means "let the fat setup code
determine at runtime if altivec (and other) features should be used".

That said, I haven't done any tests of the altivec code on Mac. I'd have
to rely on help from Mac users to fix any problems.

> Auditing the dylib it appears Altivec was not engaged:
>
> $ otool -tV /usr/local/lib/libnettle.dylib | grep perm
> 0001f124b   _nettle_sha3_permute
> _nettle_sha3_permute:
> 000204ecbl  _nettle_sha3_permute
>
> I think there's something a bit sideways here.

You're a bit too terse, I have no idea what problem this is intended to
illustrate.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


ANNOUNCE: Nettle-3.7.2

2021-03-21 Thread Niels Möller
I've prepared a new bug-fix release of Nettle, a low-level
cryptographic library, to fix a serious bug in the function to verify
ECDSA signatures. Implications include an assertion failure, which could
be used for denial-of-service, when verifying signatures on the
secp224r1 and secp521r1 curves. More details in the NEWS file below.

Upgrading is strongly recommended.

The Nettle home page can be found at
https://www.lysator.liu.se/~nisse/nettle/, and the manual at
https://www.lysator.liu.se/~nisse/nettle/nettle.html.

The release can be downloaded from

  https://ftp.gnu.org/gnu/nettle/nettle-3.7.2.tar.gz
  ftp://ftp.gnu.org/gnu/nettle/nettle-3.7.2.tar.gz
  https://www.lysator.liu.se/~nisse/archive/nettle-3.7.2.tar.gz

Regards,
/Niels

NEWS for the Nettle 3.7.2 release

This is a bugfix release, fixing a bug in ECDSA signature
verification that could lead to a denial of service attack
(via an assertion failure) or possibly incorrect results. It
also fixes a few related problems where scalars are required
to be canonically reduced modulo the ECC group order, but in
fact may be slightly larger.

Upgrading to the new version is strongly recommended.

Even when no assert is triggered in ecdsa_verify, ECC point
multiplication may get invalid intermediate values as input,
and produce incorrect results. It's trivial to construct
alleged signatures that result in invalid intermediate values.
It appears difficult to construct an alleged signature that
makes the function misbehave in such a way that an invalid
signature is accepted as valid, but such attacks can't be
ruled out without further analysis.

Thanks to Guido Vranken for setting up the fuzzer tests that
uncovered this problem.

The new version is intended to be fully source and binary
compatible with Nettle-3.6. The shared library names are
libnettle.so.8.3 and libhogweed.so.6.3, with sonames
libnettle.so.8 and libhogweed.so.6.

Bug fixes:

* Fixed bug in ecdsa_verify, and added a corresponding test
  case.

* Similar fixes to ecc_gostdsa_verify and gostdsa_vko.

* Similar fixes to eddsa signatures. The problem is less severe
for these curves, because (i) the potentially out of range
value is derived from the output of a hash function, making it
harder for the attacker to hit the narrow range of
  problematic values, and (ii) the ecc operations are
  inherently more robust, and my current understanding is that
  unless the corresponding assert is hit, the verify
  operation should complete with a correct result.

* Fix to ecdsa_sign, which with a very low probability could
  return out of range signature values, which would be
  rejected immediately by a verifier.

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.





ANNOUNCE: Serious bug in Nettle's ecdsa_verify

2021-03-16 Thread Niels Möller
I've been made aware of a bug in Nettle's code to verify ECDSA
signatures. Certain signatures result in the ecc point multiply function
being called with out-of-range scalars, which may give incorrect
results, or crash in an assertion failure. It's an old bug, probably
present since Nettle's initial implementation of ECDSA.

I've just pushed fixes for ecdsa_verify, as well as a few other cases of
potentially out-of-range scalars, to the master-updates branch. I haven't
fully analysed the implications, but I'll describe my current
understanding.

I think an assertion failure, useful for a denial-of-service attack, is
easy on the curves where the bitsize of q, the group order, is not an
integral number of words. That's secp224r1, on 64-bit platforms, and
secp521r1.

Even when it's not possible to trigger an assertion failure, it's easy
to produce valid-looking input "signatures" that hit out-of-range
intermediate scalar values where point multiplication may misbehave.
This applies to all the NIST secp* curves as well as the GOST curves.

To me, it looks very difficult to make it misbehave in such a way that
ecdsa_verify will think an invalid signature is valid, but it might be
possible; further analysis is needed. I will not be able to analyze it
properly now, if anyone else would like to look into it, I can provide a
bit more background.

ed25519 and ed448 may be affected too, but it appears a bit harder to
find inputs that hit out of range values. And since point operations are
inherently more robust on these curves, I think they will produce
correct results as long as they don't hit the assert.

Advice on how best to deal with this? My current plan is to prepare a
3.7.2 bugfix release (from a new bugfix-only branch, without the new
arm64 code). Maybe as soon as tomorrow (Wednesday, European time), or over
the weekend.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.



Re: Status update

2021-03-07 Thread Niels Möller
@@ IF_LE(`
 
 eorC0.16b,C0.16b,D.16b
 
-PMUL C1,H1M,H1L
-PMUL_SUM C0,H2M,H2L
+PMUL(C1,H1M,H1L)
+PMUL_SUM(C0,H2M,H2L)
 
-REDUCTION D
+REDUCTION(D)
 
 andLENGTH,LENGTH,#31
 
@@ -284,9 +287,9 @@ IF_LE(`
 
 eorC0.16b,C0.16b,D.16b
 
-PMUL C0,H1M,H1L
+PMUL(C0,H1M,H1L)
 
-REDUCTION D
+REDUCTION(D)
 
 Lmod:
 tstLENGTH,#15
@@ -325,9 +328,9 @@ Lmod_8_load:
 Lmod_8_done:
 eorC0.16b,C0.16b,D.16b
 
-PMUL C0,H1M,H1L
+PMUL(C0,H1M,H1L)
 
-REDUCTION D
+    REDUCTION(D)
 
 Ldone:
 IF_LE(`

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-03-07 Thread Niels Möller
Nicolas Mora  writes:

> I've added 2 macros definitions: MSB_XOR_T_WRAP and MSB_XOR_T_UNWRAP,
> I couldn't find how to make just one macro for both cases because of
> the direction of the xor.

Hmm. Maybe better to define an optional swap operation. Like

#if WORDS_BIGENDIAN
#define bswap_if_le(x) (x)
#elif HAVE_BUILTIN_BSWAP64
#define bswap_if_le(x) (__builtin_bswap64 (x))
#else
static uint64_t
bswap_if_le(uint64_t x) 
{
  x = ((x >> 32) & UINT64_C(0xffffffff))
   | ((x & UINT64_C(0xffffffff)) << 32);
  x = ((x >> 16) & UINT64_C(0xffff0000ffff))
   | ((x & UINT64_C(0xffff0000ffff)) << 16);
  x = ((x >> 8)  & UINT64_C(0xff00ff00ff00ff))
   | ((x & UINT64_C(0xff00ff00ff00ff)) << 8);
  return x;
}
#endif

and then use as 

  B.u64[0] = A.u64 ^ bswap_if_le((n * j) + (i + 1));

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-03-07 Thread Niels Möller
Nicolas Mora  writes:

> memcpy (I.b + 8, R + (i * 8), 8); // This one works
> I.u64[1] = *(R + (i * 8)); // This one doesn't work
>
> Is there something I'm missing?

The reason it doesn't work is the type of R. R is now an unaligned
uint8_t *. *(R + (i * 8)) (the same as R[i*8]) is a uint8_t, not a
uint64_t.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-03-06 Thread Niels Möller
Nicolas Mora  writes:

> I still have one unresolved comment about byte swapping but the rest
> are resolved.

Thanks. I'll do this round of comments on email, since it might be of
interest to other contributors.

* About the byteswapping comment, the code

 // A = MSB(64, B) ^ t where t = (n*j)+i
 A64 = READ_UINT64(B.b);
 A64 ^= (n*j)+(i+1);
 WRITE_UINT64(A.b, A64);

could be replaced by something like

#if WORDS_BIGENDIAN
 A.u64 = B.u64 ^ (n*j)+(i+1);
#elif HAVE_BUILTIN_BSWAP64
 A.u64 = B.u64 ^ __builtin_bswap64((n*j)+(i+1));
#else
 ... READ_UINT64 / WRITE_UINT64 or some other workaround ...
#endif

Preferably encapsulated into a single macro, so it doesn't have to be
duplicated in both the wrap and the unwrap function. There's another
example of using __builtin_bswap64 in ctr.c.


* Initialization: If you don't intend to use the initial values, omit
initialization in declarations like

  union nettle_block16 I = {0}, B = {0};
  union nettle_block8 A = {0};

That helps tools like valgrind detect accidental use of uninitialized
data. (And then I'm not even sure exactly how initializers are
interpreted for a union type).
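
That is, plain declarations (using the names quoted above) are enough:

  union nettle_block16 I, B;
  union nettle_block8 A;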

* Some or all memcpys in the main loop can be replaced by uint64_t
operations, e.g.,

  I.u64[0] = A.u64;

instead of 

  memcpy(I.b, A.b, 8);

(memcpy is needed when either left or right hand side is an unaligned
byte buffer). If it turns out that you never use .b on some variable,
you can drop the use of the union type for that variable and use
uint64_t directly.
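
To illustrate the distinction, a rough sketch from the wrap loop (R
stands for the byte buffer discussed in the rest of this mail; the
exact expressions are assumptions, not the final code):

  I.u64[0] = A.u64;                /* aligned union members: plain assignment */
  memcpy(R + i * 8, B.b + 8, 8);   /* unaligned byte buffer: keep memcpy */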

> Therefore I removed 'uint8_t R[64]' to use TMP_GMP_DECL(R, uint8_t);
> instead.

Unfortunately, that doesn't work: This code should go into libnettle
(not libhogweed), and then it can't depend on GMP. You could do plain
malloc + free, but according to the README file, Nettle doesn't do
memory allocation, so that's not ideal.

I think it should be doable to reuse the output buffer as temporary
storage (R = ciphertext for wrap, R = cleartext for unwrap). In-place
operation (ciphertext == cleartext) should be supported (but no partial
overlap), so it's important to test that case.

Using the output area directly has the drawback that it isn't aligned,
so you'll need to keep some memcpys in the main loop. One could consider
using an aligned pointer into output buffer and separate handling of
first and/or last block, but if that's a lot of extra complexty, I
wouldn't do it unless either (i) it gives a significant performance
improvement, or (ii) it turns out to actually be reasonably nice and clean.

* And one more nit: Indentation. It's fine to use TAB characters, but
they must be assumed to be traditional TABs with stops every 8 positions:
changing the appearance of TAB to anything else in one's editor is wrong,
because it makes the code look weird for everyone else (e.g., in gitlab's
UI). And the visual appearance should follow GNU standards: braces on
their own lines, indent steps of two spaces, which usually means SPC
characters, with TAB only for large indentation.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Add AES Key Wrap (RFC 3394) in Nettle

2021-03-04 Thread Niels Möller
Nicolas Mora  writes:

> I've updated the MR with the new functions definitions and added test
> cases based on the test vectors from the RFC.
>
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/19

I've added a couple of comments on the mr.

One question: Do you intentionally limit message size to 64 bytes? Is
that according to spec?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: Status update

2021-03-04 Thread Niels Möller
Maamoun TK  writes:

>> 1. New Arm64 code (don't recall current status off the top of my head).
>
> I almost forgot about fat builds; do you want fat support before merging the
> code to the master branch, or is it ok to add it afterward?

I've merged the arm64 branch now, thanks! Fat build would be nice. And
I'd like to change to m4 macros.

Do you plan to work on arm64 implementations of more algorithms? If I've
got it right, there are extensions with AES and SHA instructions?
Chacha/salsa20 could benefit from general SIMD instructions.

>> 2. s390x testing. I'd prefer to not run a git checkout on the s390x test
>>    machine, but have the ci job make a tarball, ssh it over to the test
>>    machine, unpack in a fresh directory for build and test. This needs
>>    to be in place before adding s390x specific code. When done, could
>>    likely be reused for remote testing on any other platforms of
>>    interest, which aren't directly available in the ci system.

> Done!

Thanks! Sorry I'm a bit slow, but I hope to be able to set up an account
and try this out reasonably soon.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: HPKE implementation

2021-02-25 Thread Niels Möller
Norbert Pocs  writes:

> My current project is the implementation of HPKE draft [0]. The first goal
> is to implement mode_base.

Hi, I was not aware of this work. It could make sense to support it in
Nettle, in particular if GnuTLS wants to use it.

Which combinations of public key mechanism, key derivation/expansion,
and aead are of main interest?

Do you expect the specification to be finalized soon?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


<    1   2   3   4   5   6   7   8   9   10   >