Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-19 Thread Niels Möller
Maamoun TK  writes:

> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.
>
> OpenSSL is still ahead in terms of performance since it uses 4-way
> interleaving, or maybe more!
> Increasing the number of interleaved ways beyond two has nothing to do
> with parallelism, since the execution units are already saturated by
> 2-way interleaving on all three architectures. The reason behind the
> performance improvement is that the number of executions of the
> reduction procedure is cut in half for 4-way interleaving, since the
> products of multiplying the state parts by the key can be combined
> before the reduction phase. Let me know if you are interested in doing
> that in nettle!

Interesting. I haven't paid much attention to the poly1305
implementation since it was added back in 2013. The C implementation
doesn't try to use wider multiplication than 32x32 --> 64, which is poor
for 64-bit platforms. Maybe we could use unsigned __int128 if we can
write a configure test to check if it is available and likely to be
efficient?
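(For what it's worth, gcc and clang predefine __SIZEOF_INT128__ on targets where the type is available, so a configure test could reduce to a compile check along these lines. A sketch only, with ad hoc names, not a worked-out configure.ac fragment.)

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of a probe for unsigned __int128; gcc and clang predefine
   __SIZEOF_INT128__ on targets where the type is available (typically
   64-bit ones).  Names here are ad hoc, not nettle's. */
#if defined(__SIZEOF_INT128__)
typedef unsigned __int128 uint128;

/* 64x64 --> 128 multiply, returning the high 64 bits; a compiler
   should lower this to a single widening multiply instruction. */
static uint64_t
mul64_hi (uint64_t a, uint64_t b)
{
  return (uint64_t) (((uint128) a * b) >> 64);
}
#endif
```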

For most efficient interleaving, I take it one should precompute some
powers of the key, similar to how it's done in the recent gcm code?

> It would be nice if the arm64 patch could be tested in big-endian mode,
> since I don't have access to any big-endian variant for testing.

Merged this one too on a branch for ci testing.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-19 Thread Maamoun TK
On Wed, Jan 19, 2022 at 10:06 PM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> > and 382.65% speedup for s390x.
> >
> > OpenSSL is still ahead in terms of performance speed since it uses 4-way
> > interleaving or maybe more!!
> > Increasing the interleaving ways more than two has nothing to do with
> > parallelism since the execution units are already saturated by using
> 2-ways
> > for the three architectures. The reason behind the performance
> improvement
> > is that the number of executions of the reduction procedure is cut in half
> > for 4-way interleaving since the products of multiplying state parts by
> key
> > can be combined before the reduction phase. Let me know if you are
> > interested in doing that on nettle!
>
> Interesting. I haven't paid much attention to the poly1305
> implementation since it was added back in 2013. The C implementation
> doesn't try to use wider multiplication than 32x32 --> 64, which is poor
> for 64-bit platforms. Maybe we could use unsigned __int128 if we can
> write a configure test to check if it is available and likely to be
> efficient?
>

Wider multiplication would improve performance when using 64-bit general
registers, but for the current SIMD implementations, the radix-2^26
representation remains a good fit.


> For most efficient interleaving, I take it one should precompute some
> powers of the key, similar to how it's done in the recent gcm code?
>

Since the block-iteration loop is inside the assembly implementation,
computing one multiple of the key in the function prologue should be ok.

I forgot to mention that the reduction phase uses the techniques described
in the Reduction section of
https://cryptojedi.org/papers/neoncrypto-20120320.pdf for the arm64 and
s390x implementations, while the serial carry chain h0 -> h1 -> h2 ->
h3 -> h4 -> h0 -> h1 still manages to achieve slightly higher performance
than two independent carry chains on the powerpc64 arch.
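(A toy scalar sketch of the serial carry chain described here, just to make the wrap-around step concrete; the limb layout is the standard radix-26 one, but the function name and bounds handling are mine, not the actual assembly.)

```c
#include <stdint.h>

#define MASK26 ((1U << 26) - 1)

/* Toy sketch of the serial carry chain h0 -> h1 -> h2 -> h3 -> h4 ->
   h0 -> h1 for a radix-26 poly1305 state.  The wrap-around carry out
   of h4 is multiplied by 5, since 2^130 = 5 (mod 2^130 - 5). */
static void
carry_chain (uint64_t h[5])
{
  uint64_t c;
  c = h[0] >> 26; h[0] &= MASK26; h[1] += c;
  c = h[1] >> 26; h[1] &= MASK26; h[2] += c;
  c = h[2] >> 26; h[2] &= MASK26; h[3] += c;
  c = h[3] >> 26; h[3] &= MASK26; h[4] += c;
  c = h[4] >> 26; h[4] &= MASK26; h[0] += 5 * c;
  /* One more partial carry, h0 -> h1, leaves h0 fully reduced and h1
     carrying at most one extra bit into the next block. */
  c = h[0] >> 26; h[0] &= MASK26; h[1] += c;
}
```

Each step depends on the previous one, which is why the two-independent-chain variant mentioned for powerpc64 can pay off on wide cores.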

regards,
Mamone


> > It would be nice if the arm64 patch will be tested on big-endian mode
> since
> > I don't have access to any big-endian variant for testing.
>
> Merged this one too on a branch for ci testing.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-20 Thread Niels Möller
Maamoun TK  writes:

> Wider multiplication would improve the performance for 64-bit general
> registers but as the case for the current SIMD implementation, the radix
> 2^26 fits well there.

If multiply throughput is the bottleneck, it makes sense to do as much
work as possible per multiply. So I don't think I understand the
benefits of interleaving, can you explain?

Let's consider the 64-bit case, since that's less writing. B = 2^64 as
usual. Then the state is

  H = h_2 B^2 + h_1 B + h_0 

(with h_2 rather small, depending on how far we normalize for each
block, lets assume at most 3 bits, or maybe even h_2 <= 4).

  R = r_1 B + r_0

By the spec, the high 4 bits of both r_0 and r_1 are zero, as are the low
2 bits of r_1, which makes the multiplication R H (mod p) particularly nice.

We get 

  R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) 
  + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2

But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where
the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5
r_1 / 4,

  R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)

These are 4 long multiplications (64x64 --> 128) and two short ones
(64x64 --> 64) for the products involving h_2. (The 32-bit version would
be 16 long multiplications and 4 short.)

From the zero high bits, we also get bounds on these terms,

 f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122

 f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2
< 2^125 + 5*2^61 + 2^127

So these two chains can be added together as 128-bit quantities with no
overflow, in any order; there's plenty of parallelism. E.g., the POWER
vmsumudm instruction might be useful.

For the final folding, we need to split f_1 into the top 62 and low 66
bits, multiply the top part by 5 (the product fits in 64 bits), and add it
into f_0, which still fits in 128 bits.

And then take the top 64 bits of f_0 and add them into f_1 (result <=
2^66).
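(As a sanity check of the derivation above, one multiply-and-fold step can be written out with unsigned __int128 as below. This is a sketch with ad hoc names, assuming a gcc/clang-style __int128; the bound analysis above is what justifies the unchecked 128-bit additions.)

```c
#include <stdint.h>

typedef unsigned __int128 uint128;

/* One multiply-and-fold step, H <- R*H mod p, following the derivation:
     f0 = r0 h0 + r1' h1
     f1 = r1 h0 + r0 h1 + r1' h2 + B r0 h2
   with r1' = 5*(r1 >> 2), valid since the low 2 bits of r1 are zero.
   Assumes the clamped-key bounds from the text (rk < 2^60, h2 small). */
static void
mul_r (uint64_t h[3], uint64_t r0, uint64_t r1)
{
  uint64_t r1p = 5 * (r1 >> 2);
  uint128 f0 = (uint128) r0 * h[0] + (uint128) r1p * h[1];
  uint128 f1 = (uint128) r1 * h[0] + (uint128) r0 * h[1]
    + (uint128) r1p * h[2] + ((uint128) (r0 * h[2]) << 64);

  /* Fold the part of f1 at and above bit 66 back into f0:
     B * 2^66 = 2^130 = 5 (mod p). */
  f0 += 5 * (uint64_t) (f1 >> 66);
  f1 &= ((uint128) 1 << 66) - 1;

  h[0] = (uint64_t) f0;
  f1 += f0 >> 64;
  h[1] = (uint64_t) f1;
  h[2] = (uint64_t) (f1 >> 64);
}
```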

The current C implementation uses radix 26, and 25 multiplies (32x32
--> 64) per block. And quite a lot of shifts. A radix 32 variant
analogous to the above would need 16 long multiplies and 4 short. I'd
expect that to be faster on most machines, but I'd have to try that out.


In contrast, trying to use a similar scheme for multiplying by (r^2 (mod
p)), as needed for an interleaved version, seems more expensive. There
are several contributions to the cost:

* First, the accumulation of products by power of B needs to take
  carries into account, as the result can exceed 2^128, so one would need
  something closer to general schoolbook multiplication.

* Second, since r^2 (mod p) may exceed 2^128, we need three words rather
  than two, so three more short multiplications to add in.

* Third, we can't pre-divide key words by 4, since low bits are no longer
  guaranteed to be zero. This gives more expensive reduction, with more
  multiplies by 5.

The first two points make a smaller radix more attractive; if we need
three words for both factors, we can distribute the bits to ensure that
some of the most significant bits are zero.

> Since the loop of block iteration is moved to inside the assembly
> implementation, computing one multiple of key at the function prologue
> should be ok.

For large messages, that's fine, but may add a significant cost for
messages of just two blocks.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> The current C implementation uses radix 26, and 25 multiplies (32x32
> --> 64) per block. And quite a lot of shifts. A radix 32 variant
> analogous to the above would need 16 long multiplies and 4 short. I'd
> expect that to be faster on most machines, but I'd have to try that out.

I've tried this out, see attached file. It has an #if 0/1 to choose
between radix 64 (depending on the non-standard __int128 type for
accumulated products) and radix 32 (portable C).

This is the speed I get for C implementations of poly1305_update on my
x86_64 laptop:

* Radix 26: 1.2 GByte/s (old code)

* Radix 32: 1.3 GByte/s

* Radix 64: 2.2 GByte/s

It would be interesting to see benchmarks on actual 32-bit hardware,
32-bit ARM likely being the most relevant arch.

For comparison, the current x86_64 asm version: 2.5 GByte/s.

If I understood correctly, the suggestion to use radix 26 in djb's
original paper was motivated by a high-speed implementation using
floating-point arithmetic (possibly in combination with SIMD), where the
product of two 26-bit integers can be represented exactly in an IEEE
double (though it gets a bit subtle if we want to accumulate several
products). I haven't really looked into implementing poly1305 with
either floating point or SIMD.

To improve test coverage, I've also extended poly1305 tests with tests
on random inputs, with results compared to a reference implementation
based on gmp/mini-gmp. I intend to merge those testing changes soon.
See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755ace309.

Unfortunately, the http interface of the main git repo at Lysator is
inaccessible at the moment due to an expired certificate; should be
fixed in a day or two.

Regards,
/Niels

/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
#include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;

#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s1 r.r64[2]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 hh

void
_nettle_poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64(key);
  t1 = LE_READ_UINT64(key + 8);

  ctx->r0 = t0 & UINT64_C(0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C(0x0ffffffc0ffffffc);
  ctx->s1 = 5*(ctx->r1 >> 2);

  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned t2)
{
  uint64_t t0, t1;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) (t0 < ctx->h0) + ctx->h1 + LE_READ_UINT64(m+8);
  t1 = s;
  t2 += (s >> 64) + ctx->h2;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
 all the fk sums fit in 128 bits without overflow, with at least
 one bit margin. */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + t2 * ctx->s1
+ ((nettle_uint128_t)(t2 * ctx->r0) << 64);

  /* Fold high part of f1. */
  f0 += 5*(f1 >> 66);
  f1 &= ((nettle_uint128_t) 1 << 66) - 1;
  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
  assert (ctx->h2 <= 4);
}

/* Adds digest to the nonce */
void
_nettle_poly1305_digest (struct poly1305_ctx *ctx, union nettle_block16 *s)
{
  uint64_t t0, t1, t2, c1, mask, s0;

  t0 = ctx->h0;
  t1 = ctx->h1;
  t2 = ctx->h2;

  /* Compute resulting carries when adding 5. */
  c1 = t0 > -(UINT64_C(5));
  t2 += (t1 + c1 < c1);

  /* Set if H >= 2^130 - 5 */
  mask = - (t2 >> 2);

  t0 += mask & 5;
  t1 += mask & c1;

  /* FIXME: Take advantage of s being aligned as an unsigned long. */
  s0 = t0 + LE_READ_UINT64(s->b);
  t1 += (s0 < t0) + LE_READ_UINT64(s->b+8);

  LE_WRITE_UINT64(s->b, s0);
  LE_WRITE_UINT64(s->b+8, t1);

  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread Maamoun TK
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller  wrote:

> ni...@lysator.liu.se (Niels Möller) writes:
>
> > The current C implementation uses radix 26, and 25 multiplies (32x32
> > --> 64) per block. And quite a lot of shifts. A radix 32 variant
> > analogous to the above would need 16 long multiplies and 4 short. I'd
> > expect that to be faster on most machines, but I'd have to try that out.
>
> I've tried this out, see attached file. It has an #if 0/1 to choose
> between radix 64 (depending on the non-standard __int128 type for
> accumulated products) and radix 32 (portable C).
>
> This is the speed I get for C implementations of poly1305_update on my
> x86_64 laptop:
>
> * Radix 26: 1.2 GByte/s (old code)
>
> * Radix 32: 1.3 GByte/s
>
> * Radix 64: 2.2 GByte/s
>
> It would be interesting with benchmarks on actual 32-bit hardware,
> 32-bit ARM likely being the most relevant arch.
>
> For comparison, the current x86_64 asm version: 2.5 GByte/s.
>

I made a performance test of this patch on the available architectures I
have access to.

Arm64 (gcc117 gfarm):
* Radix 26: 0.65 GByte/s
* Radix 26 (2-way interleaved): 0.92 GByte/s
* Radix 32: 0.55 GByte/s
* Radix 64: 0.58 GByte/s
POWER9:
* Radix 26: 0.47 GByte/s
* Radix 26 (2-way interleaved): 1.15 GByte/s
* Radix 32: 0.52 GByte/s
* Radix 64: 0.58 GByte/s
Z15:
* Radix 26: 0.65 GByte/s
* Radix 26 (2-way interleaved): 3.17 GByte/s
* Radix 32: 0.82 GByte/s
* Radix 64: 1.22 GByte/s

Apparently, the higher-radix versions have performance improvements on
x86_64, powerpc, and s390x, but this is not the case for the arm64 arch,
where performance takes a slight hit.

I tried to compile the new code with the -m32 flag on x86_64 but I got
"poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
target". Unfortunately, I don't have access to 32-bit arm either.

Also, I've disassembled the update function of the radix-64 version, and
none of the architectures makes use of SIMD (including x86_64, which
didn't use the XMM registers that are standard for this arch; I don't
know whether gcc supports such behavior when compiling C, but I'm aware
that MSVC takes advantage of that standardization for further
optimization of compiled C code).

I'm trying to implement radix 64 using SIMD to see if we can get any
performance boost; I'll post the results once I'm done.

regards,
Mamone




Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-23 Thread David Edelsohn
On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK  wrote:
>
> On Sun, Jan 23, 2022 at 9:10 PM Niels Möller  wrote:
>
> > ni...@lysator.liu.se (Niels Möller) writes:
> >
> > > The current C implementation uses radix 26, and 25 multiplies (32x32
> > > --> 64) per block. And quite a lot of shifts. A radix 32 variant
> > > analogous to the above would need 16 long multiplies and 4 short. I'd
> > > expect that to be faster on most machines, but I'd have to try that out.
> >
> > I've tried this out, see attached file. It has an #if 0/1 to choose
> > between radix 64 (depending on the non-standard __int128 type for
> > accumulated products) and radix 32 (portable C).
> >
> > This is the speed I get for C implementations of poly1305_update on my
> > x86_64 laptop:
> >
> > * Radix 26: 1.2 GByte/s (old code)
> >
> > * Radix 32: 1.3 GByte/s
> >
> > * Radix 64: 2.2 GByte/s
> >
> > It would be interesting with benchmarks on actual 32-bit hardware,
> > 32-bit ARM likely being the most relevant arch.
> >
> > For comparison, the current x86_64 asm version: 2.5 GByte/s.
> >
>
> I made a performance test of this patch on the available architectures I
> have access to.
>
> Arm64 (gcc117 gfarm):
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 0.92 GByte/s
> * Radix 32: 0.55 GByte/s
> * Radix 64: 0.58 GByte/s
> POWER9:
> * Radix 26: 0.47 GByte/s
> * Radix 26 (2-way interleaved): 1.15 GByte/s
> * Radix 32: 0.52 GByte/s
> * Radix 64: 0.58 GByte/s
> Z15:
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 3.17 GByte/s
> * Radix 32: 0.82 GByte/s
> * Radix 64: 1.22 GByte/s
>
> Apparently, the higher radix version has performance improvements on
> x86_64, powerpc, and s390x but this is not the case for arm64 arch where
> the performance has a slight hit there.

That might be an artifact of the specific ARM processor
microarchitecture or the memory subsystem of the ARM system, not
inherent to the Arm AArch64 architecture and ISA.

- David



Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-24 Thread Niels Möller
Maamoun TK  writes:

> I made a performance test of this patch on the available architectures I
> have access to.
>
> Arm64 (gcc117 gfarm):
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 0.92 GByte/s
> * Radix 32: 0.55 GByte/s
> * Radix 64: 0.58 GByte/s
> POWER9:
> * Radix 26: 0.47 GByte/s
> * Radix 26 (2-way interleaved): 1.15 GByte/s
> * Radix 32: 0.52 GByte/s
> * Radix 64: 0.58 GByte/s
> Z15:
> * Radix 26: 0.65 GByte/s
> * Radix 26 (2-way interleaved): 3.17 GByte/s
> * Radix 32: 0.82 GByte/s
> * Radix 64: 1.22 GByte/s

Interesting. I'm a bit surprised the radix-64 doesn't perform better, in
particular on arm64. (But I'm not yet familiar with arm64 multiply
instructions).

Numbers for 2-way interleaving are impressive; I'd like to understand
how that works. It might be useful to derive the corresponding multiply
throughput, i.e., the number of multiply operations (and with which
multiply instruction) completed per cycle, as well as total cycles per
block.

It looks like the folding done per block in the radix-64 code costs at
least 5 or so cycles per block (since these operations are all
dependent, and we also have the multiply by 5 in there, probably adding
a few cycles more). Maybe at least the multiply can be postponed.

> I tried to compile the new code with -m32 flag on x86_64 but I got
> "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
> target".

That's expected, in two ways: I don't expect radix 64 to give any
performance gain over radix 32 on any 32-bit archs. And I think __int128
is supported only on archs where it fits in two registers. If we start
using __int128 we need a configure test for it, and then it actually
makes things simpler, at least in this use case, if it stays
unsupported on 32-bit archs where it shouldn't be used.

So to compile with -m32, the radix-64 code must be #if:ed out.

> Also, I've disassembled the update function of Radix 64 and none of the
> architectures has made use of SIMD support (including x86_64 that hasn't
> used XMM registers which is standard for this arch, I don't know if gcc
> supports such behavior for C compiling but I'm aware that MSVC takes
> advantage of that standardization for further optimization on compiled C
> code).

The radix-64 code really wants multiply instruction(s) for 64x64 -->
128, and I think those are not so common in SIMD instruction sets (but
powerpc64 vmsumudm looks potentially useful?), either as a
single instruction or as a pair of mulhigh/mullow instructions. It also
wants some not too complicated way to do a 128-bit add with proper carry
propagation in the middle.

Arm32 Neon does have 32x32 --> 64, which looks like a good fit for the
radix-32 variant.

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> This is the speed I get for C implementations of poly1305_update on my
> x86_64 laptop:
>
> * Radix 26: 1.2 GByte/s (old code)
>
> * Radix 32: 1.3 GByte/s
>
> * Radix 64: 2.2 GByte/s
>
> It would be interesting with benchmarks on actual 32-bit hardware,
> 32-bit ARM likely being the most relevant arch.
>
> For comparison, the current x86_64 asm version: 2.5 GByte/s.

I've tried reworking the folding, to reduce latency. The idea is to let
the most significant state word be close to a full word, rather than
limited to <= 4 as in the previous version. When multiplying by r, split
one of the multiplies to take out the low 2 bits. For the radix-64
version, that term is

  B^2 t_2 * r0

Split t_2 as 4*hi + lo; then this can be reduced to

  B^2 lo * r0 + hi * 5*r0

(using the same old B^2 = 5/4 (mod p) in a slightly different way).

The 5*r0 fits in one word and can be precomputed, and then this
multiplication goes in parallel with the other multiplies, with no
multiply left in the final per-block folding. With this trick I get, on
the same machine:

Radix 32: 1.65 GByte/s

Radix 64: 2.75 GByte/s, i.e., faster than the current x86_64 asm version.

I haven't yet done a strict analysis of bounds on the state and
temporaries, but I would expect that it works out with no possibility of
overflow.

See attached file. To fit the precomputed 5*r0 in a nice way I had to
rearrange the unions in struct poly1305_ctx a bit; I also attach the
patch to do this. The size of the struct should be the same, so I think
it can be done without any ABI bump.

Regards,
/Niels

diff --git a/poly1305.h b/poly1305.h
index 99c63c8a..6c13a590 100644
--- a/poly1305.h
+++ b/poly1305.h
@@ -55,18 +55,15 @@ struct poly1305_ctx {
   /* Key, 128-bit value and some cached multiples. */
   union
   {
-uint32_t r32[6];
-uint64_t r64[3];
+uint32_t r32[8];
+uint64_t r64[4];
   } r;
-  uint32_t s32[3];
   /* State, represented as words of 26, 32 or 64 bits, depending on
  implementation. */
-  /* High bits first, to maintain alignment. */
-  uint32_t hh;
   union
   {
-uint32_t h32[4];
-uint64_t h64[2];
+uint32_t h32[6];
+uint64_t h64[3];
   } h;
 };
 
/* poly1305-internal.c

   Copyright: 2013 Nikos Mavrogiannopoulos
   Copyright: 2013, 2022 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
*/

#if HAVE_CONFIG_H
#include "config.h"
#endif

#include <assert.h>
#include <string.h>

#include "poly1305.h"
#include "poly1305-internal.h"

#include "macros.h"

#if 1
typedef unsigned __int128 nettle_uint128_t;

#define M64(a,b) ((nettle_uint128_t)(a) * (b))

#define r0 r.r64[0]
#define r1 r.r64[1]
#define s0 r.r64[2]
#define s1 r.r64[3]
#define h0 h.h64[0]
#define h1 h.h64[1]
#define h2 h.h64[2]

void
_nettle_poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
{
  uint64_t t0, t1;
  t0 = LE_READ_UINT64 (key);
  t1 = LE_READ_UINT64 (key + 8);

  ctx->r0 = t0 & UINT64_C (0x0ffffffc0fffffff);
  ctx->r1 = t1 & UINT64_C (0x0ffffffc0ffffffc);
  ctx->s0 = 5*ctx->r0;
  ctx->s1 = 5*(ctx->r1 >> 2);

  ctx->h0 = 0;
  ctx->h1 = 0;
  ctx->h2 = 0;
}

void
_nettle_poly1305_block (struct poly1305_ctx *ctx, const uint8_t *m, unsigned m128)
{
  uint64_t t0, t1, t2;
  nettle_uint128_t s, f0, f1;

  /* Add in message block */
  t0 = ctx->h0 + LE_READ_UINT64(m);
  s = (nettle_uint128_t) ctx->h1 + (t0 < ctx->h0) + LE_READ_UINT64(m+8);
  t1 = s;
  t2 = ctx->h2 + (s >> 64) + m128;

  /* Key constants are bounded by rk < 2^60, sk < 5*2^58, therefore
 all the fk sums fit in 128 bits without overflow, with at least
 one bit margin. */
  f0 = M64(t0, ctx->r0) + M64(t1, ctx->s1) + M64(t2 >> 2, ctx->s0);
  f1 = M64(t0, ctx->r1) + M64(t1, ctx->r0) + M64(t2, ctx->s1)
+ ((nettle_uint128_t)((t2 & 3) * ctx->r0) << 64);

  ctx->h0 = f0;
  f1 += f0 >> 64;
  ctx->h1 = f1;
  ctx->h2 = f1 >> 64;
}

/* Adds digest to the nonce */
void
_nettle_poly1305_digest (struct poly1305_ctx *ctx, union nettle_block16 *s)
{
  uint64_t t0, t1, t2, c0, c1, mask, m0;
  t0 = ctx->h0;
  t1 = ctx->h1;
  t2 = ctx->h2;

  /* Fold high part of t2 */
  c

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-25 Thread Maamoun TK
On Mon, Jan 24, 2022 at 12:58 AM David Edelsohn  wrote:

> > Apparently, the higher radix version has performance improvements on
> > x86_64, powerpc, and s390x but this is not the case for arm64 arch where
> > the performance has a slight hit there.
>
> That might be an artifact of the specific ARM processor
> microarchitecture or the memory subsystem of the ARM system, not
> inherent to the Arm AArch64 architecture and ISA.
>

Seems right. I've tested the enhanced wider-multiplication versions on
other gfarm aarch64 instances and got different results.

On Mon, Jan 24, 2022 at 10:01 AM Niels Möller  wrote:

>
> Interesting. I'm a bit surprised the radix-64 doesn't perform better, in
> particular on arm64. (But I'm not yet familiar with arm64 multiply
> instructions).
>

It looks like wider multiplication can achieve higher speed on a different
aarch64 instance on gfarm. Here are the numbers on the gcc185 instance:

* Radix 26: 0.83 GByte/s
* Radix 26 (2-way interleaved): 0.70 GByte/s
* Radix 64 (Latest version): 1.25 GByte/s

These numbers are a bit of a surprise too, since the 2-way interleaving is
supposed to perform better than the old C version, as on the other
architectures! Anyway, the benchmark numbers for powerpc and s390x were not
taken from gfarm instances, so they are ok to rely on.


> Numbers for 2-way interleaving are impressive, I'd like to understand
> how that works.
>

Vector 32-bit multiplication applies two multiply operations on inputs and
places the concatenated results on the destination vector. So the current
simd/altivec implementations interleave two blocks horizontally over vector
registers and execute both multiplication and reduction phases on both
blocks simultaneously. However, after each block iteration we should
combine the two states together by splitting the concatenated value and
adding it to the origin, but to avoid that overhead one can multiply both
state parts by r^2, except for the last two blocks, where the first part is
multiplied by r^2 and the second one by r.
Let's consider a message of 4-blocks b0,b1,b2,b3 multiplying state by hash
has the sequence h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r
With the interleaved implementation this sequence is executed in two iterations.
First iteration:
(h+b0) r^2 || b1 r^2
Second iteration:
((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r

When getting out of the loop we combine the two state parts together so we
get the same correct sequence of r powers for each block.

Also, the two-independent-carry technique mentioned previously overlaps two
carry procedures with each other, including the long carry from h4 to h0,
which offers the opportunity for a further boost.
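To make the scheme above concrete, here is a small bignum sketch (plain Python, not nettle code) checking that the interleaved evaluation with r^2, and r only for the final block of the second half, matches straightforward sequential evaluation:

```python
# Illustrative sketch of 2-way interleaved Poly1305 accumulation.
# All values are Python big ints; blocks/key are arbitrary stand-ins.
P = (1 << 130) - 5

def sequential(h, blocks, r):
    # The reference order: h = (h+b0) r^n + b1 r^(n-1) + ... + b(n-1) r
    for b in blocks:
        h = (h + b) * r % P
    return h

def interleaved(h, blocks, r):
    # Two accumulators, each advanced by r^2 per iteration; the last
    # pair is multiplied by r^2 and r respectively before combining.
    assert len(blocks) % 2 == 0 and len(blocks) >= 2
    r2 = r * r % P
    h0, h1 = h, 0
    for i in range(0, len(blocks) - 2, 2):
        h0 = (h0 + blocks[i]) * r2 % P
        h1 = (h1 + blocks[i + 1]) * r2 % P
    h0 = (h0 + blocks[-2]) * r2 % P
    h1 = (h1 + blocks[-1]) * r % P
    return (h0 + h1) % P
```

For 4 blocks this reproduces exactly the two iterations shown above: (h+b0) r^2 || b1 r^2, then ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r.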

Might be useful derive correspon

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-25 Thread Niels Möller
Maamoun TK  writes:

> It looks like wider multiplication would achieve higher speed on different
> aarch64 instance on gfarm. Here are the numbers on gcc185 instance:
>
> * Radix 26: 0.83 GByte/s
> * Radix 26 (2-way interleaved): 0.70 GByte/s
> * Radix 64 (Latest version): 1.25 GByte/s
>
> These numbers are a bit of a surprise too since the 2-way interleaving is
> supposed to perform better than the old C version similarly to other
> architectures!
> Anyway, the benchmark numbers of powerpc and s390x were not taken from
> gfarm instances and it's ok to be based on.

In the meantime, I pushed the latest C radix-32 version to a branch
poly1305-radix32, and benchmarked on one of the 32-bit ARM boards that
are part of the GMP test systems (the odxu4 system on
https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:

  Radix 26 (old code): 326 MByte/s
  Radix 32 (new code): 260 MByte/s

So radix-32 doesn't seem to be a good option there. I had a quick look
at the generated assembly, besides the expected umull and umlal
instructions to do the main multiply work, there's an awful lot of loads
and stores to the stack, roughly half of the instructions. The compiler
installed is gcc-5.4, rather old.

> Vector 32-bit multiplication applies two multiply operations on inputs and
> places the concatenated results on the destination vector. So the current
> simd/altivec implementations interleave two blocks horizontally over vector
> registers and execute both multiplication and reduction phases on both
> blocks simultaneously. However, after each block iteration we should
> combine the two states together by splitting the concatenated value and
> adding it to the origin, but to avoid that overhead one can multiply both
> state parts by r^2, except for the last two blocks, where the first part is
> multiplied by r^2 and the second one by r.
> Let's consider a message of 4-blocks b0,b1,b2,b3 multiplying state by hash
> has the sequence h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r
> With the interleaved implementation this sequence is executed in two iterations.
> First iteration:
> (h+b0) r^2 || b1 r^2
> Second iteration:
> ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r
>
> When getting out of the loop we combine the two state parts together so we
> get the same correct sequence of r powers for each block.
>
> Also, the two-independent-carry technique mentioned previously
> overlaps two carry procedures with each other including the long carry from
> h4 to h0 which offers the opportunity for further boost.

Hmm, it seems that avoiding long carry chains is the main reason why
radix 26 can be faster.
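A sketch of what the short, overlapping carry chains look like in radix 2^26 (illustrative Python, not nettle code; the "two independent carries" run as two chains that overlap, with the h4 -> h0 wrap folded in via 2^130 ≡ 5 mod 2^130 - 5):

```python
# Two overlapping carry chains over five 26-bit limbs.
M26 = (1 << 26) - 1

def carry_two_chains(h):
    # chain A: h0 -> h1 -> h2 -> h3;  chain B: h3 -> h4 -> h0 -> h1
    c0 = h[0] >> 26; h[0] &= M26
    c3 = h[3] >> 26; h[3] &= M26
    h[1] += c0; h[4] += c3
    c1 = h[1] >> 26; h[1] &= M26
    c4 = h[4] >> 26; h[4] &= M26
    h[2] += c1
    h[0] += 5 * c4          # the 2^130 carry wraps around multiplied by 5
    c2 = h[2] >> 26; h[2] &= M26
    c0 = h[0] >> 26; h[0] &= M26
    h[3] += c2; h[1] += c0
    return h
```

The two chains have no data dependency on each other until the wrap-around step, so superscalar or SIMD hardware can execute them in parallel, which is the latency advantage over one long radix-64 carry chain.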

> Great! It performs better on all tested architectures. Apparently, AArch64
> SIMD doesn't support 64*64->128 vector multiplication so I've implemented
> this version on powerpc by utilizing the power9-specific vmsumudm instruction.
> I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly
> version, I'll attach the hardware implementation in this email.

But you had 1.15 GByte/s for the 2-way interleaved version on this machine?

> define(`FUNC_ALIGN', `5')
> PROLOGUE(_nettle_poly1305_block)
>   ld  H0, 32(CTX)
>   ld  H1, 40(CTX)
>   ld  H2, 48(CTX)
>   ld  T0, 0(M)
>   ld  T1, 8(M)
>
>   addc    T0, T0, H0
>   adde    T1, T1, H1
>   adde    T2, M128, H2
>
>   li  IDX, 16
>   lxvd2x  VSR(R), 0, CTX
>   lxvd2x  VSR(S), IDX, CTX
>
>   li  RZ, 0
>   vxor    ZERO, ZERO, ZERO
>   vxor    F0, F0, F0
>   vxor    F1, F1, F1
>   vxor    TMP, TMP, TMP
>
>   xxpermdi VSR(MU0), VSR(R), VSR(S), 0b01
>   xxswapd VSR(MU1), VSR(R)
>   
>   mtvsrdd VSR(T), T0, T1
>   mtvsrdd VSR(T10), 0, T2
>   andi.   T2A, T2, 3
>   mtvsrdd VSR(T11), 0, T2A
>   srdi    T2A, T2, 2
>   mtvsrdd VSR(T00), T2A, RZ

I don't get all of the setup, but perhaps it would be better to load
input (T0, T1) and state (H0, H1, H2) directly into vector registers,
and avoid moves between regular registers and vectors.

For the R and S values, the key setup could store them in the right
order so they don't have to be permuted after load.
 
>   vmsumudm F0, T, MU0, F0
>   vmsumudm F1, T, MU1, F1
>   vmsumudm TMP, T11, MU1, TMP
>
>   vmsumudm F0, T00, S, F0
>   vmsumudm F1, T10, MU0, F1

This part is as neat as I had hoped! Is there some variant of the
instructions that writes the result register without adding, to avoid
the explicit clearing of F0 and F1? It may also be doable with one
instruction fewer; the 5 instructions do 10 multiplies, but I think we
use only 7, the rest must somehow be zeroed or ignored.
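For reference, vmsumudm can be modeled as a two-lane dot product accumulated into a 128-bit sum (a rough sketch of the instruction's arithmetic, ignoring register and endianness details):

```python
# Rough model of Power9 vmsumudm ("Vector Multiply-Sum Unsigned
# Doubleword Modulo"): multiply the two 64-bit lanes pairwise and add
# both products plus a 128-bit accumulator, modulo 2^128.
M128 = (1 << 128) - 1

def vmsumudm(a, b, c):
    # a, b: two 64-bit lane values; c: 128-bit accumulator value
    return (a[0] * b[0] + a[1] * b[1] + c) & M128
```

With this model, the five instructions above provide ten 64x64 multiplies in total; lanes that hold zeros (e.g. in T00 and T11) account for the unused products.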

>   xxmrgld

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-25 Thread Maamoun TK
On Tue, Jan 25, 2022 at 10:24 PM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > It looks like wider multiplication would achieve higher speed on
> different
> > aarch64 instance on gfarm. Here are the numbers on gcc185 instance:
> >
> > * Radix 26: 0.83 GByte/s
> > * Radix 26 (2-way interleaved): 0.70 GByte/s
> > * Radix 64 (Latest version): 1.25 GByte/s
> >
> > These numbers are a bit of a surprise too since the 2-way interleaving is
> > supposed to perform better than the old C version similarly to other
> > architectures!
> > Anyway, the benchmark numbers of powerpc and s390x were not taken from
> > gfarm instances and it's ok to be based on.
>
> In the meantime, I pushed the latest C radix-32 version to a branch
> poly1305-radix32, and benchmarked on one of the 32-bit ARM boards that
> are part of the GMP test systems (the odxu4 system on
> https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:
>
>   Radix 26 (old code): 326 MByte/s
>   Radix 32 (new code): 260 MByte/s
>
> So radix-32 doesn't seem to be a good option there. I had a quick look
> at the generated assembly, besides the expected umull and umlal
> instructions to do the main multiply work, there's an awful lot of loads
> and stores to the stack, roughly half of the instructions. The compiler
> installed is gcc-5.4, rather old.
>
> > Vector 32-bit multiplication applies two multiply operations on inputs
> and
> > places the concatenated results on the destination vector. So the current
> > simd/altivec implementations interleave two blocks horizontally over
> vector
> > registers and execute both multiplication and reduction phases on both
> > blocks simultaneously. However, after each block iteration we should
> > combine the two states together by splitting the concatenated value and
> > adding it to the origin, but to avoid that overhead one can multiply both
> > state parts with r^2 except for the last two blocks that imply
> multiplying
> > the first part with r^2 and the second one with r.
> > Let's consider a message of 4-blocks b0,b1,b2,b3 multiplying state by
> hash
> > has the sequence h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r
> > With interleaved implementation this sequence is executed in two
> iteration.
> > First iteration:
> > (h+b0) r^2 || b1 r^2
> > Second iteration:
> > ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r
> >
> > When getting out of the loop we combine the two state parts together so
> we
> > get the same correct sequence of r powers for each block.
> >
> > Also, the two-independent carry technique that mentioned previously
> > overlaps two carry procedures with each other including the long carry
> from
> > h4 to h0 which offers the opportunity for further boost.
>
> Hmm, it seems that avoiding long carry chains is the main reason why
> radix 26 can be faster.
>

Yes, with the sequential carry path the SIMD function would slightly beat
the radix-26 C function.


> > Great! It performs better on all tested architectures. Apparently,
> AArch64
> > SIMD doesn't support 64*64->128 vector multiplication so I've implemented
> > this version on powerpc by utilizing vmsumudm power9-specific
> instruction.
> > I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly
> > version, I'll attach the hardware implementation in this email.
>
> But you had 1.15 GByte/s for the 2-way interleaved version on this machine?
>

Right, the 2-way interleaved version is more efficient than this one and
supports POWER7+ processors.


> > define(`FUNC_ALIGN', `5')
> > PROLOGUE(_nettle_poly1305_block)
> >   ld  H0, 32(CTX)
> >   ld  H1, 40(CTX)
> >   ld  H2, 48(CTX)
> >   ld  T0, 0(M)
> >   ld  T1, 8(M)
> >
> >   addc    T0, T0, H0
> >   adde    T1, T1, H1
> >   adde    T2, M128, H2
> >
> >   li  IDX, 16
> >   lxvd2x  VSR(R), 0, CTX
> >   lxvd2x  VSR(S), IDX, CTX
> >
> >   li  RZ, 0
> >   vxor    ZERO, ZERO, ZERO
> >   vxor    F0, F0, F0
> >   vxor    F1, F1, F1
> >   vxor    TMP, TMP, TMP
> >
> >   xxpermdi VSR(MU0), VSR(R), VSR(S), 0b01
> >   xxswapd VSR(MU1), VSR(R)
> >
> >   mtvsrdd VSR(T), T0, T1
> >   mtvsrdd VSR(T10), 0, T2
> >   andi.   T2A, T2, 3
> >   mtvsrdd VSR(T11), 0, T2A
> >   srdi    T2A, T2, 2
> >   mtvsrdd VSR(T00), T2A, RZ
>
> I don't get all of the setup, but perhaps it would be better to load
> input (T0, T1) and state (H0, H1, H2) directly into vector registers,
> and avoid move between regular registers and vectors.
>

I was having difficulty using vector addition with carry, so I resorted to
general registers for that purpose. Also, general AND and shift
operations are easier to use than the vector ones since the la

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-26 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

>> This is the speed I get for C implementations of poly1305_update on my
>> x86_64 laptop:
>>
>> * Radix 26: 1.2 GByte/s (old code)
>>
>> * Radix 32: 1.3 GByte/s
>>
>> * Radix 64: 2.2 GByte/s
[...]
>> For comparison, the current x86_64 asm version: 2.5 GByte/s.
[...]
> I've tried reworking folding, to reduce latency [...] With this trick I get on
> the same machine
>
> Radix 32: 1.65 GByte/s
>
> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.

And I've now tried the same method for the x86_64 implementation. See
attached file + needed patch to asm.m4. This gives 2.9 GByte/s. 

I'm not entirely sure the cycle numbers are accurate, with the clock frequency
not being fixed. I think the machine runs benchmarks at 2.1GHz, and then
this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
instructions per cycle, 0.5 multiply instructions per cycle.

This laptop has an AMD zen2 processor, which should be capable of
issuing four instructions per cycle and complete one multiply
instruction per cycle (according to
https://gmplib.org/~tege/x86-timing.pdf). 

This seems to indicate that on this hardware, speed is not limited by
multiplier throughput, instead, the bottleneck is instruction
decoding/issuing, with max four instructions per cycle.
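The cycle figures quoted above follow from the throughput and the assumed clock (a back-of-envelope check; the 2.1 GHz figure is the assumption stated in the text):

```python
# Convert measured throughput to per-block and per-byte cycle counts.
freq = 2.1e9            # assumed benchmark clock, Hz
rate = 2.9e9            # measured throughput, bytes/s
block = 16              # poly1305 block size, bytes

cycles_per_block = freq * block / rate   # ~11.6
cycles_per_byte = cycles_per_block / block  # ~0.72
```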

Regards,
/Niels

diff --git a/asm.m4 b/asm.m4
index 4ac21c20..60c66c25 100644
--- a/asm.m4
+++ b/asm.m4
@@ -94,10 +94,10 @@ C For 64-bit implementation
 STRUCTURE(P1305)
   STRUCT(R0, 8)
   STRUCT(R1, 8)
+  STRUCT(S0, 8)
   STRUCT(S1, 8)
-  STRUCT(PAD, 12)
-  STRUCT(H2, 4)
   STRUCT(H0, 8)
   STRUCT(H1, 8)
+  STRUCT(H2, 8)
 
 divert
C x86_64/poly1305-internal.asm

ifelse(`
   Copyright (C) 2013 Niels Möller

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

 * the GNU Lesser General Public License as published by the Free
   Software Foundation; either version 3 of the License, or (at your
   option) any later version.

   or

 * the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')

.file "poly1305-internal.asm"

C Registers mainly used by poly1305_block
define(`CTX', `%rdi') C First argument to all functions

define(`KEY', `%rsi')
define(`MASK',` %r8')
C _poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
.text
ALIGN(16)
PROLOGUE(_nettle_poly1305_set_key)
W64_ENTRY(2,0)
mov	$0x0ffffffc0fffffff, MASK
mov	(KEY), %rax
and	MASK, %rax
and	$-4, MASK
mov	%rax, P1305_R0 (CTX)
imul	$5, %rax
mov	%rax, P1305_S0 (CTX)
mov	8(KEY), %rax
and	MASK, %rax
mov	%rax, P1305_R1 (CTX)
shr	$2, %rax
imul	$5, %rax
mov	%rax, P1305_S1 (CTX)
xor	XREG(%rax), XREG(%rax)
mov	%rax, P1305_H0 (CTX)
mov	%rax, P1305_H1 (CTX)
mov	%rax, P1305_H2 (CTX)

W64_EXIT(2,0)
ret

undefine(`KEY')
undefine(`MASK')

EPILOGUE(_nettle_poly1305_set_key)

define(`T0', `%rcx')
define(`T1', `%rsi')	C Overlaps message input pointer.
define(`T2', `%r8')
define(`H0', `%r9')
define(`H1', `%r10')
define(`F0', `%r11')
define(`F1', `%r12')

C Compute in parallel
C
C {H1,H0} = R0 T0 + S1 T1 + S0 (T2 >> 2)
C {F1,F0} = R1 T0 + R0 T1 + S1 T2
C T = R0 * (T2 & 3)
C
C Then accumulate as
C
C +--+--+--+
C |T |H1|H0|
C +--+--+--+
C   + |F1|F0|
C   --+--+--+--+
C |H2|H1|H0|
C +--+--+--+
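The folding identity in the comment above can be checked with a quick bignum sketch (illustrative Python, not nettle code; it assumes the usual Poly1305 clamping of R0 and R1, with S0 = 5 R0 and S1 = 5 (R1 >> 2) as stored by set_key):

```python
# Verify: with h = T2*2^128 + T1*2^64 + T0 and r = R1*2^64 + R0,
# the three independent accumulations below equal h*r mod 2^130 - 5.
P = (1 << 130) - 5

def fold(T0, T1, T2, R0, R1):
    S0 = 5 * R0            # 5*R0
    S1 = 5 * (R1 >> 2)     # 5*(R1/4); R1's low two bits are clamped to 0
    H = R0 * T0 + S1 * T1 + S0 * (T2 >> 2)   # weight 2^0
    F = R1 * T0 + R0 * T1 + S1 * T2          # weight 2^64
    T = R0 * (T2 & 3)                        # weight 2^128
    return (H + (F << 64) + (T << 128)) % P

MASK = 0x0ffffffc0fffffff
R0 = 0x0123456789abcdef & MASK
R1 = 0xfedcba9876543210 & MASK & -4
T0, T1, T2 = 0xdeadbeefcafebabe, 0x0123456789abcdef, 5
h = (T2 << 128) | (T1 << 64) | T0
r = (R1 << 64) | R0
```

The key point for latency is visible here: H, F and T use only the inputs T0, T1, T2 and the (pre-scaled) key, so all multiplies for a block are independent of each other.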

C _poly1305_block (struct poly1305_ctx *ctx, const uint8_t m[16], unsigned hi)

PROLOGUE(_nettle_poly1305_block)
W64_ENTRY(3, 0)
push	%r12
mov (%rsi), T0
mov 8(%rsi), T1
mov	XREG(%rdx), XREG(T2)	C Also zero extends

add P1305_H0 (CTX), T0
adc P1305_H1 (CTX), T1
adc P1305_H2 (CTX), T2

mov P1305_R1 (CTX), %rax
mul T0  C R1 T0
mov %rax, F0
mov %rdx, F1

mov	T0, %rax	C Last use of T0 input
mov P1305_R0 (CTX), T0
mul T0  C R0*T0
mov %rax, H0
mov %rdx, H1

mov T1, %rax
mul T0  C R0*T1
add %rax, F0
adc

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

>> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
>
> And I've now tried the same method for the x86_64 implementation. See
> attached file + needed patch to asm.m4. This gives 2.9 GByte/s. 
>
> I'm not entirely sure the cycle numbers are accurate, with the clock frequency
> not being fixed. I think the machine runs benchmarks at 2.1GHz, and then
> this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
> instructions per cycle, 0.5 multiply instructions per cycle.
>
> This laptop has an AMD zen2 processor, which should be capable of
> issuing four instructions per cycle and complete one multiply
> instruction per cycle (according to
> https://gmplib.org/~tege/x86-timing.pdf). 
>
> This seems to indicate that on this hardware, speed is not limited by
> multiplier throughput, instead, the bottleneck is instruction
> decoding/issuing, with max four instructions per cycle.

Benchmarked also on my other nearby x86_64 machine (intel broadwell
processor). It's faster there too (from 1.4 GByte/s to 1.75). I'd expect
it to be generally faster, and have pushed it to the master-updates
branch.

I haven't looked that carefully at what the old code was doing, but I
think the final folding for each block used a multiply instruction that
then depends on the previous ones for that block, increasing the per
block latency. With the new code, all multiplies done for a block are
independent of each other.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Maamoun TK
On Thu, Jan 27, 2022 at 11:28 PM Niels Möller  wrote:

> ni...@lysator.liu.se (Niels Möller) writes:
>
> >> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
> >
> > And I've now tried the same method for the x86_64 implementation. See
> > attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
> >
> > I'm not entirely sure cycle numbers are accurate, with clock frequence
> > not being fixed. I think the machine runs bechmarks at 2.1GHz, and then
> > this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
> > instructions per cycle, 0.5 multiply instructions per cycle.
> >
> > This laptop has an AMD zen2 processor, which should be capable of
> > issuing four instructions per cycle and complete one multiply
> > instruction per cycle (according to
> > https://gmplib.org/~tege/x86-timing.pdf).
> >
> > This seems to indicate that on this hardware, speed is not limited by
> > multiplier throughput, instead, the bottleneck is instruction
> > decoding/issuing, with max four instructions per cycle.
>
> Benchmarked also on my other nearby x86_64 machine (intel broadwell
> processor). It's faster there too (from 1.4 GByte/s to 1.75). I'd expect
> it to be generally faster, and have pushed it to the master-updates
> branch.
>
> I haven't looked that carefully at what the old code was doing, but I
> think the final folding for each block used a multiply instruction that
> then depends on the previous ones for that block, increasing the per
> block latency. With the new code, all multiplies done for a block are
> independent of each other.
>

Great! I believe this is the best we can get for processing one block. I'm
trying to implement two-way interleaving using the AVX extension. The main
instruction of interest here is 'vpmuludq', which does a double multiply
operation. The main concern is that there's a shortage of XMM registers,
as there are only 16 of them. I'm working around this by using memory
operands for the key values in 'vpmuludq' and hoping the processor cache does
its thing here. I'm expecting to complete the assembly implementation tomorrow.

regards,
Mamone


> Regards,
> /Niels
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-01-27 Thread Niels Möller
Maamoun TK  writes:

> Great! I believe this is the best we can get for processing one block.

One may be able to squeeze out one or two cycles more using the mulx
extension, which should make it possible to eliminate some of the move
instructions (I don't think moves cost any execution unit resources, but
they do consume decoding resources).

> I'm trying to implement two-way interleaving using AVX extension and
> the main instruction of interest here is 'vpmuludq' that does double
> multiply operation

My manual seems a bit confused about whether it's called pmuludq or vpmuludq. But
you're thinking of the instruction that does two 32x32 --> 64
multiplies? It will be interesting to see how that works out! It does
half the work compared to a 64 x 64 --> 128 multiply instruction, but
accumulation/folding may get more efficient by using vector registers.
(There seems to also be an avx variant doing four 32x32 --> 64
multiplies, using 256-bit registers).
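For clarity, the arithmetic of (v)pmuludq on one 128-bit register can be modeled like this (a rough sketch of the instruction semantics, ignoring register encoding):

```python
# Model of (v)pmuludq: multiply the low 32 bits of each 64-bit lane,
# producing one full 64-bit product per lane.
M32 = (1 << 32) - 1

def pmuludq(a, b):
    # a, b: lists of two 64-bit lane values (one xmm register each)
    return [(x & M32) * (y & M32) for x, y in zip(a, b)]
```

The high 32 bits of each lane are ignored, which is exactly what makes the instruction convenient for radix-2^26 limbs: each limb fits the low half of a lane with headroom for the 64-bit product.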

> the main concern here is that there's a shortage of XMM registers, as
> there are only 16 of them. I'm working around this by using memory
> operands for the key values in 'vpmuludq' and hoping the processor cache does
> its thing here.

Reading cached values from memory is usually cheap. So probably fine as
long as values modified are kept in registers.

> I'm expecting to complete the assembly implementation tomorrow.

If my analysis of the single-block code is right, I'd expect it to be
rather important to trim the number of instructions per block.

Regards,
/Niels



Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-04-30 Thread Maamoun TK
I've added Poly1305 optimization based on radix 26 using AVX2 extension for
x86_64 architecture with fat build support, the patch yields significant
speedup compared to upstream.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/46
I've also fixed the conflicts for PPC, S390x, and Arm64 patches of Poly1305.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41

regards,
Mamone


On Fri, Jan 28, 2022 at 8:59 AM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > Great! I believe this is the best we can get for processing one block.
>
> One may be able to squeeze out one or two cycles more using the mulx
> extension, which should make it possible to eliminate some of the move
> instructions (I don't think moves cost any execution unit resources, but
> they do consume decoding resources).
>
> > I'm trying to implement two-way interleaving using AVX extension and
> > the main instruction of interest here is 'vpmuludq' that does double
> > multiply operation
>
> My manual seems a bit confused if it's called pmuludq or vpmuludq. But
> you're thinking of the instruction that does two 32x32 --> 64
> multiplies? It will be interesting to see how that works out! It does
> half the work compared to a 64 x 64 --> 128 multiply instruction, but
> accumulation/folding may get more efficient by using vector registers.
> (There seems to also be an avx variant doing four 32x32 --> 64
> multiplies, using 256-bit registers).
>
> > the main concern here is there's a shortage of XMM registers as
> > there are 16 of them, I'm working on addressing this issue by using
> memory
> > operands of key values for 'vpmuludq' and hope the processor cache do his
> > thing here.
>
> Reading cached values from memory is usually cheap. So probably fine as
> long as values modified are kept in registers.
>
> > I'm expecting to complete the assembly implementation tomorrow.
>
> If my analysis of the single-block code is right, I'd expect it to be
> rather important to trim number of instructions per block.
>
> Regards,
> /Niels
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-02 Thread Niels Möller
Maamoun TK  writes:

> I've added Poly1305 optimization based on radix 26 using AVX2 extension for
> x86_64 architecture with fat build support, the patch yields significant
> speedup compared to upstream.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/46

Cool. Do I get it right that, when AVX2 is enabled, single-block
poly1305 will also use radix 2^26?

Regards,
/Niels



Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-03 Thread Maamoun TK
On Tue, May 3, 2022 at 8:43 AM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > I've added Poly1305 optimization based on radix 26 using AVX2 extension
> for
> > x86_64 architecture with fat build support, the patch yields significant
> > speedup compared to upstream.
> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/46
>
> Cool. Do I get it right, that when AVX2 is enabled, also single block
> poly1305 will use radix 2^26?
>

Hmm right, that didn't cross my mind. I'll add a 2^64 -> 2^26 procedure at
the prologue of _nettle_poly1305_4core() and a 2^26 -> 2^64 procedure at the
epilogue to work around this.


> Regards,
> /Niels
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-03 Thread Niels Möller
Maamoun TK  writes:

> hmm right, didn't cross my mind. I'll add 2^64 -> 2^26 procedure at
> prologue of _nettle_poly1305_4core() and  2^26 -> 2^64 at epilogue to
> workaround this.

If possible, I think it would be nice to let subkeys stored in struct
poly1305_ctx stay in radix-2^64, and compute the radix-2^26 versions
into registers when needed in _nettle_poly1305_4core. And thus treat
the subkeys stored in the struct as const, even if they technically aren't
declared const.

I take it you need more radix-2^26 subkey constants than can fit in the
current struct layout anyway?

Regards,
/Niels



Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-03 Thread Maamoun TK
On Tue, May 3, 2022 at 9:26 AM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > hmm right, didn't cross my mind. I'll add 2^64 -> 2^26 procedure at
> > prologue of _nettle_poly1305_4core() and  2^26 -> 2^64 at epilogue to
> > workaround this.
>
> If possible, I think it would be nice to let subkeys stored in struct
> poly1305_ctx stay in radix-2^64, and compute the radix-2^26 versions
> into registers when needed in _nettle_poly1305_4core. And thus treat
> the subkeys stored in the struct as const, even if they technically aren't
> declared const.
>
> I take it you need more radix-2^26 subkey constants than can fit in the
> current struct layout anyway?
>

It needs 3 powers of the key, and the struct layout would need even more
slots if we also consider the pre-multiplied values of the subkeys. We can
either extend the layout, or use _nettle_poly1305_4core only for messages of
64 blocks and more, where this function begins to consume fewer cycles than
the average consumption of the 2^64 version.


> Regards,
> /Niels
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-04 Thread Maamoun TK
On Tue, May 3, 2022 at 9:55 AM Maamoun TK  wrote:

> On Tue, May 3, 2022 at 9:26 AM Niels Möller  wrote:
>
>> Maamoun TK  writes:
>>
>> > hmm right, didn't cross my mind. I'll add 2^64 -> 2^26 procedure at
>> > prologue of _nettle_poly1305_4core() and  2^26 -> 2^64 at epilogue to
>> > workaround this.
>>
>> If possible, I think it would be nice to let subkeys stored in struct
>> poly1305_ctx stay in radix-2^64, and compute the radix-2^26 versions
>> into registers when needed in _nettle_poly1305_4core. And thuse treat
>> the subkeys stored in the struct as const even if it technically isn't
>> const declared.
>>
>> I take it you need more radix-2^26 subkey constants than can fit in the
>> current struct layout anyway?
>>
>
> It needs 3 powers of key, also the struct layout has to have even more
> slots if we need to consider the pre-multiplied values of subkeys.
>

I'm thinking of extending the structure layout of 'poly1305_ctx' to hold
pre-computed powers of the key. How backward compatible would it be to append
additional key arrays to that structure?


> We can either extend the layout or use _nettle_poly1305_4core starting
> with 64 blocks and more as this function begins to consume less cycles than
> average consumptions of the 2^64 version.
>
>
>> Regards,
>> /Niels
>>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-04 Thread Niels Möller
Maamoun TK  writes:

> I'm thinking of extending the structure layout of 'poly1305_ctx' to take in
> pre-computed powers of key. How backward compatible would it be to append
> additional key arrays to that structure?

It would break ABI compatibility, and require a new library soname.
Other than that, it would be API compatible, code using nettle could be
recompiled with no source code changes.

So definitely possible, but not something to do lightly. I think it's
best to postpone until we have to break the ABI for other reasons.

>> We can either extend the layout or use _nettle_poly1305_4core starting
>> with 64 blocks and more as this function begins to consume less cycles than
>> average consumptions of the 2^64 version.

Do you need to do 64 blocks (16 iterations of the
_nettle_poly1305_4core loop) before it beats the single block radix-64
version? Then the precomputations must be rather costly? But due to the ABI
issues, I think we have to live with redoing them for each call.

I haven't looked closely at your code yet. It seems important that
_nettle_poly1305_4core can loop to process larger messages without
redoing precomputations. Can we use parallelism in the precomputation? I
imagine what needs to be done has the following steps:

1. Input is the key in radix 2^64, stored in r64[0] and r64[1] (the
   premultiplied values aren't helpful, I'm afraid).

2. Split into radix 2^26, five pieces that go into 32-bit (sub-)registers.
   Also premultiplied by 5 when useful. Call this K1.

3. Compute squared key, K2 = K1^2, and premultiplied by 5 when useful. In
   principle, squaring can be easier than a full multiply, 15 32x32 --> 64
   multiplies rather than 25, but that may not apply here since we want to
   do part of the reduction at the same time.

4. Compute K3 = K1 K2 and K4 = K2^2. If we don't try to do anything special
   for the squaring, at least these two multiply operations could be
   done in a SIMD fashion.

Also, I think the result after this precomputation probably doesn't have
to be in canonical representation, with each digit in the range 0 <= x <
2^26, it can most likely be made to work with a somewhat larger range.
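Step 2 of the above can be sketched in C. This is an illustrative sketch only (the function and variable names are not nettle's internals): it splits the 128-bit clamped key, given in radix 2^64 as r64[0] (low) and r64[1] (high), into five radix-2^26 digits, and also computes the premultiplied-by-5 copies used when wrapping around mod 2^130 - 5.

```c
#include <stdint.h>

/* Hypothetical helper: split a radix-2^64 key into radix 2^26 (K1). */
static void
key_to_radix26 (uint32_t k[5], uint32_t k5[4], const uint64_t r64[2])
{
  k[0] = r64[0] & 0x3ffffff;                              /* bits 0..25   */
  k[1] = (r64[0] >> 26) & 0x3ffffff;                      /* bits 26..51  */
  k[2] = ((r64[0] >> 52) | (r64[1] << 12)) & 0x3ffffff;   /* bits 52..77  */
  k[3] = (r64[1] >> 14) & 0x3ffffff;                      /* bits 78..103 */
  k[4] = (uint32_t) (r64[1] >> 40);     /* at most 24 bits for a 128-bit key */
  for (int i = 0; i < 4; i++)
    k5[i] = 5 * k[i + 1];       /* multiplied into terms that wrap mod 2^130-5 */
}
```

The same splitting then applies to the computed powers K2, K3, K4.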

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
___
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-06 Thread Maamoun TK
On Wed, May 4, 2022 at 7:47 PM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > I'm thinking of extending the structure layout of 'poly1305_ctx' to take
> in
> > pre-computed powers of key. How backward compatible would it be to append
> > additional key arrays to that structure?
>
> It would break ABI compatibility, and require a new library soname.
> Other than that, it would be API compatible, code using nettle could be
> recompiled with no source code changes.
>
> So definitely possible, but not something to do lightly. I think it's
> best to postpone until have reason to break the ABI for other reasons.
>
> >> We can either extend the layout or use _nettle_poly1305_4core starting
> >> with 64 blocks and more as this function begins to consume less cycles
> than
> >> average consumptions of the 2^64 version.
>
> Do you need to do 64 blocks (16 iterations of the
> _nettle_poly1305_4core loop) before it beats the single block radix-64
> version? Then precomputations must be rather costly? But due to the abi
> issues, I think we have to live with redoing them for each call.
>
> I haven't looked closely at your code yet. It seems important that
> _nettle_poly1305_4core can loop to process larger messages without
> redoing precomputations. Can we use parallelism in the precomputation? I
> imagine what needs to be done has the following steps:
>
> 1. Input is the key in radix 2^64, stored in r64[0] and r64[1] (the
>premultiplied values aren't helpful, I'm afraid).
>
> 2. Split into radix 2^26, five pieces, that goes into 32-bit
> (sub-)registers.
>Also premultiplied by 5 when useful. Call this K1.
>
> 3. Compute squared key, K2 = K1^2, and premultiplied by 5 when useful. In
>principle, squaring can be easier than multiply, 15 32x32 --> 64
>multiplies rather than 25, but may not apply here since we want to
>part of the reduction at the same time.
>
> 4. Compute K3 = K1 K2 and K4 = K2^2. If we don't try to do any special
>for the squaring, at least these two multiply operations could be
>done in a SIMD fashion.
>

I tried these steps to compute the key powers. Squaring K2 = K1^2 does
use 15 32x32 multiplications rather than 25, divided into 3 groups that
are summed together. It also needs to pre-multiply the key by 2 to
exploit the perfect square. A full reduction round is applied to the
products, as I couldn't find an easy way to partly reduce them; the code
then computes K3 = K1 K2 and K4 = K2 K2 simultaneously in SIMD fashion.
I used XMM registers to reduce the traffic between the stack and the
registers. These improvements yield a slight speedup of the
precomputation, which as a whole still costs more cycles than the 2^64
version for fewer than 64 blocks.
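The 15-versus-25 multiplication count can be illustrated with plain scalar C (purely illustrative digit code, not the SIMD implementation): the ten cross terms of a 5-digit square appear twice, so after premultiplying one operand by 2 only 5 square terms and 10 doubled cross terms remain.

```c
#include <stdint.h>

/* Reference: 25-product schoolbook multiply of two 5-digit radix-2^26
   numbers, accumulating unreduced partial sums in t[0..8]. */
static void
mul5_schoolbook (uint64_t t[9], const uint32_t a[5], const uint32_t b[5])
{
  for (int i = 0; i < 9; i++) t[i] = 0;
  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      t[i + j] += (uint64_t) a[i] * b[j];       /* 25 products */
}

/* Square with only 15 products: 5 squares + 10 doubled cross terms. */
static void
sqr5 (uint64_t t[9], const uint32_t a[5])
{
  uint32_t a2[5];
  for (int i = 0; i < 5; i++)
    a2[i] = 2 * a[i];                           /* pre-multiply by 2 */
  for (int i = 0; i < 9; i++) t[i] = 0;
  for (int i = 0; i < 5; i++)
    {
      t[2 * i] += (uint64_t) a[i] * a[i];       /* 5 square terms */
      for (int j = i + 1; j < 5; j++)
        t[i + j] += (uint64_t) a2[i] * a[j];    /* 10 doubled cross terms */
    }
}
```

With 26-bit digits each product fits in 52 bits, so the 64-bit accumulators cannot overflow before reduction.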

In case extending the layout of the 'poly1305_ctx' structure is not an
option, I would suggest applying a message-length threshold in an
arch-specific manner. How do you think we can do that?

regards,
Mamone


> Also, I think the result after this precomputation probably doesn't have
> to be in canonical representation, with each digit in the range 0 <= x <
> 2^26, it can most likely be made to work with a somewhat larger range.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-07 Thread Niels Möller
Maamoun TK  writes:

> In case extending the layout of 'poly1305_ctx' structure is not an option,
> I would suggest applying that threshold of message length in an
> arch-specific manner. How do you think we can do that?

Based on how thresholds are handled in gmp, I'd suggest a

#define POLY1305_CORE4_THRESHOLD

which can be defined as an arch-dependent constant for non-fat builds,
and as an alias for a global variable in fat builds. The variable should
hold some sane initial value, and be updated as part of the fat setup of
the function using that threshold. For thread safety, it would make some
sense to use memory barriers to ensure that the threshold is updated
before the function pointer, but that's perhaps not strictly necessary.
One could consider setting different values depending on processor
model, but I think that's beyond an initial version.
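A minimal sketch of that arrangement (all names here are hypothetical, chosen for illustration; nettle's actual fat-build machinery differs in detail):

```c
#include <stddef.h>

/* Pretend this is set by the build system for a fat build. */
#define FAT_BUILD 1

#if FAT_BUILD
/* Global with a sane default; fat setup updates it, ideally before the
   corresponding function pointer is published. */
static unsigned _poly1305_core4_threshold = 64;
# define POLY1305_CORE4_THRESHOLD _poly1305_core4_threshold
#else
/* Arch-dependent compile-time constant for non-fat builds. */
# define POLY1305_CORE4_THRESHOLD 64
#endif

/* Dispatch: the 4-way core pays per-call precomputation of key powers,
   so it is only selected for sufficiently long messages. */
static int
use_core4 (size_t blocks)
{
  return blocks >= POLY1305_CORE4_THRESHOLD;
}
```

In a fat build, the runtime setup code could then lower or raise the variable per processor model without recompiling callers.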

It would be good to add some size argument to nettle-benchmark to
make it easier to choose the right threshold. If we end up with more
thresholds like this, we could consider tuning them more automatically,
analogous to the gmp/tune/tuneup program. But for start, manual tuning
is good enough.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-09 Thread Maamoun TK
On Sat, May 7, 2022 at 10:10 AM Niels Möller  wrote:

> Maamoun TK  writes:
>
> > In case extending the layout of 'poly1305_ctx' structure is not an
> option,
> > I would suggest applying that threshold of message length in an
> > arch-specific manner. How do you think we can do that?
>
> Based on how thresholds are handled in gmp, I'd suggest a
>
> #define POLY1305_CORE4_THRESHOLD
>
> which can be defined as an arch-dependent constant for non-fat builds,
> and as an alias for a global variable in fat builds, the variable should
> how some sane initial value, and be updated as part of the fat setup of
> the function using that threshold. For thread safety, it would make some
> sense with memory barriers to ensure that the threshold is updated
> before the function pointer, but perhaps not strictly necessary. One
> could consider setting different values depending on processor model,
> but I think that's beyond an initial version.
>
> It would be good to add some size argument to nettle-benchmark to
> make it easier to choose right threshold. If we end up with more
> thresholds like this, we could consider tuning them more automatically,
> analogous to the gmp/tune/tuneup program. But for start, manual tuning
> is good enough.
>

I made Poly1305 core4 compatible with the 2^64 version on x86_64, in
addition to setting a threshold of 64 blocks. On arm64 it seems a proper
threshold for Poly1305 core2 is 32 blocks, so I set that too. The
optimized cores for PPC and s390x already achieve competitive speed at
the minimum supported message length (4 blocks), so there is no need for
a threshold in those cases.


> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-14 Thread Niels Möller
Maamoun TK  writes:

>  I created merge requests that have improvements of Poly1305 for arm64,
> powerpc64, and s390x architectures by following using two-way interleaving.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.

I've had a closer look at the ppc merge request #39.

I think it would be good to do the single block radix 2^44 version first
(I'm assuming that in itself is an improvement over the C code, and
over using radix 2^64?). Are 44-bit pieces ideal (130 = 44+44+42), or
would anything get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 +
56 + 18?
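For concreteness, the 130 = 44 + 44 + 42 split applied to the 128-bit clamped key is a few shifts and masks (illustrative code, not nettle's; the key is assumed given in radix 2^64 as r64[0] low, r64[1] high):

```c
#include <stdint.h>

/* Split a radix-2^64 key into three radix-2^44 pieces (130 = 44+44+42;
   the top piece holds at most 40 bits for a 128-bit key). */
static void
key_to_radix44 (uint64_t r44[3], const uint64_t r64[2])
{
  r44[0] = r64[0] & 0xfffffffffff;                             /* bits 0..43  */
  r44[1] = ((r64[0] >> 44) | (r64[1] << 20)) & 0xfffffffffff;  /* bits 44..87 */
  r44[2] = r64[1] >> 24;                                       /* bits 88..127 */
}
```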

For the 4-way code, the name and organization seem inspired by
chacha_4core, which is a bit different since it also has a four-block
output that the caller has to be aware of. I think it would be better
to look at the recent ghash code. Maybe one can have an internal
_poly1305_update, following similar conventions as _ghash_update? Then
the C code doesn't need to know how many blocks are done at a time,
which should make things a bit simpler (although the assembly code would
need logic to do left-over blocks, just like for ghash).

> OpenSSL is still ahead in terms of performance speed since it uses 4-way
> interleaving or maybe more!!
> Increasing the interleaving ways more than two has nothing to do with
> parallelism since the execution units are already saturated by using 2-ways
> for the three architectures. The reason behind the performance improvement
> is the number of execution times of reduction procedure is cutted by half
> for 4-way interleaving since the products of multiplying state parts by key
> can be combined before the reduction phase. Let me know if you are
> interested in doing that on nettle!

Good to know that 2-way is sufficient to saturate execution units. Going
to 4-way does have a startup cost for each call, since we don't have
space for extra pre-computed powers. But for large messages, we'll get
the best speed if we can make reduction as cheap as possible.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-05-28 Thread Maamoun TK
On Sat, May 14, 2022 at 8:07 PM Niels Möller  wrote:

> Maamoun TK  writes:
>
> >  I created merge requests that have improvements of Poly1305 for arm64,
> > powerpc64, and s390x architectures by following using two-way
> interleaving.
> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
> > The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> > and 382.65% speedup for s390x.
>
> I've had a closer look at the ppc merge request #39.
>
> I think it would be good to do the single block radix 2^44 version first
> (I'm assuming that's in itself is an improvement over the C code, and
> over using radix 2^64?).


I agree we should go with the single block version first. It seems radix
2^44 has a drawback in single-block performance compared to the C code,
which is not the case for radix 2^64, which is superior in this matter:
I got 657 Mbyte/s update speed for the 2^64 version, whereas the C code
(radix 2^26) produces 470 Mbyte/s on POWER9. With that said, I've pushed
a patch with a 2^64 implementation for single-block update, with fat
build support:
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/47
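As a point of reference, a radix-2^64 single-block loop can be sketched in portable C with unsigned __int128. This follows the well-known structure of OpenSSL's portable C code and is a sketch, not nettle's implementation; the little-endian loads and the final stores assume a little-endian host.

```c
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

/* Load a 64-bit little-endian word (little-endian host assumed). */
static uint64_t
le64 (const uint8_t *p)
{
  uint64_t v;
  memcpy (&v, p, 8);
  return v;
}

/* Illustrative one-block-at-a-time Poly1305 in radix 2^64. */
static void
poly1305_radix64 (uint8_t tag[16], const uint8_t key[32],
                  const uint8_t *msg, size_t len)
{
  uint64_t r0 = le64 (key)     & 0x0ffffffc0fffffffULL;  /* clamped key */
  uint64_t r1 = le64 (key + 8) & 0x0ffffffc0ffffffcULL;
  uint64_t s1 = r1 + (r1 >> 2);        /* r1 * 5/4; valid since 4 | r1 */
  uint64_t h0 = 0, h1 = 0, h2 = 0, c;
  u128 d0, d1;

  while (len > 0)
    {
      uint8_t block[16];
      uint64_t padbit = 1;
      size_t n = len < 16 ? len : 16;
      if (n == 16)
        memcpy (block, msg, 16);
      else
        {
          /* Final partial block: the 0x01 pad byte replaces the 2^128 bit. */
          memset (block, 0, 16);
          memcpy (block, msg, n);
          block[n] = 1;
          padbit = 0;
        }
      /* h += block + padbit * 2^128 */
      d0 = (u128) h0 + le64 (block);
      h0 = (uint64_t) d0;
      d1 = (u128) h1 + (d0 >> 64) + le64 (block + 8);
      h1 = (uint64_t) d1;
      h2 += (uint64_t) (d1 >> 64) + padbit;

      /* h *= r, partially reduced mod 2^130 - 5 */
      d0 = (u128) h0 * r0 + (u128) h1 * s1;
      d1 = (u128) h0 * r1 + (u128) h1 * r0 + (u128) h2 * s1;
      h2 *= r0;                        /* fits: h2 is small, r0 < 2^60 */
      h0 = (uint64_t) d0;
      d1 += d0 >> 64;
      h1 = (uint64_t) d1;
      h2 += (uint64_t) (d1 >> 64);
      c = (h2 >> 2) + (h2 & ~(uint64_t) 3);   /* = 5 * (h2 >> 2) */
      h2 &= 3;
      h0 += c;
      h1 += (h0 < c);

      msg += n;
      len -= n;
    }

  /* Final reduction: if h >= 2^130 - 5, subtract the modulus. */
  d0 = (u128) h0 + 5;
  d1 = (u128) h1 + (d0 >> 64);
  uint64_t g2 = h2 + (uint64_t) (d1 >> 64);
  uint64_t mask = 0 - (g2 >> 2);       /* all ones iff h + 5 >= 2^130 */
  h0 = (h0 & ~mask) | ((uint64_t) d0 & mask);
  h1 = (h1 & ~mask) | ((uint64_t) d1 & mask);

  /* tag = (h + s) mod 2^128, little-endian */
  d0 = (u128) h0 + le64 (key + 16);
  h0 = (uint64_t) d0;
  h1 = h1 + le64 (key + 24) + (uint64_t) (d0 >> 64);
  memcpy (tag, &h0, 8);
  memcpy (tag + 8, &h1, 8);
}
```

The serial dependency between blocks (h must be multiplied by r before the next block is added) is what the interleaved versions break up by using key powers.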


> Is 44 bit pieces ideal (130 = 44+44+42), or
> would anything get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 +
> 56 + 18)?
>

For multi-block processing, it seems to me 44-bit pieces are ideal. In
the case B = 2^56, I'm trying to figure out how to calculate B^3 R_1 5 =
2^38 R_1 5, which doesn't fit in 64 bits since R_1 has 56 bits. For
2^48, I don't see any difference in the equations compared with 2^44 for
the multiplication and reduction phases.


> For the 4-way code, the name and organization seems inspired by
> chacha_4core, which is a bit different since it also has a four-block
> output, and then the caller has to be aware. I think it would be better
> to look at the recent ghash. Maybe one can have an internal
> _poly1306_update, following similar conventions as _ghash_update? Then
> the C code doesn't need to know how many blocks are done at a time,
> which should make things a bit simpler (although the assembly code would
> need logic to do left-over blocks, just like for ghash).
>

I agree. I've pushed a new MR,
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48, with a
_poly1305_update implementation for PowerPC based on radix 2^44 for
multi-block processing, and radix 2^64 to handle single-block update.
The threshold can live within the assembly file; in this case it is set
to 12 blocks, since the radix 2^64 implementation is relatively faster
below that. I like the structure so far; please take a look and let me
know if it fits well, so I can implement it for the other architectures.


> > OpenSSL is still ahead in terms of performance speed since it uses 4-way
> > interleaving or maybe more!!
> > Increasing the interleaving ways more than two has nothing to do with
> > parallelism since the execution units are already saturated by using
> 2-ways
> > for the three architectures. The reason behind the performance
> improvement
> > is the number of execution times of reduction procedure is cutted by half
> > for 4-way interleaving since the products of multiplying state parts by
> key
> > can be combined before the reduction phase. Let me know if you are
> > interested in doing that on nettle!
>
> Good to know that 2-way is sufficient to saturate execution units. Going
> to 4-way does have a startup cost for each call, since we don't have
> space for extra pre-computed powers. But for large messages, we'll get
> the best speed if we can make reduction as cheap as possible.
>

I understand 4-way has nothing to offer regarding 'vmsumudm' parallelism
in the multiplication phase, but as you've mentioned, reducing the
product once per 4 blocks improves performance, besides increasing the
parallelism of side procedures like the R64_TO_R44_4B macro, which
yields a significant enhancement (approximately double the performance)
over the 2-way implementation on PowerPC.

regards,
Mamone


> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>


Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-10-13 Thread Maamoun TK
It seems the Debian release cycle takes ~2 years for every new version
recently, https://wiki.debian.org/DebianReleases, so I pushed an MR that
enables testing power9-specific code,
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/53, since it's
too early to have qemu v7+ in the stable release.

Also, I think we're set to proceed with the Poly1305 multi-block patch
based on radix 2^44 for PowerPC,
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48, to approve
the new layout for processing multiple blocks.

regards,
Mamone

On Sun, May 29, 2022 at 4:17 AM Maamoun TK 
wrote:

> On Sat, May 14, 2022 at 8:07 PM Niels Möller  wrote:
>
>> Maamoun TK  writes:
>>
>> >  I created merge requests that have improvements of Poly1305 for arm64,
>> > powerpc64, and s390x architectures by following using two-way
>> interleaving.
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
>> > The patches have 41.88% speedup for arm64, 142.95% speedup for
>> powerpc64,
>> > and 382.65% speedup for s390x.
>>
>> I've had a closer look at the ppc merge request #39.
>>
>> I think it would be good to do the single block radix 2^44 version first
>> (I'm assuming that's in itself is an improvement over the C code, and
>> over using radix 2^64?).
>
>
> I agree we should go with single block first. It seems radix 2^44 has a
> drawback in terms of single block performance in comparison to C code which
> is not the case for radix 2^64 that is superior in this matter. I got "657
> Mbyte/s" update speed of 2^64 whereas C code (radix 2^26) produces "470
> Mbyte/s" of update speed on POWER9. With that said, I've pushed a patch of
> 2^64 implementation for single block update with fat build support
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/47
>
>
>> Is 44 bit pieces ideal (130 = 44+44+42), or
>> would anything get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 +
>> 56 + 18)?
>>
>
> For multi-block processing, it seems to me 44 bit pieces are ideal. In
> case B = 2^56 I'm trying to figure how to calculate B^3 R_1 5 = 2^38 R_1 5
> which doesn't fit in 64-bit since R_1 of degree 56. For 2^48, I don't see
> any difference in equations when comparing with 2^44 for multiplication and
> reduction phases.
>
>
>> For the 4-way code, the name and organization seems inspired by
>> chacha_4core, which is a bit different since it also has a four-block
>> output, and then the caller has to be aware. I think it would be better
>> to look at the recent ghash. Maybe one can have an internal
>> _poly1306_update, following similar conventions as _ghash_update? Then
>> the C code doesn't need to know how many blocks are done at a time,
>> which should make things a bit simpler (although the assembly code would
>> need logic to do left-over blocks, just like for ghash).
>>
>
> I agree, I've pushed a new MR
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48 of
> poly1306_update implementation for PowerPC based on radix 2^44 for
> multi-block processing, and radix 2^64 to handle single-block update. The
> threshold can fit within the assembly file which is in this case set to "12
> blocks" since radix 2^64 implementation has relatively superior speed. I
> like the structure so far, please take a look and let me know if it fits
> well so I can implement it for other architectures.
>
>
>> > OpenSSL is still ahead in terms of performance speed since it uses 4-way
>> > interleaving or maybe more!!
>> > Increasing the interleaving ways more than two has nothing to do with
>> > parallelism since the execution units are already saturated by using
>> 2-ways
>> > for the three architectures. The reason behind the performance
>> improvement
>> > is the number of execution times of reduction procedure is cutted by
>> half
>> > for 4-way interleaving since the products of multiplying state parts by
>> key
>> > can be combined before the reduction phase. Let me know if you are
>> > interested in doing that on nettle!
>>
>> Good to know that 2-way is sufficient to saturate execution units. Going
>> to 4-way does have a startup cost for each call, since we don't have
>> space for extra pre-computed powers. But for large messages, we'll get
>> the best speed if we can make reduction as cheap as possible.
>>
>
> I understand 4-way has nothing to offer regarding 'vmsumudm' parallelism
> for multiplication phase but as you've mentioned reduce the product per
> 4-blocks implies performance improvement besides increasing parallelism
> level for side procedures like R64_TO_R44_4B macro which yield significant
> enhancement (approximately double the performance) over 2-way
> implementation on powerpc.
>
> regards,
> Mamone
>
>
>> Regards,
>> /Niels
>>
>> --
>> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
>> Internet email is subject to wholesale government surveillance.
>>
>

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

2022-10-13 Thread Niels Möller
Maamoun TK  writes:

> It seems Debian release cycle takes ~2 year for every new version recently
> https://wiki.debian.org/DebianReleases so I pushed a MR that enables
> testing power9-specific code
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/53 since it's too
> early to have qemu v7+ on stable release.

Thanks for doing this. Merged now.

> Also, I think we're set to proceed with Poly1305 multi-block patch based on
> radix 2^44 for PowerPC
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48 to approve the
> new layout of process multiple blocks.

I think it would be good to define the internal _poly1305_update with an
interface similar to _ghash_update, in particular, always doing complete
blocks. So we don't need any logic for partial blocks in the assembly
files.
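That convention can be sketched with a toy context (the struct and function names here are hypothetical; the real _ghash_update/_poly1305_update signatures differ in detail): the C wrapper buffers any partial block, so the assembly entry point only ever sees whole blocks.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Toy context; the real poly1305_ctx holds the state and key instead. */
struct ctx { uint8_t block[BLOCK_SIZE]; unsigned index; unsigned processed; };

/* Stand-in for the assembly entry point: whole blocks only. */
static void
update_blocks (struct ctx *c, size_t blocks, const uint8_t *data)
{
  (void) data;                  /* a real core would absorb the data here */
  c->processed += blocks;
}

/* C wrapper: buffers partial blocks so update_blocks never sees one. */
static void
update (struct ctx *c, size_t length, const uint8_t *data)
{
  if (c->index > 0)
    {
      unsigned left = BLOCK_SIZE - c->index;
      if (length < left)
        {
          memcpy (c->block + c->index, data, length);
          c->index += (unsigned) length;
          return;
        }
      memcpy (c->block + c->index, data, left);
      update_blocks (c, 1, c->block);
      c->index = 0;
      data += left;
      length -= left;
    }
  update_blocks (c, length / BLOCK_SIZE, data);
  data += length - (length % BLOCK_SIZE);
  memcpy (c->block, data, length % BLOCK_SIZE);
  c->index = (unsigned) (length % BLOCK_SIZE);
}
```

With this split, the threshold logic and any left-over-block handling stay in one place per architecture, and the C side stays oblivious to how many blocks the core consumes per iteration.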

It would also be interesting to do a C implementation for base 2^44 (for
64-bit architectures), to see how that compares. I know I advocated base
2^64 earlier, at least for single blocks, but it might be that base 2^44
(including key setup with a matching layout, similar to the base 2^26 C
implementation) will beat base 2^64 almost everywhere. I might be able
to try something out over the weekend or next week.

Regards, 
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.