Maamoun TK <maamoun...@googlemail.com> writes:

> I created merge requests that improve Poly1305 for the arm64,
> powerpc64, and s390x architectures by using two-way interleaving.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
> The patches give a 41.88% speedup for arm64, a 142.95% speedup for
> powerpc64, and a 382.65% speedup for s390x.

I've had a closer look at the ppc merge request #39.

I think it would be good to do the single-block radix 2^44 version first
(I'm assuming that in itself is an improvement over the C code, and
over using radix 2^64?). Are 44-bit pieces ideal (130 = 44 + 44 + 42), or
would anything get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 +
56 + 18?
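
To make the limb layouts above concrete, here is a minimal C sketch of
splitting one padded 16-byte block into three limbs of configurable
width. It is not taken from the merge request; the function name
split_limbs and its parameters are hypothetical, and it assumes the
block has already been loaded as two little-endian 64-bit words. One
consideration when comparing layouts (an assumption about the
multiplication step, not something measured here): a product of two
44-bit limbs is 88 bits, which leaves headroom for summing several
products in a 128-bit accumulator, while wider limbs leave less.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split the value hibit*2^128 + t1*2^64 + t0 into
   three limbs holding w0, w1, and the remaining bits.
   Requires w0 < 64 < w0 + w1 < 128, which holds for 44/44, 48/48
   and 56/56. hibit is the Poly1305 padding bit (0 or 1). */
void
split_limbs (uint64_t t0, uint64_t t1, unsigned hibit,
             unsigned w0, unsigned w1, uint64_t limb[3])
{
  assert (w0 > 0 && w0 < 64 && w1 < 64);
  assert (w0 + w1 > 64 && w0 + w1 < 128);
  assert (hibit <= 1);

  uint64_t mask0 = (UINT64_C (1) << w0) - 1;
  uint64_t mask1 = (UINT64_C (1) << w1) - 1;

  /* Bits 0 .. w0-1 */
  limb[0] = t0 & mask0;
  /* Bits w0 .. w0+w1-1, straddling the two input words */
  limb[1] = ((t0 >> w0) | (t1 << (64 - w0))) & mask1;
  /* Bits w0+w1 .. 127, plus the padding bit at position 128 */
  limb[2] = (t1 >> (w0 + w1 - 64))
    | ((uint64_t) hibit << (128 - w0 - w1));
}
```

With w0 = w1 = 44 the padding bit lands at bit 40 of the top limb; with
48/48 it lands at bit 32, and with 56/56 at bit 16.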

For the 4-way code, the name and organization seem inspired by
chacha_4core, which is a bit different since it also has a four-block
output, and then the caller has to be aware. I think it would be better
to look at the recent ghash. Maybe one can have an internal
_poly1305_update, following similar conventions as _ghash_update? Then
the C code doesn't need to know how many blocks are done at a time,
which should make things a bit simpler (although the assembly code would
need logic to do left-over blocks, just like for ghash).
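
A rough C sketch of such a convention, under the assumption that the
internal function takes a block count and returns a pointer just past
the data it consumed. Everything here is hypothetical and for
illustration only: struct sketch_ctx, sketch_poly1305_update, and the
two process_* helpers are stand-ins, and the "arithmetic" is a
placeholder rather than real Poly1305.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 16

/* Hypothetical stand-in for the real Poly1305 state. The counters only
   exist to make the batching visible. */
struct sketch_ctx
{
  uint64_t acc;
  unsigned wide_calls;
  unsigned single_calls;
};

/* Placeholder for a one-block routine (not real Poly1305 math). */
static void
process_1 (struct sketch_ctx *ctx, const uint8_t *m)
{
  ctx->acc += m[0];
  ctx->single_calls++;
}

/* Placeholder for a routine that handles four blocks per call. */
static void
process_4 (struct sketch_ctx *ctx, const uint8_t *m)
{
  for (unsigned i = 0; i < 4; i++)
    ctx->acc += m[i * SKETCH_BLOCK_SIZE];
  ctx->wide_calls++;
}

/* The caller passes any number of whole blocks. The batch size and the
   left-over handling stay inside this function. Returns a pointer to
   the first byte after the processed blocks. */
const uint8_t *
sketch_poly1305_update (struct sketch_ctx *ctx,
                        size_t blocks, const uint8_t *data)
{
  assert (ctx != NULL);

  for (; blocks >= 4; blocks -= 4, data += 4 * SKETCH_BLOCK_SIZE)
    process_4 (ctx, data);

  for (; blocks > 0; blocks--, data += SKETCH_BLOCK_SIZE)
    process_1 (ctx, data);

  return data;
}
```

With this shape, the generic C code only splits the input into whole
blocks plus a partial tail; it never needs to know whether the
back end works on one, two, or four blocks at a time.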

> OpenSSL is still ahead in terms of performance since it uses 4-way
> interleaving or maybe more!!
> Increasing the interleaving beyond two ways has nothing to do with
> parallelism, since the execution units are already saturated by 2-way
> interleaving on the three architectures. The reason for the performance
> improvement is that the number of times the reduction procedure runs is
> cut in half with 4-way interleaving, since the products of multiplying
> state parts by key can be combined before the reduction phase. Let me
> know if you are interested in doing that on nettle!

Good to know that 2-way is sufficient to saturate execution units. Going
to 4-way does have a startup cost for each call, since we don't have
space for extra pre-computed powers. But for large messages, we'll get
the best speed if we can make reduction as cheap as possible.
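
The trade-off can be illustrated with a toy model in C. This is only a
sketch: it uses a small modulus (2^19 - 1) in place of 2^130 - 5 so that
everything fits in uint64_t, and the names toy_poly_1way and
toy_poly_4way are made up for the example. The 4-way variant computes
r^2, r^3, r^4 on every call (the startup cost) and then performs one
reduction per four blocks, relying on the identity

  ((((h + m0) r + m1) r + m2) r + m3) r
    = (h + m0) r^4 + m1 r^3 + m2 r^2 + m3 r.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy modulus standing in for 2^130 - 5. With all inputs below 2^19,
   each product is below 2^39, so four of them can be summed in a
   uint64_t without overflow. */
#define TOY_P UINT64_C(524287)  /* 2^19 - 1 */

/* One block per step: one multiply and one reduction per block.
   Requires r < TOY_P and every m[i] < TOY_P. */
uint64_t
toy_poly_1way (uint64_t r, const uint32_t *m, size_t n)
{
  assert (r < TOY_P);
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++)
    h = ((h + m[i]) * r) % TOY_P;
  return h;
}

/* Four blocks per step: the four products are summed unreduced and
   reduced once. The powers of r are recomputed on each call, which is
   the per-call startup cost when they cannot be stored. */
uint64_t
toy_poly_4way (uint64_t r, const uint32_t *m, size_t n)
{
  assert (r < TOY_P);
  uint64_t r2 = (r * r) % TOY_P;
  uint64_t r3 = (r2 * r) % TOY_P;
  uint64_t r4 = (r3 * r) % TOY_P;

  uint64_t h = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    h = ((h + m[i]) * r4 + m[i + 1] * r3
         + m[i + 2] * r2 + m[i + 3] * r) % TOY_P;

  /* Left-over blocks, one at a time. */
  for (; i < n; i++)
    h = ((h + m[i]) * r) % TOY_P;
  return h;
}
```

Both functions return the same value for the same input; the 4-way
version just trades three extra multiplications at the start of each
call for fewer reductions in the main loop.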

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.