Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

Niels Möller Wed, 19 Jan 2022 12:06:55 -0800

Maamoun TK <maamoun...@googlemail.com> writes:

> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.
>
> OpenSSL is still ahead in terms of performance, since it uses 4-way
> interleaving or maybe more!
> Increasing the interleaving beyond two ways has nothing to do with
> parallelism, since the execution units are already saturated by 2-way
> interleaving on the three architectures. The reason for the performance
> improvement is that the reduction procedure is executed half as many
> times with 4-way interleaving, since the products of multiplying state
> parts by the key can be combined before the reduction phase. Let me know
> if you are interested in doing that in nettle!


Interesting. I haven't paid much attention to the poly1305
implementation since it was added back in 2013. The C implementation
doesn't try to use wider multiplication than 32x32 --> 64, which is poor
for 64-bit platforms. Maybe we could use unsigned __int128 if we can
write a configure test to check if it is available and likely to be
efficient?
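
To make the wider-multiplication idea concrete, here is a minimal sketch
(not nettle's actual code). It computes a full 64x64 --> 128 product with
unsigned __int128 when available, and otherwise falls back to four
32x32 --> 64 products. The names struct wide_product, mul_64x64 and
mul_64x64_portable are hypothetical. Checking __SIZEOF_INT128__, a macro
that GCC and Clang define when the type is supported, is only one assumed
way to detect it; a real configure test would also have to judge whether
the type is likely to be efficient on the target.

```c
#include <stdint.h>

/* Hypothetical helper type: the two halves of a 128-bit product. */
struct wide_product
{
  uint64_t hi;
  uint64_t lo;
};

/* Portable version built from four 32x32 --> 64 partial products. */
static struct wide_product
mul_64x64_portable (uint64_t a, uint64_t b)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;
  uint64_t p00 = a0 * b0;
  uint64_t p01 = a0 * b1;
  uint64_t p10 = a1 * b0;
  uint64_t p11 = a1 * b1;
  /* Sum of three values below 2^32, so it cannot overflow 64 bits. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;
  struct wide_product r;

  r.lo = (mid << 32) | (uint32_t) p00;
  r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
  return r;
}

/* Uses unsigned __int128 where the compiler advertises it (assumption:
   __SIZEOF_INT128__ is the detection mechanism), else the portable path. */
static struct wide_product
mul_64x64 (uint64_t a, uint64_t b)
{
#ifdef __SIZEOF_INT128__
  unsigned __int128 p = (unsigned __int128) a * b;
  struct wide_product r;

  r.lo = (uint64_t) p;
  r.hi = (uint64_t) (p >> 64);
  return r;
#else
  return mul_64x64_portable (a, b);
#endif
}
```

On typical 64-bit targets a compiler can turn the __int128 path into one
or two multiply instructions, while the portable path needs four
multiplications plus the carry handling.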

For most efficient interleaving, I take it one should precompute some
powers of the key, similar to how it's done in the recent gcm code?
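
As a rough illustration of the precomputed-powers idea, the sketch below
evaluates the same polynomial one block at a time and two blocks at a
time. It is a toy under stated assumptions: the modulus is 2^31 - 1 so
that everything fits in uint64_t (the real Poly1305 prime is 2^130 - 5),
block padding and the final key addition are left out, and the names
TOY_P, toy_poly_1way and toy_poly_2way are hypothetical. In the 2-way
version the two products are added before a single reduction, which is
the saving described in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy modulus for illustration only: 2^31 - 1, NOT 2^130 - 5. */
#define TOY_P UINT64_C(0x7fffffff)

/* One block per step: h = (h + m[i]) * r mod p.
   Requires m[i] < TOY_P and r < TOY_P, so the product stays below 2^63. */
static uint64_t
toy_poly_1way (const uint32_t *m, size_t n, uint32_t r)
{
  uint64_t h = 0;
  size_t i;

  for (i = 0; i < n; i++)
    h = ((h + m[i]) * r) % TOY_P;
  return h;
}

/* Two blocks per step, using the precomputed power r^2:
   h = ((h + m[i]) * r^2 + m[i+1] * r) mod p.
   The sum of the two products is below 2^63 + 2^62, so it fits in
   64 bits and only one reduction is needed per pair of blocks. */
static uint64_t
toy_poly_2way (const uint32_t *m, size_t n, uint32_t r)
{
  uint64_t r2 = ((uint64_t) r * r) % TOY_P;
  uint64_t h = 0;
  size_t i;

  assert (n % 2 == 0);	/* sketch handles an even block count only */
  for (i = 0; i < n; i += 2)
    h = ((h + m[i]) * r2 + (uint64_t) m[i + 1] * r) % TOY_P;
  return h;
}
```

Both functions return the same value for the same input. Extending the
pattern to 4-way would use r^4, r^3, r^2 and r, with one reduction per
four blocks.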

> It would be nice if the arm64 patch could be tested in big-endian mode,
> since I don't have access to any big-endian variant for testing.

Merged this one too, on a branch for CI testing.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
