Maamoun TK <maamoun...@googlemail.com> writes:

> I created merge requests that have improvements of Poly1305 for arm64,
> powerpc64, and s390x architectures by following using two-way interleaving.
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
> The patches have 41.88% speedup for arm64, 142.95% speedup for powerpc64,
> and 382.65% speedup for s390x.
I've had a closer look at the ppc merge request #39. I think it would be
good to do the single block radix 2^44 version first (I'm assuming that
is in itself an improvement over the C code, and over using radix
2^64?). Are 44-bit pieces ideal (130 = 44 + 44 + 42), or would anything
get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 + 56 + 18?

For the 4-way code, the name and organization seem inspired by
chacha_4core, which is a bit different since it also has a four-block
output, and then the caller has to be aware. I think it would be better
to look at the recent ghash. Maybe one can have an internal
_poly1305_update, following similar conventions as _ghash_update? Then
the C code doesn't need to know how many blocks are done at a time,
which should make things a bit simpler (although the assembly code would
need logic to do left-over blocks, just like for ghash).

> OpenSSL is still ahead in terms of performance speed since it uses 4-way
> interleaving or maybe more!!
> Increasing the interleaving ways more than two has nothing to do with
> parallelism since the execution units are already saturated by using 2-ways
> for the three architectures. The reason behind the performance improvement
> is the number of execution times of reduction procedure is cutted by half
> for 4-way interleaving since the products of multiplying state parts by key
> can be combined before the reduction phase. Let me know if you are
> interested in doing that on nettle!

Good to know that 2-way is sufficient to saturate execution units. Going
to 4-way does have a startup cost for each call, since we don't have
space for extra pre-computed powers. But for large messages, we'll get
the best speed if we can make reduction as cheap as possible.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
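
For reference, the point about combining products before the reduction
can be sketched in a toy model that uses Python integers instead of
limbs. This is only an illustration under stated assumptions, not the
code in the merge requests: the function names are made up, clamping of
r and the final addition of s are left out, and the message length is
assumed to be a multiple of 16 bytes. The powers of r are recomputed on
every call, which corresponds to the per-call startup cost mentioned
above.

```python
# Toy model of the Poly1305 polynomial evaluation (hypothetical names).
# Assumptions: no clamping of r, no final "+ s", whole 16-byte blocks only.

P = (1 << 130) - 5  # the Poly1305 prime, 2^130 - 5


def blocks_to_ints(msg):
    """Split msg into 16-byte little-endian blocks with the 2^128 pad bit set."""
    assert len(msg) % 16 == 0
    return [int.from_bytes(msg[i:i + 16], "little") + (1 << 128)
            for i in range(0, len(msg), 16)]


def eval_one_way(r, blocks):
    """Horner's rule: one multiplication and one reduction per block."""
    h = 0
    reductions = 0
    for m in blocks:
        h = ((h + m) * r) % P
        reductions += 1
    return h, reductions


def eval_n_way(r, blocks, n):
    """Handle n blocks per step: add up the n products, then reduce once."""
    # Startup cost paid on every call: r^n, r^(n-1), ..., r^1 mod P.
    powers = [pow(r, k, P) for k in range(n, 0, -1)]
    h = 0
    reductions = 0
    full = len(blocks) - len(blocks) % n
    for i in range(0, full, n):
        group = blocks[i:i + n]
        # (h + m1)*r^n + m2*r^(n-1) + ... + mn*r, kept unreduced while summing.
        acc = (h + group[0]) * powers[0]
        for m, rk in zip(group[1:], powers[1:]):
            acc += m * rk
        h = acc % P
        reductions += 1
    # Left-over blocks fall back to the one-block-at-a-time step.
    for m in blocks[full:]:
        h = ((h + m) * r) % P
        reductions += 1
    return h, reductions
```

With 16 blocks, this model does 16 reductions one block at a time, 8
with n = 2 and 4 with n = 4, and all three give the same value of h.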