Maamoun TK <maamoun...@googlemail.com> writes:

> I apologize for the delays. I pushed a patch that implements 4-way block
> processing of poly1305 using AVX2 instructions based on radix 26.
>
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/58

Let me see if I understand the main idea.

In radix 26 (or rather, 2^26), a 130x130 -> 260 bit multiplication is 25
26x26 -> 52 bit multiplies. And one vpmuludq instruction does four 32x32
-> 64 bit multiplies.
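
To make the count above concrete, here is a minimal sketch in plain
Python (big integers rather than AVX2, and not taken from the merge
request). The helper names to_limbs, schoolbook_mul and from_columns are
made up for illustration. It splits two 130-bit values into five 26-bit
limbs each and counts the limb products of a schoolbook multiply.

```python
MASK26 = (1 << 26) - 1

def to_limbs(x):
    """Split a value below 2^130 into five 26-bit limbs, least significant first."""
    return [(x >> (26 * i)) & MASK26 for i in range(5)]

def schoolbook_mul(a, b):
    """Multiply two five-limb values.

    Returns the nine unreduced column sums and the number of limb
    multiplies that were needed.
    """
    cols = [0] * 9
    count = 0
    for i in range(5):
        for j in range(5):
            # Each term is a 26x26 -> 52 bit product, the kind of
            # multiply that one 64-bit lane of vpmuludq can hold.
            cols[i + j] += a[i] * b[j]
            count += 1
    return cols, count

def from_columns(cols):
    """Recombine column sums, where column k carries weight 2^(26*k)."""
    return sum(c << (26 * k) for k, c in enumerate(cols))
```

The count comes out as 5 * 5 = 25, and since a column sums at most five
52-bit products, every column stays well below 2^64.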

Then for 4-way poly1305, we need four such operations (each product
being one message block times a power of the key), which would be
precisely 25 vpmuludq after data has been laid out appropriately in the
registers, followed by accumulation and reduction.
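
The "message block times a power of the key" structure can be checked
with a small Python sketch. It uses the standard identity for unrolling
Horner evaluation four blocks at a time; whether the merge request
arranges its powers exactly this way is an assumption. The function
names are made up for illustration, and blocks are plain integers that
already include the 2^128 padding bit.

```python
P = (1 << 130) - 5

def poly_serial(h, blocks, r):
    """Reference evaluation: one multiply by r per block."""
    for m in blocks:
        h = (h + m) * r % P
    return h

def poly_4way(h, blocks, r):
    """Consume four blocks per step using r^4, r^3, r^2 and r.

    The four products in each step are independent of each other, which
    is what lets them share the lanes of a SIMD multiply.
    """
    assert len(blocks) % 4 == 0
    r2 = r * r % P
    r3 = r2 * r % P
    r4 = r3 * r % P
    for i in range(0, len(blocks), 4):
        m0, m1, m2, m3 = blocks[i:i + 4]
        h = ((h + m0) * r4 + m1 * r3 + m2 * r2 + m3 * r) % P
    return h
```

Both functions return the same state for any input whose length is a
multiple of four blocks; leftover blocks would still need the serial
path.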

It's not entirely obvious what data layout should be used; it seems
natural that each register should hold pieces from 4 separate messages
or four separate key powers, but preferably so that products can be
accumulated directly with vpaddq, without shifting them around. I think
your code does just that, but it's a bit difficult for me to see how.

I think it should also work fine (since there are plenty of extra bits)
to premultiply certain of the key-power pieces by 5, to get most of the
reduction for free (just accumulate those products with the others). You
probably do that as well?
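
The premultiply-by-5 idea can be sketched as follows, again in plain
Python and not taken from the merge request. Since 2^130 = 5 (mod
2^130 - 5), a limb product a[i] * b[j] with i + j >= 5 can be
accumulated into column i + j - 5 once b[j] has been multiplied by 5.
The names to_limbs and mul_mod_columns are made up for illustration.

```python
P = (1 << 130) - 5
MASK26 = (1 << 26) - 1

def to_limbs(x):
    """Split a value below 2^130 into five 26-bit limbs, least significant first."""
    return [(x >> (26 * i)) & MASK26 for i in range(5)]

def mul_mod_columns(a, b):
    """Multiply five-limb values a and b modulo 2^130 - 5.

    Returns five unreduced column sums; carry propagation between the
    columns is left out of this sketch.
    """
    # Premultiplied key pieces. b5[0] is never used, because i + j >= 5
    # with i <= 4 forces j >= 1.
    b5 = [5 * x for x in b]
    cols = [0] * 5
    for i in range(5):
        for j in range(5):
            k = i + j
            if k < 5:
                cols[k] += a[i] * b[j]
            else:
                # Weight 2^(26*k) = 2^130 * 2^(26*(k-5)), and 2^130 = 5 mod P.
                cols[k - 5] += a[i] * b5[j]
    return cols
```

With 26-bit limbs each term is below 5 * 2^52 and each column holds five
terms, so the sums stay below 2^57 and leave headroom in a 64-bit lane.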

BTW, for other uses of AVX2, I suspect it's fairly low-hanging fruit to
adapt chacha, salsa20 and serpent to do multiple blocks using 256-bit
ymm registers, since they're designed to fit well with SIMD processing.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
