I've added a Poly1305 optimization for the x86_64 architecture, based on
radix 2^26 and using the AVX2 extension, with fat build support. The patch
yields a significant speedup compared to upstream.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/46
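
For readers unfamiliar with the radix 2^26 representation, here is a minimal
sketch in portable C of how a 16-byte block can be split into five 26-bit
limbs. It is only an illustration, not code from the merge request: the names
le32 and block_to_limbs26 are hypothetical, and setting the 2^128 padding bit
in the top limb assumes a full 16-byte block.

```c
#include <stdint.h>

/* Load a 32-bit little-endian word from a byte buffer. */
static uint32_t
le32 (const uint8_t *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
    | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Hypothetical helper (not nettle's API): split a full 16-byte block
   into five 26-bit limbs so that
     block + 2^128 = h[0] + h[1]*2^26 + h[2]*2^52 + h[3]*2^78 + h[4]*2^104.
   Each load starts at the byte containing the limb's lowest bit, and
   the shift drops the bits that belong to the previous limb. */
static void
block_to_limbs26 (uint32_t h[5], const uint8_t m[16])
{
  h[0] = le32 (m) & 0x3ffffff;              /* bits   0..25  */
  h[1] = (le32 (m + 3) >> 2) & 0x3ffffff;   /* bits  26..51  */
  h[2] = (le32 (m + 6) >> 4) & 0x3ffffff;   /* bits  52..77  */
  h[3] = (le32 (m + 9) >> 6) & 0x3ffffff;   /* bits  78..103 */
  h[4] = (le32 (m + 12) >> 8) | (1u << 24); /* bits 104..127, plus 2^128 */
}
```

Since every limb stays below 2^26 (2^25 plus the padding bit for the top
one), a product of two limbs is below 2^52, which leaves room to accumulate
several such products in a 64-bit lane before any carry handling is needed.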
I've also resolved the conflicts in the PPC, S390x, and Arm64 Poly1305 patches.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41

regards,
Mamone


On Fri, Jan 28, 2022 at 8:59 AM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > Great! I believe this is the best we can get for processing one block.
>
> One may be able to squeeze out one or two cycles more using the mulx
> extension, which should make it possible to eliminate some of the move
> instructions (I don't think moves cost any execution unit resources, but
> they do consume decoding resources).
>
> > I'm trying to implement two-way interleaving using AVX extension and
> > the main instruction of interest here is 'vpmuludq' that does double
> > multiply operation
>
> My manual seems a bit confused about whether it's called pmuludq or vpmuludq. But
> you're thinking of the instruction that does two 32x32 --> 64
> multiplies? It will be interesting to see how that works out! It does
> half the work compared to a 64 x 64 --> 128 multiply instruction, but
> accumulation/folding may get more efficient by using vector registers.
> (There seems to also be an avx variant doing four 32x32 --> 64
> multiplies, using 256-bit registers).
>
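
One way to read the "half the work" remark above, sketched in portable C
rather than assembly: each 64-bit lane of pmuludq/vpmuludq multiplies the low
32 bits of its two inputs, so a two-lane instruction yields two 32x32 --> 64
products, while a full 64 x 64 --> 128 product corresponds to four of them.
The helper names mul32x32 and mul64x64 are hypothetical and only serve the
illustration.

```c
#include <stdint.h>

/* One 32x32 -> 64 multiply: the operation pmuludq/vpmuludq performs
   in each 64-bit lane. */
static uint64_t
mul32x32 (uint32_t a, uint32_t b)
{
  return (uint64_t) a * b;
}

/* Hypothetical helper: build a 64x64 -> 128 product (returned as
   hi:lo) from four 32x32 -> 64 partial products. */
static void
mul64x64 (uint64_t *hi, uint64_t *lo, uint64_t a, uint64_t b)
{
  uint32_t a0 = (uint32_t) a, a1 = (uint32_t) (a >> 32);
  uint32_t b0 = (uint32_t) b, b1 = (uint32_t) (b >> 32);

  uint64_t p00 = mul32x32 (a0, b0);   /* weight 2^0  */
  uint64_t p01 = mul32x32 (a0, b1);   /* weight 2^32 */
  uint64_t p10 = mul32x32 (a1, b0);   /* weight 2^32 */
  uint64_t p11 = mul32x32 (a1, b1);   /* weight 2^64 */

  /* Sum of the 2^32 column; at most 3 * (2^32 - 1), so it cannot
     overflow 64 bits. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;

  *lo = (mid << 32) | (uint32_t) p00;
  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

The carry propagation between columns in mul64x64 is the kind of folding
work that, with small limbs, can instead be deferred and done on whole
vector registers.
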
> > the main concern here is that there's a shortage of XMM registers, as
> > there are 16 of them. I'm working on addressing this issue by using
> > memory operands of key values for 'vpmuludq', and hope the processor
> > cache does its thing here.
>
> Reading cached values from memory is usually cheap. So probably fine as
> long as values modified are kept in registers.
>
> > I'm expecting to complete the assembly implementation tomorrow.
>
> If my analysis of the single-block code is right, I'd expect it to be
> rather important to trim the number of instructions per block.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>