I've added a Poly1305 optimization for the x86_64 architecture, based on radix 2^26 (26-bit limbs) and using the AVX2 extension, with fat build support. The patch yields a significant speedup compared to upstream.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/46

I've also fixed the conflicts in the PPC, S390x, and Arm64 patches for Poly1305.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
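
For readers unfamiliar with the representation, below is a minimal portable C sketch of what 26-bit limbs look like. It is not taken from the merge request; the helper names read_le32, block_to_radix26 and product_limb0 are hypothetical, and the real patch does this work in AVX2 assembly. The sketch splits one 16-byte block (plus the 2^128 padding bit) into five 26-bit limbs and computes the lowest limb of the product h*r, using the usual reduction 2^130 = 5 (mod 2^130 - 5). Each partial product is a 32x32 --> 64 multiply, which is the operation one vector lane of vpmuludq provides.

```c
#include <assert.h>
#include <stdint.h>

/* Read 32 bits in little-endian order, independent of host endianness. */
uint32_t read_le32(const uint8_t *p)
{
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: split a full 16-byte block, with the 2^128
   padding bit appended, into five 26-bit limbs (least significant
   limb first). Limb i holds bits 26*i .. 26*i + 25. */
void block_to_radix26(uint32_t limb[5], const uint8_t block[16])
{
  const uint32_t mask = 0x3ffffff; /* 2^26 - 1 */
  int i;

  limb[0] = read_le32(block) & mask;
  limb[1] = (read_le32(block + 3) >> 2) & mask;
  limb[2] = (read_le32(block + 6) >> 4) & mask;
  limb[3] = (read_le32(block + 9) >> 6) & mask;
  /* Top limb: 24 message bits, then the padding bit 2^128. */
  limb[4] = (read_le32(block + 12) >> 8) | ((uint32_t)1 << 24);

  for (i = 0; i < 5; i++)
    assert(limb[i] <= mask);
}

/* Hypothetical helper: lowest limb of h * r mod 2^130 - 5, before
   carry propagation. Terms that would land at 2^130 and above are
   folded back by multiplying the r limb by 5. With all inputs below
   2^26, every product is below 2^55, so the sum of five products
   fits comfortably in 64 bits. */
uint64_t product_limb0(const uint32_t h[5], const uint32_t r[5])
{
  return (uint64_t)h[0] * r[0]
       + (uint64_t)h[1] * (5 * r[4])
       + (uint64_t)h[2] * (5 * r[3])
       + (uint64_t)h[3] * (5 * r[2])
       + (uint64_t)h[4] * (5 * r[1]);
}
```

The headroom in the 64-bit accumulators is what makes this radix attractive for vector code: several products can be summed before any carry handling is needed.
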

regards,
Mamone

On Fri, Jan 28, 2022 at 8:59 AM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > Great! I believe this is the best we can get for processing one block.
>
> One may be able to squeeze out one or two cycles more using the mulx
> extension, which should make it possible to eliminate some of the move
> instructions (I don't think moves cost any execution unit resources, but
> they do consume decoding resources).
>
> > I'm trying to implement two-way interleaving using AVX extension and
> > the main instruction of interest here is 'vpmuludq' that does double
> > multiply operation
>
> My manual seems a bit confused if it's called pmuludq or vpmuludq. But
> you're thinking of the instruction that does two 32x32 --> 64
> multiplies? It will be interesting to see how that works out! It does
> half the work compared to a 64 x 64 --> 128 multiply instruction, but
> accumulation/folding may get more efficient by using vector registers.
> (There seems to also be an avx variant doing four 32x32 --> 64
> multiplies, using 256-bit registers).
>
> > the main concern here is there's a shortage of XMM registers as
> > there are 16 of them, I'm working on addressing this issue by using
> > memory operands of key values for 'vpmuludq' and hope the processor
> > cache do his thing here.
>
> Reading cached values from memory is usually cheap. So probably fine as
> long as values modified are kept in registers.
>
> > I'm expecting to complete the assembly implementation tomorrow.
>
> If my analysis of the single-block code is right, I'd expect it to be
> rather important to trim number of instructions per block.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.