On Sat, Oct 29, 2022 at 11:31 AM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > I will give multiblock radix-2^64 a try on ppc to examine the result. For
> > now, I'm trying to apply your previous note on radix 44 for ppc to improve
> > the speed of reduction phase.
>
> I think I'd like to merge the multi-block refactoring branch
> (refactor-poly1305) before your radix 2^44 code. But that breaks current
> power assembly, since that branch currently requires that any assembly
> code for poly1305 implements both functions. I see three options:
>
> 1. Implement multi-block radix 2^64 code for ppc. Might not be well
>    spent time if it's going to be much slower than new radix 2^44?
>
> 2. Implement multi-block radix 2^64 in ppc assembly, but just as a loop
>    around the single block function (so no speedup).
>

I apologize for the late reply; I don't feel well today. I don't
understand the difference between these two options. And do you prefer
to have the radix 2^64 code for multi-block over radix 2^44? I gave the
benchmark numbers for both radixes in the MR description and in my
previous message. On POWER9 at 2.2 GHz, single-block (2^64) achieves
658.45 Mbyte/s, multi-block (2^64) with a loop around it in assembly
achieves 1002.27 Mbyte/s, and multi-block (2^44) in assembly hits
2044.05 Mbyte/s under the same circumstances. It's clear to me that
radix 2^44 performs best for multi-block, but I'm not sure whether
there are other considerations.

> 3. Arrange so that multiblock function is optional, so that some
>    configurations can use assembly single-block function, and let the C
>    multi-block function loop around it.
>
> What would you suggest?
>

Since the multi-block function gives a considerable speedup on all archs
that have an assembly implementation, I'd suggest making the multi-block
function the default path.
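
To make options 2 and 3 concrete, here is a minimal C sketch of a multi-block function that is only a loop around a single-block function. All names (`toy_ctx`, `toy_block`, `toy_blocks`) are hypothetical, and the arithmetic in `toy_block` is a stand-in, not Poly1305; the sketch illustrates the calling structure only and is not taken from the refactor-poly1305 branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16  /* Poly1305 consumes the message in 16-byte blocks */

/* Hypothetical stand-in for the real state. A real context would hold
   the key r and the accumulator h in the implementation's radix. */
struct toy_ctx {
  uint64_t acc;   /* toy accumulator, NOT real Poly1305 arithmetic */
  size_t calls;   /* number of single-block invocations */
};

/* Stand-in for an assembly single-block routine (hypothetical name). */
static void
toy_block(struct toy_ctx *ctx, const uint8_t *m)
{
  size_t i;
  for (i = 0; i < BLOCK_SIZE; i++)
    ctx->acc = ctx->acc * 31 + m[i];
  ctx->calls++;
}

/* Generic multi-block function: handles `blocks` full blocks by calling
   the single-block routine once per block and returns a pointer just
   past the consumed input. It gives a common interface, no speedup. */
static const uint8_t *
toy_blocks(struct toy_ctx *ctx, size_t blocks, const uint8_t *m)
{
  assert(ctx != NULL);
  for (; blocks > 0; blocks--, m += BLOCK_SIZE)
    toy_block(ctx, m);
  return m;
}
```

Whether such a loop lives in assembly (option 2) or in C (option 3), the per-block work is unchanged; only a dedicated multi-block implementation can share work across blocks.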
I mean, 30% for x86_64 is not bad, not to mention that we will be able
to get a significant speedup once we implement a multi-block function
based on radix 2^44 using 'VPMADD52HUQ/VPMADD52LUQ', which provide full
52-bit multiplication. Unfortunately, my current processor doesn't
support the AVX512VL and AVX512_IFMA extensions.

regards,
Mamone

> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
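
For readers unfamiliar with the radix 2^44 representation discussed in this message, the following C sketch splits a 128-bit value into limbs of 44, 44 and 40 bits and joins them back. The function names are hypothetical and the code is an illustration under those assumptions, not the code from the MR. Since 44 + 44 + 42 = 130, a full 130-bit Poly1305 accumulator fits in three such limbs, and every limb stays below 2^52, the operand width that VPMADD52LUQ/VPMADD52HUQ multiply.

```c
#include <assert.h>
#include <stdint.h>

#define MASK44 ((UINT64_C(1) << 44) - 1)

/* Split a 128-bit value, given as two 64-bit words (lo = bits 0..63,
   hi = bits 64..127), into three limbs so that
   value = l[0] + l[1] * 2^44 + l[2] * 2^88. */
static void
split_radix44(uint64_t lo, uint64_t hi, uint64_t l[3])
{
  l[0] = lo & MASK44;                         /* bits 0..43 */
  l[1] = ((lo >> 44) | (hi << 20)) & MASK44;  /* bits 44..87 */
  l[2] = hi >> 24;                            /* bits 88..127 */
}

/* Inverse of split_radix44. Requires fully reduced limbs:
   l[0], l[1] < 2^44 and l[2] < 2^40. */
static void
join_radix44(const uint64_t l[3], uint64_t *lo, uint64_t *hi)
{
  assert(l[0] <= MASK44 && l[1] <= MASK44 && (l[2] >> 40) == 0);
  *lo = l[0] | (l[1] << 44);          /* high bits of l[1] shift out */
  *hi = (l[1] >> 20) | (l[2] << 24);
}
```

The headroom between 44-bit limbs and 64-bit registers is what lets several partial products be accumulated before carries have to be propagated.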