On Sat, Oct 29, 2022 at 11:31 AM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > I will give multiblock radix-2^64 a try on ppc to examine the result. For
> > now, I'm trying to apply your previous note on radix 44 for ppc to improve
> > the speed of reduction phase.
>
> I think I'd like to merge the multi-block refactoring branch
> (refactor-poly1305) before your radix 2^44 code. But that breaks current
> power assembly, since that branch currently requires that any assembly
> code for poly1305 implements both functions. I see three options:
>
> 1. Implement multi-block radix 2^64 code for ppc. Might not be well
>    spent time if it's going to be much slower than new radix 2^44?
>
> 2. Implement multi-block radix 2^64 in ppc assembly, but just as a loop
>    around the single block function (so no speedup).
>

I apologize for the late reply; I haven't been feeling well today.
I don't understand the difference between the first two options. Also, would
you prefer to have the radix 2^64 code for multi-block rather than radix 2^44?
I gave the benchmark numbers for both radixes in the MR description and in my
previous message. On POWER9 at 2.2 GHz, single-block (2^64) achieves
658.45 Mbyte/s, multi-block (2^64) written as a loop around it in assembly
achieves 1002.27 Mbyte/s, and multi-block (2^44) in assembly reaches
2044.05 Mbyte/s under the same circumstances. It's clear to me that radix
2^44 performs best for multi-block, but I'm not sure whether there are other
considerations.

> 3. Arrange so that multiblock function is optional, so that some
>    configurations can use assembly single-block function, and let the C
>    multi-block function loop around it.
>
> What would you suggest?
>
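
The third option quoted above (an optional multi-block function, with a C
fallback that loops around the single-block function) could look roughly like
the sketch below. It is only an illustration under stated assumptions: the
names toy_ctx, toy_block, toy_blocks_generic and toy_update are hypothetical,
the "block" arithmetic is a placeholder rather than Poly1305, and the real
code would more likely select the implementation at configure time than
through a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for the real hash state. */
struct toy_ctx
{
  uint64_t acc;
  size_t blocks_seen;
};

/* Stand-in for a single-block function that would live in assembly.
   The arithmetic is a placeholder, NOT Poly1305. */
static void
toy_block (struct toy_ctx *ctx, const uint8_t *m)
{
  for (size_t i = 0; i < BLOCK_SIZE; i++)
    ctx->acc = ctx->acc * 31 + m[i];
  ctx->blocks_seen++;
}

/* Type of an optional multi-block function. It returns a pointer to the
   first byte it did not consume. */
typedef const uint8_t *toy_blocks_func (struct toy_ctx *ctx,
                                        size_t blocks, const uint8_t *m);

/* Generic C fallback: a plain loop around the single-block function. */
static const uint8_t *
toy_blocks_generic (struct toy_ctx *ctx, size_t blocks, const uint8_t *m)
{
  for (; blocks > 0; blocks--, m += BLOCK_SIZE)
    toy_block (ctx, m);
  return m;
}

/* Use the native multi-block function when one is supplied (non-NULL),
   otherwise fall back to the generic loop. */
static const uint8_t *
toy_update (struct toy_ctx *ctx, toy_blocks_func *native,
            size_t length, const uint8_t *m)
{
  assert (length % BLOCK_SIZE == 0);
  toy_blocks_func *f = native ? native : toy_blocks_generic;
  return f (ctx, length / BLOCK_SIZE, m);
}
```

With this shape, a platform that only has a single-block assembly function
passes NULL and still works, while a platform with a real multi-block
function gets it on the main path.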

Since the multi-block function gives a considerable speedup on all archs that
have an assembly implementation, I'd suggest making the multi-block function
the default path. I mean, 30% for x86_64 is not bad. On top of that, we should
be able to get a significant further speedup once we implement a multi-block
function based on radix 2^44 using 'VPMADD52HUQ/VPMADD52LUQ', which provide
full 52-bit multiplication. Unfortunately, my current processor doesn't
support the AVX512VL and AVX512_IFMA extensions.
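
To make the radix 2^44 representation mentioned above concrete: Poly1305
works on values of up to 130 bits, which can be held in three limbs of 44,
44 and 42 bits, each comfortably below the 52-bit operand width of the
multiply instructions. The following is a minimal sketch of splitting one
16-byte little-endian block (plus the 2^128 padding bit) into such limbs.
The names load_le64 and split_radix44 are hypothetical and are not taken
from the Nettle sources or the MR.

```c
#include <stdint.h>

#define MASK44 (((uint64_t) 1 << 44) - 1)

/* Read 8 bytes as a little-endian 64-bit integer. */
static uint64_t
load_le64 (const uint8_t *p)
{
  uint64_t x = 0;
  for (int i = 7; i >= 0; i--)
    x = (x << 8) | p[i];
  return x;
}

/* Split a full 16-byte block into radix 2^44 limbs:
   limb[0] = bits 0..43, limb[1] = bits 44..87, limb[2] = bits 88..127,
   with the padding bit 2^128 set at position 128 - 88 = 40 of limb[2]. */
static void
split_radix44 (uint64_t limb[3], const uint8_t block[16])
{
  uint64_t lo = load_le64 (block);
  uint64_t hi = load_le64 (block + 8);

  limb[0] = lo & MASK44;
  limb[1] = ((lo >> 44) | (hi << 20)) & MASK44;
  limb[2] = (hi >> 24) | ((uint64_t) 1 << 40);
}
```

The sketch covers only the input conversion; the multiplication and
reduction steps, where the speedup would actually come from, are not shown.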

regards,
Mamone


> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
