Maamoun TK <maamoun...@googlemail.com> writes:

> On Sat, Oct 29, 2022 at 11:31 AM Niels Möller <ni...@lysator.liu.se> wrote:
>
>> I think I'd like to merge the multi-block refactoring branch
>> (refactor-poly1305) before your radix 2^44 code. But that breaks current
>> power assembly, since that branch currently requires that any assembly
>> code for poly1305 implements both functions. I see three options:
>>
>> 1. Implement multi-block radix 2^64 code for ppc. Might not be well
>>    spent time if it's going to be much slower than new radix 2^44?
>>
>> 2. Implement multi-block radix 2^64 in ppc assembly, but just as a loop
>>    around the single block function (so no speedup).
>>
>
> I apologize for late reply, I don't feel well today.

I hope you feel better soon.

> I don't understand the difference between the two options.

I was thinking about (1) a proper implementation, keeping state in
registers throughout the loop, and (2) something as simple as possible,
with a loop just making an explicit call to the single-block function.
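
To make the distinction concrete, here is a rough C sketch of the two
shapes. It is not Nettle code: the acc_* names, the two-word state and
the mixing step are hypothetical stand-ins for the real poly1305 state
and arithmetic. A C compiler may well inline the call, so the difference
mostly matters for hand-written assembly; the sketch only shows where
the state is loaded and stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical two-word state, standing in for the poly1305 state. */
struct acc_ctx { uint64_t h0, h1; };

/* Read 8 bytes as a little-endian 64-bit word. */
static uint64_t
load_le64 (const uint8_t *p)
{
  uint64_t x = 0;
  unsigned i;
  for (i = 0; i < 8; i++)
    x |= (uint64_t) p[i] << (8 * i);
  return x;
}

/* Single-block step: state is read from and written back to memory. */
static void
acc_block (struct acc_ctx *ctx, const uint8_t *m)
{
  ctx->h0 = (ctx->h0 ^ load_le64 (m)) * 0x100000001B3ULL;
  ctx->h1 = (ctx->h1 + load_le64 (m + 8)) ^ ctx->h0;
}

/* Shape (2): a plain loop calling the single-block function. */
static void
acc_blocks_simple (struct acc_ctx *ctx, size_t blocks, const uint8_t *m)
{
  assert (ctx != NULL);
  for (; blocks > 0; blocks--, m += BLOCK_SIZE)
    acc_block (ctx, m);
}

/* Shape (1): load the state once, keep it in locals (registers, in an
   assembly version) for the whole loop, and store it once at the end. */
static void
acc_blocks_fused (struct acc_ctx *ctx, size_t blocks, const uint8_t *m)
{
  uint64_t h0, h1;
  assert (ctx != NULL);
  h0 = ctx->h0;
  h1 = ctx->h1;
  for (; blocks > 0; blocks--, m += BLOCK_SIZE)
    {
      h0 = (h0 ^ load_le64 (m)) * 0x100000001B3ULL;
      h1 = (h1 + load_le64 (m + 8)) ^ h0;
    }
  ctx->h0 = h0;
  ctx->h1 = h1;
}
```

Both functions compute the same result; only the placement of the loads
and stores of the state differs.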

But I'm now leaning towards option 3: Somehow reorganize things so that
an asm implementation of the multi-block function is optional (could be a
separate file, or changes to how the HAVE_NATIVE_* constants are defined;
currently, they're only set for optional asm files, not for asm files
replacing C files). I think that will also make it more straightforward
to measure the performance benefit of implementing the multi-block
function when adding poly1305 assembly for other archs later.
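
One possible shape for this, as a sketch only: the generic C loop is
compiled unless a HAVE_NATIVE_* constant says an assembly version
exists. The names toy_ctx, toy_block, toy_blocks and
HAVE_NATIVE_toy_blocks below are made up for illustration, and the
mixing step is a placeholder, not poly1305.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical stand-in for the real hash state. */
struct toy_ctx { uint64_t h; };

/* Stand-in for the single-block function, which always exists
   (in C, or as an asm file replacing the C file). */
static void
toy_block (struct toy_ctx *ctx, const uint8_t *m)
{
  uint64_t x = 0;
  size_t i;
  for (i = 0; i < TOY_BLOCK_SIZE; i++)
    x = x * 31 + m[i];
  /* Unsigned arithmetic, wraps mod 2^64. */
  ctx->h = (ctx->h + x) * 0x9E3779B97F4A7C15ULL;
}

#if HAVE_NATIVE_toy_blocks
/* An optional assembly file provides the multi-block function. */
const uint8_t *
toy_blocks (struct toy_ctx *ctx, size_t blocks, const uint8_t *m);
#else
/* Generic fallback: loop over the single-block function and return
   a pointer to the first unprocessed byte. */
static const uint8_t *
toy_blocks (struct toy_ctx *ctx, size_t blocks, const uint8_t *m)
{
  assert (ctx != NULL);
  for (; blocks > 0; blocks--, m += TOY_BLOCK_SIZE)
    toy_block (ctx, m);
  return m;
}
#endif
```

With this arrangement, an architecture that only has single-block
assembly still links, and building with and without the constant
defined gives a direct before/after benchmark comparison.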

> And do you prefer to have the code of 2^64 for multi-block over 2^44?

2^44 looks very promising, so I don't think we should spend much more
time on 2^64 for ppc.

But I still want to find a way to merge the refactoring branch without
breaking the ppc build (in the current state, the branch fails with link
errors on ppc).

> I mentioned the benchmark numbers of both radixes in MR description and
> previous message. Single-block (2^64) achieves 658.45 Mbyte/s on POWER9 2.2
> GHz while multi-block (2^64) with a loop around it in assembly achieves
> 1002.27 Mbyte/s and multi-block (2^44) in assembly hit 2044.05 Mbyte/s
> under same circumstances. It's clear to me that radix 2^44 performs the
> best for multi-block but not sure if there are other considerations for
> that.

Sounds great! Did 2^44 also beat 2^64 in the single-block case? Do we
need to consider keeping 2^64 for small-message performance? It will of
course make things simpler if we can switch to 2^44 for everything (on
ppc).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
