On Mon, Oct 24, 2022 at 9:44 PM Niels Möller <ni...@lysator.liu.se> wrote:

> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > I think the design could be as simple as always padding each block with
> > 0x01 in _nettle_poly1305_update, while _nettle_poly1305_block, which is
> > responsible for processing the last block, takes a variable padding
> > value (0 or 1). I committed an update in
> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48 that
> applies
> > that design.
>
> I've tried out this refactoring on its own branch. There's a new
> _nettle_poly1305_update, in C only, which deals with partial blocks and
> is called from both poly1305-aes and chacha-poly1305.
>
> It calls a new function _nettle_poly1305_blocks, with the interface we
> have been discussing. And I've implemented the new function for x86_64.
> Conclusions,
>
> 1. There's some code duplication between _block and _blocks, which seems
>    hard to avoid (but *maybe* some m4 macrology for shared logic could
>    be a good idea).
>

I agree that some m4 macrology would eliminate the code duplication here.
Having _nettle_poly1305_block and _nettle_poly1305_blocks in one file, as in
the new branch, will make that easy to apply.
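To make the intended split concrete, here is a rough C sketch of how I
picture _nettle_poly1305_update sitting on top of the two block routines.
The prototypes below (in particular the one for _nettle_poly1305_blocks and
the shape of the update function itself) are my assumptions for the sake of
illustration, not necessarily what the branch ends up with:

  /* Sketch only; prototypes are assumed, not taken from the branch. */
  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  #define BLOCK_SIZE 16    /* POLY1305_BLOCK_SIZE */

  struct poly1305_ctx;     /* as in nettle/poly1305.h */

  /* One block, variable padding (0 or 1). */
  void _nettle_poly1305_block (struct poly1305_ctx *ctx,
                               const uint8_t *m, unsigned high);
  /* n complete blocks, always padded with 0x01 (assumed interface). */
  void _nettle_poly1305_blocks (struct poly1305_ctx *ctx,
                                size_t n, const uint8_t *m);

  static void
  poly1305_update_sketch (struct poly1305_ctx *ctx,
                          uint8_t *block, unsigned *index,
                          size_t length, const uint8_t *m)
  {
    if (*index > 0)
      {
        /* Top up the buffered partial block first. */
        unsigned left = BLOCK_SIZE - *index;
        if (length < left)
          {
            memcpy (block + *index, m, length);
            *index += length;
            return;
          }
        memcpy (block + *index, m, left);
        m += left;
        length -= left;
        _nettle_poly1305_block (ctx, block, 1);  /* complete block, pad = 1 */
        *index = 0;
      }
    if (length >= BLOCK_SIZE)
      {
        size_t n = length / BLOCK_SIZE;
        _nettle_poly1305_blocks (ctx, n, m);     /* bulk path */
        m += n * BLOCK_SIZE;
        length -= n * BLOCK_SIZE;
      }
    memcpy (block, m, length);                   /* stash the tail, if any */
    *index = length;
  }

Only the final, possibly partial block is then left for the digest code,
which can push it through _nettle_poly1305_block with the appropriate
padding value.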


> 2. When benchmarking on my laptop, it's 70% (!) faster. I had expected
>    only a minor improvement; I'm not yet convinced it isn't too good to
>    be true, and the tests need improvement. The numbers I get are a speed
>    increase from 3 GB/s to 5 GB/s for the poly1305 update function, or
>    44 cycles/block reduced to 25.
>

I did the benchmark on my laptop too. I got a speed of 3964.37 MB/s on
upstream and 5054.32 MB/s benchmarking poly1305 update on the new branch. I
wonder if the result numbers are truncated on your end, because that would
bring the improvement in line with my test (about 27%).


>    If this improvement is real, my best explanation is that avoiding
>    load and store of the state between iterations makes out-of-order
>    execution across iterations work a *lot* better, e.g., letting the next
>    iteration's multiplies involving H0 and H1 start in parallel with the
>    final imul that H2 depends on.
>

That makes sense. The x86_64 out-of-order machinery does a good job of
overlapping successive loop iterations whenever the dependency chains allow
it.
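To make that concrete, here is a rough C model of a radix-2^64 multi-block
loop, written in the style of well-known 64-bit C implementations rather
than the actual nettle x86_64 assembly. The point is that h0/h1/h2 live in
locals (i.e. registers) for the whole loop and the state is written back
only once at the end, so the h0/h1 multiplies of iteration i+1 can issue
while the final multiply that h2 depends on in iteration i is still in
flight:

  /* Sketch, assuming GCC/Clang unsigned __int128; not nettle code. */
  #include <stdint.h>
  #include <stddef.h>

  typedef unsigned __int128 u128;

  static uint64_t
  le64 (const uint8_t *p)
  {
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
      x |= (uint64_t) p[i] << (8 * i);
    return x;
  }

  /* h[3]: partially reduced state; r[2]: the clamped key. */
  static void
  poly1305_blocks_sketch (uint64_t h[3], const uint64_t r[2],
                          size_t nblocks, const uint8_t *m)
  {
    uint64_t r0 = r[0], r1 = r[1];
    uint64_t s1 = r1 + (r1 >> 2);  /* 5*r1/4, exact since r1 is clamped */
    uint64_t h0 = h[0], h1 = h[1], h2 = h[2], c;
    u128 d0, d1;

    while (nblocks-- > 0)
      {
        /* h += block, always padded with 0x01 at bit 128 in this design */
        d0 = (u128) h0 + le64 (m);
        h0 = (uint64_t) d0;
        d1 = (u128) h1 + (uint64_t) (d0 >> 64) + le64 (m + 8);
        h1 = (uint64_t) d1;
        h2 += (uint64_t) (d1 >> 64) + 1;

        /* h *= r, partially reduced mod 2^130 - 5 */
        d0 = (u128) h0 * r0 + (u128) h1 * s1;
        d1 = (u128) h0 * r1 + (u128) h1 * r0 + h2 * s1; /* h2*s1 fits 64 bits */
        h2 *= r0;              /* the final multiply that h2 depends on */

        h0 = (uint64_t) d0;
        d1 += (uint64_t) (d0 >> 64);
        h1 = (uint64_t) d1;
        h2 += (uint64_t) (d1 >> 64);

        /* fold the part of h2 above 2 bits back in (2^130 = 5 mod p) */
        c = (h2 >> 2) + (h2 & ~(uint64_t) 3);
        h2 &= 3;
        h0 += c;
        c = (h0 < c);
        h1 += c;
        c = (h1 < c);
        h2 += c;

        m += 16;
      }
    /* The state is stored once, after the loop. */
    h[0] = h0; h[1] = h1; h[2] = h2;
  }

With a per-block function, every iteration instead ends with three stores
and starts with three loads of h0/h1/h2, which serializes the iterations
much more than the data dependencies require.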


> See https://git.lysator.liu.se/nettle/nettle/-/commits/refactor-poly1305.
>
> At the moment, the new _blocks method is not optional. It could be made
> optional with a bit of configure hacking, but given the promising (and
> surprising, at least to me) results on x86_64, I think it would be good
> to try out adding it for ppc as well to see if it brings a small or
> large improvement. Do you already have multiblock radix-2^64 code in
> your merge request, or only the new radix 2^44 variant?
>

I will give multiblock radix-2^64 a try on ppc and see what it yields. For
now, I'm trying to apply your previous note about radix 2^44 on ppc to speed
up the reduction phase.
The most interesting part is this:

> To get the highest speed of this reduction, I think one should keep two
> ideas in mind:

> 1. It's not necessary to have a unique and fully reduced representation,
>    one can allow hh0, hh1 and hh2 to be a bit or two larger than 44 bits.

ppc only offers an instruction to shift a 128-bit value by whole octets,
which makes something like the following sketch a better fit:

  l0 = p0 & 0x3ffffffffff; h0 = (p0 >> 40) ...

I've thought of radix 2^48, but the fact that r^2_1' = r^2_1 * 5 << 14
(where r^2_1 is a 48-bit limb) needs 65 bits to fit makes it a little
harder to apply.
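For what it's worth, this is how I arrive at that 65-bit figure (my own
back-of-the-envelope, please double check): with 48-bit limbs, the
h_2 * r^2_1 cross product sits at weight 2^(96+48) = 2^144, and

  2^144 = 2^14 * 2^130, where 2^130 = 5 (mod 2^130 - 5)

so the folded constant becomes r^2_1' = r^2_1 * 5 << 14. With r^2_1 up to
48 bits and 5 << 14 a bit over 16 bits, the product can reach about 64.3
bits, so it needs a 65-bit limb. In radix 2^44 the corresponding constant
is only r_1 * 5 << 2 (the cross product sits at 2^132 = 4 * 2^130, and
4 * 5 = 20), which fits comfortably in 64 bits.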

regards,
Mamone


> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>
