Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

Maamoun TK Fri, 06 May 2022 09:11:11 -0700

On Wed, May 4, 2022 at 7:47 PM Niels Möller <ni...@lysator.liu.se> wrote:


> Maamoun TK <maamoun...@googlemail.com> writes:
>
> > I'm thinking of extending the structure layout of 'poly1305_ctx' to take
> in
> > pre-computed powers of key. How backward compatible would it be to append
> > additional key arrays to that structure?
>
> It would break ABI compatibility, and require a new library soname.
> Other than that, it would be API compatible, code using nettle could be
> recompiled with no source code changes.
>
> So definitely possible, but not something to do lightly. I think it's
> best to postpone until have reason to break the ABI for other reasons.
>
> >> We can either extend the layout or use _nettle_poly1305_4core starting
> >> with 64 blocks and more as this function begins to consume less cycles
> than
> >> average consumptions of the 2^64 version.
>
> Do you need to do 64 blocks (16 iterations of the
> _nettle_poly1305_4core loop) before it beats the single block radix-64
> version? Then precomputations must be rather costly? But due to the abi
> issues, I think we have to live with redoing them for each call.
>
> I haven't looked closely at your code yet. It seems important that
> _nettle_poly1305_4core can loop to process larger messages without
> redoing precomputations. Can we use parallelism in the precomputation? I
> imagine what needs to be done has the following steps:
>
> 1. Input is the key in radix 2^64, stored in r64[0] and r64[1] (the
>    premultiplied values aren't helpful, I'm afraid).
>
> 2. Split into radix 2^26, five pieces, that goes into 32-bit
> (sub-)registers.
>    Also premultiplied by 5 when useful. Call this K1.
>
> 3. Compute squared key, K2 = K1^2, and premultiplied by 5 when useful. In
>    principle, squaring can be easier than multiply, 15 32x32 --> 64
>    multiplies rather than 25, but may not apply here since we want to
>    part of the reduction at the same time.
>
> 4. Compute K3 = K1 K2 and K4 = K2^2. If we don't try to do any special
>    for the squaring, at least these two multiply operations could be
>    done in a SIMD fashion.
>

I tried these steps to compute key powers. Squaring K2 = K1^2 does have
multiplications of 15 32x32 rather than 25 divided into 3 groups that
summed together. It also needs to pre-multiply key by 2 to apply perfect
square. Full reduction round has been applied on products as I couldn't
find easy way to partly reduce them then it computes K3 = K1 K2 and K4 = K2
K2 simultaneously in SIMD fashion.
I used XMM registers to reduce the transition bandwidth between stack and
move instructions. The result of mentioned improvements yield a sligh
speedup of pre-computations that has still been drawing more cycles than
2^64 version for less than 64 blocks as a whole process.

In case extending the layout of 'poly1305_ctx' structure is not an option,
I would suggest applying that threshold of message length in an
arch-specific manner. How do you think we can do that?

regards,
Mamone


> Also, I think the result after this precomputation probably doesn't have
> to be in canonical representation, with each digit in the range 0 <= x <
> 2^26, it can most likely be made to work with a somewhat larger range.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
>
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

Reply via email to