On Wed, May 4, 2022 at 7:47 PM Niels Möller <ni...@lysator.liu.se> wrote:
> Maamoun TK <maamoun...@googlemail.com> writes: > > > I'm thinking of extending the structure layout of 'poly1305_ctx' to take > in > > pre-computed powers of key. How backward compatible would it be to append > > additional key arrays to that structure? > > It would break ABI compatibility, and require a new library soname. > Other than that, it would be API compatible, code using nettle could be > recompiled with no source code changes. > > So definitely possible, but not something to do lightly. I think it's > best to postpone until have reason to break the ABI for other reasons. > > >> We can either extend the layout or use _nettle_poly1305_4core starting > >> with 64 blocks and more as this function begins to consume less cycles > than > >> average consumptions of the 2^64 version. > > Do you need to do 64 blocks (16 iterations of the > _nettle_poly1305_4core loop) before it beats the single block radix-64 > version? Then precomputations must be rather costly? But due to the abi > issues, I think we have to live with redoing them for each call. > > I haven't looked closely at your code yet. It seems important that > _nettle_poly1305_4core can loop to process larger messages without > redoing precomputations. Can we use parallelism in the precomputation? I > imagine what needs to be done has the following steps: > > 1. Input is the key in radix 2^64, stored in r64[0] and r64[1] (the > premultiplied values aren't helpful, I'm afraid). > > 2. Split into radix 2^26, five pieces, that goes into 32-bit > (sub-)registers. > Also premultiplied by 5 when useful. Call this K1. > > 3. Compute squared key, K2 = K1^2, and premultiplied by 5 when useful. In > principle, squaring can be easier than multiply, 15 32x32 --> 64 > multiplies rather than 25, but may not apply here since we want to > part of the reduction at the same time. > > 4. Compute K3 = K1 K2 and K4 = K2^2. If we don't try to do any special > for the squaring, at least these two multiply operations could be > done in a SIMD fashion. > I tried these steps to compute key powers. Squaring K2 = K1^2 does have multiplications of 15 32x32 rather than 25 divided into 3 groups that summed together. It also needs to pre-multiply key by 2 to apply perfect square. Full reduction round has been applied on products as I couldn't find easy way to partly reduce them then it computes K3 = K1 K2 and K4 = K2 K2 simultaneously in SIMD fashion. I used XMM registers to reduce the transition bandwidth between stack and move instructions. The result of mentioned improvements yield a sligh speedup of pre-computations that has still been drawing more cycles than 2^64 version for less than 64 blocks as a whole process. In case extending the layout of 'poly1305_ctx' structure is not an option, I would suggest applying that threshold of message length in an arch-specific manner. How do you think we can do that? regards, Mamone > Also, I think the result after this precomputation probably doesn't have > to be in canonical representation, with each digit in the range 0 <= x < > 2^26, it can most likely be made to work with a somewhat larger range. > > Regards, > /Niels > > -- > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. > Internet email is subject to wholesale government surveillance. > _______________________________________________ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se