Joachim Strömbergson <[email protected]> writes:

> The big difference is that you update the variables in a QR twice during
> the QR processing, but the QR is more regular and can easily (easier) be
> scheduled with fewer register active in a given cycle.

Sounds like I have to look closer at the chacha spec to understand the
details.

> This is why I got a bit curious when you Niels stated: "And the
> particular change from 12 to 14 might add significant complexity
> to an optimized implementations with 4-way unrolling"

There was no deep thought behind that comment. It's just that if an
assembly loop is unrolled 4 times, it simplifies the code if you can
assume that the number of rounds you need is always divisible by 4.

Now, the current salsa20 implementations don't do that; _salsa20_core
seems to support any even, non-zero number of rounds, in the C, x86_64
and ARM NEON versions alike. And there's no obvious gain in doing more
unrolling. It could possibly make more sense for chacha, if each round
is shorter in terms of number of instructions, cycles, and dependencies.
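
To make the divisibility point concrete, here is a rough C sketch (not
nettle code; the names rounds_by_2, rounds_by_4 and double_round are made
up for illustration). It uses the chacha quarter round as I understand it
from the spec. A loop that does two rounds per iteration accepts any even,
non-zero round count, while a loop unrolled to four rounds per iteration
needs an extra tail step whenever the count is not a multiple of 4, as
with 14.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* ChaCha quarter round: each of a, b, c, d is updated twice. */
#define QROUND(a, b, c, d) do {            \
    a += b; d ^= a; d = ROTL32(d, 16);     \
    c += d; b ^= c; b = ROTL32(b, 12);     \
    a += b; d ^= a; d = ROTL32(d, 8);      \
    c += d; b ^= c; b = ROTL32(b, 7);      \
  } while (0)

/* Two rounds: one column round followed by one diagonal round. */
static void
double_round(uint32_t x[16])
{
  QROUND(x[0], x[4], x[8],  x[12]);
  QROUND(x[1], x[5], x[9],  x[13]);
  QROUND(x[2], x[6], x[10], x[14]);
  QROUND(x[3], x[7], x[11], x[15]);

  QROUND(x[0], x[5], x[10], x[15]);
  QROUND(x[1], x[6], x[11], x[12]);
  QROUND(x[2], x[7], x[8],  x[13]);
  QROUND(x[3], x[4], x[9],  x[14]);
}

/* Hypothetical: two rounds per iteration, any even non-zero count. */
void
rounds_by_2(uint32_t x[16], unsigned rounds)
{
  unsigned i;
  assert(rounds > 0 && rounds % 2 == 0);
  for (i = 0; i < rounds; i += 2)
    double_round(x);
}

/* Hypothetical: four rounds per iteration. The tail step is the extra
   complexity that goes away if rounds % 4 == 0 can be assumed. */
void
rounds_by_4(uint32_t x[16], unsigned rounds)
{
  unsigned i;
  assert(rounds > 0 && rounds % 2 == 0);
  for (i = rounds; i >= 4; i -= 4)
    {
      double_round(x);
      double_round(x);
    }
  if (i == 2)
    double_round(x);   /* tail for counts like 14 */
}
```

In C the tail is a single extra branch; in hand-scheduled assembly, where
the four rounds are interleaved, the same tail usually means a second copy
of part of the loop body.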

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
