Aloha!

Niels Möller wrote:
> Benchmarking nettle's implementation on my office machine (core i5),
> 
> algorithm   cycles/byte
> salsa20      5.3
> aes128      11
> aes128      22    (openssl)
> arcfour      7.5
> arcfour      3.75 (openssl)

Side issue: there is a pretty big difference in performance for arcfour
as well.


> Anyway, getting back to chacha, it will be interesting to see how
> much faster chacha is than salsa20.

DJB's and some other benchmarks show anything from zero to 30% better
performance. The ChaCha paper presents some ideas about the difference
in parallelizability.


> If I remember the chacha changes correctly, one gets rid of a 
> permutation of the matrix, and I think some of the rotations in the 
> round function (done as movaps, pslld, psrld, pxor) can be replaced
> by a pshufd. I think that can reduce the instruction count for the
> round function by 25-50%, depending on how many of the rotations can
> be replaced (there ought to be at least one rotation left with a
> rotation count which isn't a multiple of 8).
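
To make the point about byte-aligned rotation counts concrete: the
ChaCha QR rotates by 16, 12, 8 and 7 bits, so two of the four counts are
multiples of 8 and amount to a pure byte permutation, which is the kind
of thing a shuffle instruction can do in one step. A small Python
sketch follows. The helper names are made up for this illustration, and
it models a single 32-bit word rather than an SSE register.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (shift, shift, or)."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def rotl32_by_byte_shuffle(x, n):
    """Rotate left by a multiple of 8 bits by permuting bytes only."""
    if n % 8 != 0:
        raise ValueError("a byte shuffle can only rotate by multiples of 8")
    k = (n // 8) % 4
    src = x.to_bytes(4, "little")
    # After rotating left by k bytes, output byte i holds input byte (i - k) mod 4.
    dst = bytes(src[(i - k) % 4] for i in range(4))
    return int.from_bytes(dst, "little")


# Rotation counts of the ChaCha QR, and the ones a byte shuffle can handle.
CHACHA_ROTATIONS = (16, 12, 8, 7)
BYTE_ALIGNED = [n for n in CHACHA_ROTATIONS if n % 8 == 0]
```

The rotations by 12 and 7 still need the shift/shift/or sequence, which
matches the remark that at least one non-byte-aligned rotation remains.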

The big difference is that each variable in a QR (quarterround) is
updated twice during the QR processing, but the QR is more regular and
can more easily be scheduled with fewer registers active in a given
cycle.
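
For reference, a minimal Python sketch of the ChaCha QR as it is usually
specified (rotation counts 16, 12, 8, 7). The names rotl32 and
quarterround are illustrative and not taken from nettle or any other
implementation mentioned in this thread. It shows the two passes in
which each of the four words gets written.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarterround(a, b, c, d):
    """One ChaCha QR on four 32-bit words; returns the updated words."""
    # First pass: each of a, d, c, b is written once.
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    # Second pass: same add/xor/rotate pattern, each word is written again.
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

Every step reads the word written by the step just before it, so within
one QR there is not much to run in parallel; the parallelism has to come
from running several QRs side by side.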

The DR (doubleround) processing is also more regular, which makes
parallelism easier. The tight spot is between QR3 and QR4, where x15 is
used in both. Otherwise it is really the 4 separate QRs in each half of
the DR that provide parallelism.
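
The dependency structure can be checked with a few lines of Python.
This is only a sketch, assuming the usual numbering where QR0..QR3 are
the column QRs and QR4..QR7 the diagonal QRs of one DR; the helper
names are invented for the example.

```python
# Word indices (x0..x15) touched by each QR in one ChaCha DR.
COLUMN_QRS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_QRS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
DOUBLEROUND = COLUMN_QRS + DIAGONAL_QRS  # QR0..QR7


def shared_words(i, j):
    """Sorted list of word indices used by both QR i and QR j."""
    return sorted(set(DOUBLEROUND[i]) & set(DOUBLEROUND[j]))


def independent(qrs):
    """True if no word index appears in more than one QR of the list."""
    seen = [w for qr in qrs for w in qr]
    return len(seen) == len(set(seen))
```

With these tables, independent(COLUMN_QRS) and independent(DIAGONAL_QRS)
both hold, and shared_words(3, 4) gives [15]: when the QRs are processed
one after another, QR4 cannot start on x15 until QR3 has finished
with it.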

This is why I got a bit curious when you, Niels, stated: "And the
particular change from 12 to 14 might add significant complexity
to an optimized implementations with 4-way unrolling"

If we constrain ourselves to an even number of rounds, I have a hard
time seeing how that would add significant complexity, since we will
still be doing the DR processing the same way. I guess I'm missing
something, but I have spent some time doodling and thinking about the
dependency constraints in ChaCha, since I've done a HW implementation:

https://github.com/secworks/swchacha
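
To illustrate the point about even round counts: in a straightforward
(non-unrolled) software model, the round count only sets how many times
the same DR body is repeated. The Python sketch below is an assumption
of how such a model could look. The names permute and QR_SCHEDULE are
invented for the example and are not taken from nettle or from the
repository above, and the feed-forward addition of the input state is
left out.

```python
MASK32 = 0xFFFFFFFF

# One DR: four column QRs followed by four diagonal QRs.
QR_SCHEDULE = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
]


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarterround(x, a, b, c, d):
    """Apply one ChaCha QR in place to words a, b, c, d of the list x."""
    x[a] = (x[a] + x[b]) & MASK32
    x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK32
    x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK32
    x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK32
    x[b] = rotl32(x[b] ^ x[c], 7)


def permute(state, rounds):
    """Run rounds/2 DRs over a copy of the 16-word state and return it."""
    if rounds % 2 != 0:
        raise ValueError("this sketch only handles an even number of rounds")
    x = list(state)
    for _ in range(rounds // 2):
        for a, b, c, d in QR_SCHEDULE:
            quarterround(x, a, b, c, d)
    return x
```

Going from 12 to 14 rounds here just means one more pass through the
loop. Whether that still holds for a 4-way unrolled implementation,
where 7 DRs do not divide evenly the way 6 do, is a separate question.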

The current implementation contains only a single QR, but will be
extended with support for 2 and 4 parallel QRs. There is a good paper
[0] on HW implementations of Salsa20 and ChaCha that shows the
dependency within the QR. Looking at the clock frequencies achieved,
one can clearly see when the dependency between QR3 and QR4 kicks in.

Oh, and in that paper Salsa20 is actually neck and neck with or slightly
faster than ChaCha. ;-)


[0] L. Henzen, F. Carbognani, N. Felber, and W. Fichtner. VLSI Hardware
Evaluation of the Stream Ciphers Salsa20 and ChaCha, and the Compression
Function Rumba.

-- 
Med vänlig hälsning, Yours

Joachim Strömbergson - Alltid i harmonisk svängning.
========================================================================
 Joachim Strömbergson          Secworks AB          [email protected]
========================================================================