I'm looking at the skein code I wrote a while ago, and I might merge it
pretty soon, just need a little cleanup. I'm doing skein256 and skein512
(other variants, in particular skein512-256, may be of interest).

For skein256, it works quite well with two-way unrolling (by which I
mean that each loop iteration performs 8 mixing rounds, adding in
subkeys twice). 
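
To make the loop structure concrete, here is a rough sketch in C of what
two-way unrolling means for the 4-word block cipher underlying skein256:
72 rounds become 9 iterations of 8 rounds, with a subkey added before
each group of 4 rounds and one final subkey after the loop. This is not
the code discussed in this mail. The function names are made up, the
rotation counts and word permutation are quoted from memory of the
Skein 1.3 specification and should be checked against it, the subkey
indexing uses a plain % rather than anything tuned, and the feed-forward
xor with the message block is left out.

```c
#include <stdint.h>

#define ROTL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

/* Assumption: rotation counts for the 4-word variant, from memory of
   the Skein 1.3 specification. Verify before relying on them. */
static const unsigned char skein256_rot[8][2] = {
  {14, 16}, {52, 57}, {23, 40}, { 5, 37},
  {25, 33}, {46, 12}, {58, 22}, {32, 32},
};

/* One round: two MIX operations, then the word permutation
   (0, 3, 2, 1), i.e. words 1 and 3 trade places. */
void
round256(uint64_t x[4], const unsigned char r[2])
{
  uint64_t t;
  x[0] += x[1]; x[1] = ROTL64(x[1], r[0]) ^ x[0];
  x[2] += x[3]; x[3] = ROTL64(x[3], r[1]) ^ x[2];
  t = x[1]; x[1] = x[3]; x[3] = t;
}

/* Add subkey number s. k holds the 4 key words plus the extra fifth
   word, t holds the 2 tweak words plus their xor. */
void
add_subkey256(uint64_t x[4], const uint64_t k[5], const uint64_t t[3],
              unsigned s)
{
  x[0] += k[s % 5];
  x[1] += k[(s + 1) % 5] + t[s % 3];
  x[2] += k[(s + 2) % 5] + t[(s + 1) % 3];
  x[3] += k[(s + 3) % 5] + s;
}

/* Two-way unrolled: each iteration adds two subkeys and does 8 rounds. */
void
threefish256_sketch(uint64_t x[4], const uint64_t k[5], const uint64_t t[3])
{
  unsigned i;
  for (i = 0; i < 9; i++)
    {
      add_subkey256(x, k, t, 2*i);
      round256(x, skein256_rot[0]);
      round256(x, skein256_rot[1]);
      round256(x, skein256_rot[2]);
      round256(x, skein256_rot[3]);
      add_subkey256(x, k, t, 2*i + 1);
      round256(x, skein256_rot[4]);
      round256(x, skein256_rot[5]);
      round256(x, skein256_rot[6]);
      round256(x, skein256_rot[7]);
    }
  add_subkey256(x, k, t, 18);
}
```

The point of unrolling by 8 rounds is that the rotation counts repeat
with period 8, so inside the loop body every rotation count is a
compile-time constant.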

I did an x86_64 assembly version, but I haven't been able to beat the C
code compiled by gcc, so I think I'll scrap that. That isn't so
surprising, since skein uses only operations that the C compiler knows
well, and I'm not trying anything clever with scheduling or register
allocation.

For skein512, subkeys are accessed with mod 9 indexing, which is
challenging to do with high performance, if indexes need to be
constructed at runtime. I get pretty good performance with full
unrolling (so indices are constant), and I'm afraid we have to do either
that, or copy subkeys into an area where they are repeated multiple
times.
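
The second alternative, copying the subkeys into an area where they are
repeated, could look roughly like the sketch below. It assumes 19
subkeys (numbered 0 to 18, for 72 rounds with one subkey every 4 rounds)
of 8 words each, so the largest index s + i is 25 and a 26-word table
removes the mod 9 entirely. The names expand_keys512 and
subkey_word512 are hypothetical, and the tweak and subkey-number
contributions are left out.

```c
#include <stdint.h>

#define SKEIN512_KEY_WORDS 9    /* 8 key words plus the extra ninth word */
#define SKEIN512_EXT_WORDS 26   /* max subkey 18 + max word 7, plus one */

/* Fill ext[j] = k[j mod 9], without a division in the loop. */
void
expand_keys512(uint64_t ext[SKEIN512_EXT_WORDS],
               const uint64_t k[SKEIN512_KEY_WORDS])
{
  unsigned j, src = 0;
  for (j = 0; j < SKEIN512_EXT_WORDS; j++)
    {
      ext[j] = k[src];
      if (++src == SKEIN512_KEY_WORDS)
        src = 0;
    }
}

/* Key part of word i of subkey s: a plain add replaces (s + i) mod 9. */
uint64_t
subkey_word512(const uint64_t ext[SKEIN512_EXT_WORDS],
               unsigned s, unsigned i)
{
  return ext[s + i];
}
```

The cost is 26 * 8 = 208 bytes of scratch space per block, plus the copy,
in exchange for a loop that can stay rolled.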

As I think I wrote earlier, skein looks similar in spirit to salsa20 and
chacha, but unlike those, it doesn't fit well with simd instructions.
To use simd instructions for skein, one would like to put two 64-bit
values in one xmm register, but one then needs a way to rotate the two
halves with different shift counts, which I haven't found any good way
to do. Also the odd number of subkeys (five for skein256, and nine for
skein512, with the last subkey being the xor of all the other keys and a
magic constant), used in a rotating fashion, doesn't fit well with
storing keys in simd registers.
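
The rotation problem can be seen in a scalar model of one skein256 round.
The sketch below uses a hypothetical struct lanes2 to stand in for an xmm
register holding two 64-bit words. The additions and xors map directly to
one vector instruction each, but the two lanes need different rotation
counts, while the SSE2 64-bit shifts (psllq, psrlq) apply a single count
to both lanes.

```c
#include <stdint.h>

/* Rotate left, valid for 0 < n < 64. */
static uint64_t
rotl64(uint64_t x, unsigned n)
{
  return (x << n) | (x >> (64 - n));
}

/* Hypothetical model of one xmm register holding two 64-bit words. */
struct lanes2 { uint64_t lo, hi; };

/* Two MIX operations side by side: even holds (x0, x2), odd holds
   (x1, x3). r_lo and r_hi are the two rotation counts of the round. */
void
mix_pair(struct lanes2 *even, struct lanes2 *odd,
         unsigned r_lo, unsigned r_hi)
{
  /* One vector add would cover both lanes. */
  even->lo += odd->lo;
  even->hi += odd->hi;
  /* The awkward part: each lane is rotated by its own count. */
  odd->lo = rotl64(odd->lo, r_lo) ^ even->lo;
  odd->hi = rotl64(odd->hi, r_hi) ^ even->hi;
}
```

With one shared count, a rotate costs two shifts and an or. With two
counts, a straightforward SSE2 version would have to do that work once
per count and then select the wanted lane from each result, which eats
most of the gain from processing two words at a time.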

I'm benchmarking on an Intel Broadwell CPU (marketing name "Core
i3-5010U") running at 2.1 GHz.

         Algorithm        mode Mbyte/s cycles/byte cycles/block
          skein256      update  242.45        8.26       264.33
          skein512      update  350.44        5.71       365.75
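
For reference, the columns are related as in the small sketch below. It
assumes that "Mbyte" means 2^20 bytes, which is what reproduces the
figures in the table at 2.1 GHz, and that the block size is 32 bytes for
skein256 and 64 bytes for skein512. The function names are made up for
illustration.

```c
/* Assumption: throughput is given in units of 2^20 bytes per second. */
double
cycles_per_byte(double mbyte_per_s, double clock_hz)
{
  return clock_hz / (mbyte_per_s * 1048576.0);
}

/* Cycles per block is cycles per byte times the block size in bytes. */
double
cycles_per_block(double mbyte_per_s, double clock_hz, unsigned block_size)
{
  return cycles_per_byte(mbyte_per_s, clock_hz) * block_size;
}
```

For example, cycles_per_byte(242.45, 2.1e9) comes out at about 8.26, and
multiplying by the 32-byte block size gives about 264.3 cycles per block,
matching the skein256 row.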

For comparison, timing for sha1, sha2 and sha3:

         Algorithm        mode Mbyte/s cycles/byte cycles/block
              sha1      update  326.40        6.14       392.69
      openssl sha1      update  560.98        3.57       228.48
            sha256      update  156.07       12.83       821.24
            sha512      update  252.54        7.93      1015.10
          sha3_224      update  161.90       12.37      1781.33
          sha3_256      update  152.83       13.10      1782.18
          sha3_384      update  117.06       17.11      1779.22
          sha3_512      update   80.52       24.87      1790.77

So skein512 is faster than both sha2 and sha3 (and one can also see that
for sha1 we currently lose to openssl). skein256 is faster than sha3,
but slightly slower than sha512. So maybe we shouldn't do skein256 at
all, but skein512-256 (skein can be used with arbitrary output size).

Code size is 408 bytes for skein256, and 3992 bytes for skein512
(which is completely unrolled), counting only the main block processing
function.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.