"Christopher M. Riedl" <c...@linux.ibm.com> writes:

> An implementation combining AES+GCM _can potentially_ yield significant
> performance boosts by allowing for increased instruction parallelism, avoiding
> C-function call overhead, more flexibility in assembly fine-tuning, etc. This
> series provides such an implementation based on the existing optimized Nettle
> routines for POWER9 and later processors. Benchmark results on a POWER9
> Blackbird running at 3.5GHz are given at the end of this mail.

Benchmark results are impressive. If I read the numbers right, cycles per
block (16 bytes) are reduced from 40 to 22.5. You can run
nettle-benchmark with the flag -f 3.5e9 (for a 3.5GHz clock frequency) to
get cycle numbers in the output.

I'm a bit conservative about adding assembly code for combined
operations, since it can lead to an explosion in the amount of code to
maintain. So I'd like to understand a bit better where the 17.5 saved
cycles were spent. For the code on master, gcm_encrypt (with aes) is built from
these building blocks:

  * gcm_fill

    C code, essentially 2 64-bit stores per block. On little endian, it
    also needs some byte swapping.

  * aes_encrypt

    Using power assembly. Performance measured as the "aes128  ECB
    encrypt" line in nettle-benchmark output.

  * memxor3

    This is C code on power (and rather hairy C code). Performance can
    be measured with nettle-benchmark, and it's going to be a bit
    alignment dependent.

  * gcm_hash

    This uses power assembly. Performance is measured as the "gcm
    update" line in nettle-benchmark output. From your numbers, this
    seems to be 7.3 cycles per block.
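For reference, the gcm_fill step can be sketched in portable C roughly
like this (a byte-wise illustration with names of my own choosing, not
Nettle's actual internals; the real code does two 64-bit stores per
block, plus the byte swapping on little endian):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Replicate the 16-byte counter block into each output block,
   incrementing the trailing 32-bit big-endian counter once per block.
   GCM's counter occupies bytes 12..15 only, and wraps without carrying
   into byte 11. */
static void
sketch_gcm_fill(uint8_t *ctr, size_t blocks, uint8_t *buffer)
{
  for (size_t i = 0; i < blocks; i++, buffer += 16)
    {
      memcpy(buffer, ctr, 16);   /* two 64-bit stores in practice */
      /* Increment the 32-bit big-endian counter in bytes 12..15. */
      for (int j = 15; j >= 12 && ++ctr[j] == 0; j--)
        ;
    }
}
```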

So before going all the way with a combined aes_gcm function, I think
it's good to try to optimize the building blocks. Please benchmark
memxor3, to see if it could benefit from an assembly implementation. If so,
that should give a nice speedup to several modes, not just gcm. (If you
implement memxor3 in assembly, beware that it needs to support some
overlap, so as not to break in-place CBC decrypt).
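To make the overlap requirement concrete, here is a byte-at-a-time
sketch of memxor3 (dst = a ^ b), with a name of my own; like the
portable C version, it walks backwards from the end of the buffers,
which is what makes it safe when the destination sits at a higher
address than an overlapping source, as in in-place CBC decrypt. A real
implementation would of course do word-sized accesses:

```c
#include <stdint.h>
#include <stddef.h>

/* dst = a ^ b, processing from the end so that writes to dst only
   clobber source bytes that have already been read when dst is at a
   higher address than an overlapping source operand. */
static void *
sketch_memxor3(void *dst_in, const void *a_in, const void *b_in, size_t n)
{
  uint8_t *dst = dst_in;
  const uint8_t *a = a_in;
  const uint8_t *b = b_in;

  while (n-- > 0)
    dst[n] = a[n] ^ b[n];

  return dst_in;
}
```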

Another potential overhead is that data is stored to memory when passed
between these functions. It seems we store each block 3 times, and load
each block 4 times (the additional accesses should be cache friendly, but
will still cost some execution resources). Optimizing that seems to need
some kind of combined function. But maybe it is sufficient to optimize
something a bit more general than aes gcm, e.g., aes ctr?
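What I have in mind is something along these lines (a sketch only, with
a hypothetical stand-in for the block cipher; not Nettle's actual ctr
code): generate the keystream block and xor it into the data in the same
loop, so the counter block and keystream stay in a small scratch buffer
instead of making full-buffer round trips to memory between gcm_fill,
the cipher, and memxor3:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 16

/* Hypothetical block cipher interface, for illustration only. */
typedef void block_fn(const void *ctx, uint8_t *dst, const uint8_t *src);

/* Toy stand-in cipher so the sketch is self-contained: xor with a
   fixed byte.  A real cipher would be aes_encrypt. */
static void
toy_cipher(const void *ctx, uint8_t *dst, const uint8_t *src)
{
  (void) ctx;
  for (int i = 0; i < BLOCK_SIZE; i++)
    dst[i] = src[i] ^ 0xAA;
}

/* Combined ctr pass: fill, encrypt and xor per block, in one loop. */
static void
sketch_ctr_crypt(const void *ctx, block_fn *f, uint8_t *ctr,
                 size_t length, uint8_t *dst, const uint8_t *src)
{
  uint8_t key_stream[BLOCK_SIZE];

  for (; length >= BLOCK_SIZE;
       length -= BLOCK_SIZE, dst += BLOCK_SIZE, src += BLOCK_SIZE)
    {
      f(ctx, key_stream, ctr);                 /* E_k(counter) */
      for (size_t i = 0; i < BLOCK_SIZE; i++)  /* xor in the same pass */
        dst[i] = src[i] ^ key_stream[i];
      for (int j = BLOCK_SIZE - 1; j >= 12 && ++ctr[j] == 0; j--)
        ;                                      /* bump 32-bit BE counter */
    }
}
```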

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
