On 06/27/11 11:32 PM, Bill Sommerfeld wrote:
> On 06/27/11 15:24, David Magda wrote:
>> Given the amount of transistors that are available nowadays I think
>> it'd be simpler to just create a series of SIMD instructions right
>> in/on general CPUs, and skip the whole co-processor angle.
>
> see: http://en.wikipedia.org/wiki/AES_instruction_set
>
> Present in many current Intel CPUs; also expected to be present in AMD's
> "Bulldozer" based CPUs.

I recall seeing a blog post comparing the existing hand-tuned Solaris AES assembler with the (then) new AES instruction version, in which the Intel AES instructions only got you about a 30% performance increase. I've seen reports of bigger improvements, but usually from comparisons against older processors, which are going to be slower for reasons other than just lacking the AES instructions. You could also claim a bigger improvement by comparing against a less efficient original implementation of AES. What this means is that a faster CPU may buy you more crypto performance than the AES instructions alone will.

My understanding from reading about the Intel AES instruction set (which I warn might not be completely correct) is that the AES encryption/decryption instruction is executed between 10 and 14 times (depending on key length) for each 128 bits (16 bytes) of data being encrypted/decrypted, so it's very much part of the regular instruction pipeline. The code has to loop through this process once per 16-byte block to handle anything bigger than 16 bytes, i.e. a doubly nested loop, although I expect it's normally loop-unrolled to a fair degree for optimisation purposes.
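
To make the shape of that doubly nested loop concrete, here is a rough Python sketch that only counts how many round instructions would be issued. It does no actual encryption. The function name and the assumption that data is padded up to whole 16-byte blocks are mine, purely for illustration; the round counts (10, 12, 14) are the standard AES values for 128-, 192- and 256-bit keys.

```python
# Rounds per 16-byte block for each AES key length (standard AES values).
ROUNDS = {128: 10, 192: 12, 256: 14}
BLOCK_BYTES = 16  # AES always works on 128-bit blocks


def count_round_instructions(data_len, key_bits):
    """Count round-instruction executions needed for data_len bytes.

    Hypothetical helper: assumes the data is padded up to a whole
    number of 16-byte blocks and that each round costs one instruction.
    """
    rounds = ROUNDS[key_bits]
    blocks = (data_len + BLOCK_BYTES - 1) // BLOCK_BYTES  # round up
    executed = 0
    for _block in range(blocks):      # outer loop: one pass per 16-byte block
        for _round in range(rounds):  # inner loop: one round instruction each
            executed += 1
    return executed


if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, count_round_instructions(1024, bits))
```

So a 1 kbyte buffer is 64 blocks, and with a 128-bit key that is 64 x 10 round instructions, all of them sitting in the ordinary instruction stream of the core doing the work.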

Conversely, the crypto units in the T-series processors are separate from the CPU, and do the encryption/decryption whilst the CPU is getting on with something else, and they do it much faster than it could be done on the CPU. Small blocks are normally a problem for crypto offload engines, because the overhead of handing the work to the engine and getting the result back often exceeds the time it would take to just do the crypto on the CPU. However, T-series crypto is particularly good at handling small blocks efficiently, such as the 1 kbyte or so you are likely to find in a network packet, as it is much more closely coupled to the CPU than a PCI crypto card can be, and performance with small packets was key for the crypto networking support T-series was designed for. Of course, it handles crypto of large blocks just fine too.
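
The small-block trade-off can be put as a simple break-even calculation. The sketch below is a toy model only: the function names are hypothetical, and the setup costs and throughput figures in the example are numbers I made up to show the shape of the curve, not measurements of T-series, Intel, or any PCI card.

```python
def cpu_time_us(n_bytes, cpu_bytes_per_us):
    """Time to do the crypto on the CPU itself (toy model)."""
    return n_bytes / cpu_bytes_per_us


def offload_time_us(n_bytes, setup_us, engine_bytes_per_us):
    """Time to offload: fixed hand-off overhead plus engine processing."""
    return setup_us + n_bytes / engine_bytes_per_us


def break_even_bytes(setup_us, cpu_bytes_per_us, engine_bytes_per_us):
    """Block size above which offloading wins in this model.

    Solves n/cpu = setup + n/engine for n. Returns None if the engine
    is not faster than the CPU, since offload then never wins.
    """
    if engine_bytes_per_us <= cpu_bytes_per_us:
        return None
    return setup_us / (1.0 / cpu_bytes_per_us - 1.0 / engine_bytes_per_us)


if __name__ == "__main__":
    # Invented figures: CPU at 100 bytes/us, engine at 400 bytes/us.
    for setup in (10.0, 1.0):  # loosely coupled vs closely coupled engine
        print(setup, break_even_bytes(setup, 100.0, 400.0))
```

With those invented figures, a 10 us hand-off cost puts the break-even point above 1 kbyte, so a network-packet-sized block is quicker on the CPU, whereas a 1 us hand-off cost brings break-even down to well under 1 kbyte. That is the effect of tighter coupling, whatever the real numbers turn out to be.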

--
Andrew
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
