Re: [PATCH 0/2] Sparc AES crypto opcode support.
> This builds on top of the 7 patch series I sent the other day which laid the foundation for sparc crypto opcode support.
>
> The first patch plugs in optimized versions of key expansion and AES_{decrypt,encrypt}().
>
> The second patch is modelled on the AESNI support and explicitly optimizes ECB, CBC, CTR, OFB, and CFB modes. I'll do the remaining modes soon.

I feel that we need to take a step back and reiterate. The preprocessor isn't mighty enough on Solaris and we have to come up with an alternative solution. I can (and am willing to :-) make a suggestion, but maybe not as soon as you [or somebody else] might anticipate...

Meanwhile, some side notes. What is the rationale behind choosing an interleave factor of two for parallelizable modes? Judging from the aes-128 cbc encrypt benchmarks, AES round instruction latency is 4. If the processor can pair together two half-round instructions (I refer to the fact that it takes two instructions to perform a single round), then the optimal interleave factor should be 4. Do you have performance metrics, specifically throughput, for the instructions in question? Did you attempt a higher interleave factor?

Is ECB a must-have? Are there critical applications? I mean it's probably a lesser point to implement as many modes as possible...

Speaking of modes: I haven't examined the DES and Camellia submission, but I'm going to push for sharing mode code between all submissions...

__
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majord...@openssl.org
Re: [PATCH 0/2] Sparc AES crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 17:15:34 +0200

> What is the rationale behind choosing an interleave factor of two for parallelizable modes? Judging from the aes-128 cbc encrypt benchmarks, AES round instruction latency is 4. If the processor can pair together two half-round instructions (I refer to the fact that it takes two instructions to perform a single round), then the optimal interleave factor should be 4. Do you have performance metrics, specifically throughput, for the instructions in question? Did you attempt a higher interleave factor?

The AES round instruction latency is 3 cycles.

We don't have enough registers to unroll it by another factor, unless we flush through the float registers, reloading the KEY as we already have to do for the 256-bit case with the 2-way unroll factor.

> Is ECB a must-have? Are there critical applications? I mean it's probably a lesser point to implement as many modes as possible...

It costs very little to support it, and I intend to support all modes explicitly with AES for completeness.
Re: [PATCH 0/2] Sparc AES crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 17:15:34 +0200

> The preprocessor isn't mighty enough on Solaris and we have to come up with an alternative solution.

Are you really sure Solaris's CPP can't do proper pasting? Perhaps there is a c99 mode or similar option that isn't being passed in CFLAGS that could be added?
Re: [PATCH 0/2] Sparc AES crypto opcode support.
>> What is the rationale behind choosing an interleave factor of two for parallelizable modes? Judging from the aes-128 cbc encrypt benchmarks, AES round instruction latency is 4. If the processor can pair together two half-round instructions (I refer to the fact that it takes two instructions to perform a single round), then the optimal interleave factor should be 4. Do you have performance metrics, specifically throughput, for the instructions in question? Did you attempt a higher interleave factor?
>
> The AES round instruction latency is 3 cycles.

As mentioned, the result looks more like 4, so it's either 4, or something holds it back (there might be room for improvement then), or I estimated it wrong. But the question was whether the processor is capable of scheduling two independent ones at the same time. If it is, then a higher interleave is more appropriate and would still outweigh the losses from spilling key material, and I reckon the difference wouldn't be nominal. What would be absolutely best is to know how it would look in the next generation, so that one can pick a future-safe factor. I mean a higher-than-optimal interleave factor doesn't have as much negative effect as a lower-than-optimal one.

> We don't have enough registers to unroll it by another factor,

    aes01 %key0,%reg0,%reg1,%reg2
    aes23 %key1,%reg0,%reg1,%reg1    <- 1, not 3
    aes01 %key2,%reg2,%reg1,%reg0
    aes23 %key4,%reg2,%reg1,%reg1

allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I get it wrong? Or would a 3-register arrangement like the one above not work?
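The interleave argument above can be put into a small back-of-the-envelope model. The sketch below is not from the thread. It assumes an idealized in-order core where each AES round costs two half-round instructions per block, a block's next round waits on both halves of its current round, and a result becomes usable `latency` cycles after issue. The function names and the `issue_width` parameter are illustrative only, and register pressure and key reloads are ignored.

```python
import math

HALVES_PER_ROUND = 2  # aes01 + aes23: two instructions per AES round


def cycles_per_round(latency, issue_width, factor):
    """Steady-state cycles to push one round of `factor` interleaved blocks.

    Assumed model (hypothetical, not measured): in-order issue of up to
    `issue_width` instructions per cycle, results usable `latency` cycles
    after issue.
    """
    # Cycles needed just to issue every half-round of every block.
    issue_bound = math.ceil(HALVES_PER_ROUND * factor / issue_width)
    # The first block's next round cannot start until its last half-round,
    # issued at this cycle offset, has produced its result.
    last_half_offset = (HALVES_PER_ROUND - 1) // issue_width
    dependency_bound = last_half_offset + latency
    return max(issue_bound, dependency_bound)


def min_stall_free_factor(latency, issue_width):
    """Smallest interleave factor at which issue, not latency, is the limit."""
    last_half_offset = (HALVES_PER_ROUND - 1) // issue_width
    factor = 1
    while math.ceil(HALVES_PER_ROUND * factor / issue_width) < last_half_offset + latency:
        factor += 1
    return factor


if __name__ == "__main__":
    for latency, width in [(3, 1), (4, 1), (4, 2)]:
        best = min_stall_free_factor(latency, width)
        costs = [cycles_per_round(latency, width, k) / k for k in (1, 2, 4, 8)]
        print(f"latency={latency} issue_width={width} "
              f"min factor={best} cycles/block/round at 1,2,4,8x={costs}")
```

Under these assumptions, a 3-cycle latency with single issue gives a factor of 2, while a 4-cycle latency with paired issue gives 4. Factors above the minimum leave the per-block cost unchanged in this model, whereas factors below it raise the cost.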
Re: [PATCH 0/2] Sparc AES crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 21:00:18 +0200

>> We don't have enough registers to unroll it by another factor,
>
>     aes01 %key0,%reg0,%reg1,%reg2
>     aes23 %key1,%reg0,%reg1,%reg1    <- 1, not 3
>     aes01 %key2,%reg2,%reg1,%reg0
>     aes23 %key4,%reg2,%reg1,%reg1
>
> allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I get it wrong? Or would a 3-register arrangement like the one above not work?

Thanks for your suggestions, I'll look into this.
Re: [PATCH 0/2] Sparc AES crypto opcode support.
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 21:00:18 +0200

>     aes01 %key0,%reg0,%reg1,%reg2
>     aes23 %key1,%reg0,%reg1,%reg1    <- 1, not 3
>     aes01 %key2,%reg2,%reg1,%reg0
>     aes23 %key4,%reg2,%reg1,%reg1
>
> allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I get it wrong? Or would a 3-register arrangement like the one above not work?

These instructions have a 3-cycle latency, for example:

    aes_eround01 %f8, %f0, %f2, %f4
    aes_eround23 %f10, %f0, %f2, %f6
    [stall]
    [stall]
    aes_eround01 %f12, %f4, %f6, %f0
    aes_eround23 %f14, %f4, %f6, %f2

Whereas, of course:

    aes_eround01 %f8, %f0, %f2, %f4
    aes_eround23 %f10, %f0, %f2, %f6
    aes_eround01 %f8, %f56, %f58, %f60
    aes_eround23 %f10, %f56, %f58, %f62

executes without any stall cycles. As does:

    aes_eround01 %f8, %f0, %f2, %f4
    aes_eround23 %f10, %f0, %f2, %f6
    aes_eround01 %f8, %f56, %f58, %f60
    aes_eround23 %f10, %f56, %f58, %f62
    aes_eround01 %f12, %f4, %f6, %f0
    aes_eround23 %f14, %f4, %f6, %f2
    aes_eround01 %f12, %f60, %f62, %f56
    aes_eround23 %f14, %f60, %f62, %f58

which is why unrolling by a factor of 2 is optimal, at least from a scheduling viewpoint.

The other issue is that fxor is really expensive. Unlike the crypto opcodes, it executes in the remote FPU and so has a minimum 12-cycle latency, although it does pipeline. I tried many experiments using integer xor and movxtod instead of fxor, but it turned out to be a wash.

Another issue is that loads have a 4-cycle latency, but there really isn't a whole lot of room to move those around in the loops.

As for register usage, it depends on whether we use float registers for the IV. Since movxtod/movdtox is relatively cheap (1 cycle, and it pairs with other instructions), if we really needed to gain some float register space back we could always use an integer register for the IV; some of my code already does this.
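The three listings above can be replayed through a toy scoreboard to see where the stall cycles come from. This is a sketch under stated assumptions, not a model of the real pipeline: single in-order issue, a fixed 3-cycle result latency (the figure quoted above), only read-after-write dependencies tracked, and the last operand of each instruction treated as the destination. The helper names are made up for illustration.

```python
LATENCY = 3  # cycles; figure quoted in the thread for the AES round opcodes

ONE_BLOCK_TWO_ROUNDS = """
aes_eround01 %f8, %f0, %f2, %f4
aes_eround23 %f10, %f0, %f2, %f6
aes_eround01 %f12, %f4, %f6, %f0
aes_eround23 %f14, %f4, %f6, %f2
"""

TWO_BLOCKS_ONE_ROUND = """
aes_eround01 %f8, %f0, %f2, %f4
aes_eround23 %f10, %f0, %f2, %f6
aes_eround01 %f8, %f56, %f58, %f60
aes_eround23 %f10, %f56, %f58, %f62
"""

TWO_BLOCKS_TWO_ROUNDS = """
aes_eround01 %f8, %f0, %f2, %f4
aes_eround23 %f10, %f0, %f2, %f6
aes_eround01 %f8, %f56, %f58, %f60
aes_eround23 %f10, %f56, %f58, %f62
aes_eround01 %f12, %f4, %f6, %f0
aes_eround23 %f14, %f4, %f6, %f2
aes_eround01 %f12, %f60, %f62, %f56
aes_eround23 %f14, %f60, %f62, %f58
"""


def parse(listing):
    """Split each line into (opcode, source registers, destination register)."""
    instrs = []
    for line in listing.strip().splitlines():
        opcode, operands = line.split(None, 1)
        regs = [r.strip() for r in operands.split(",")]
        instrs.append((opcode, regs[:-1], regs[-1]))
    return instrs


def count_stalls(listing, latency=LATENCY):
    """Count idle issue cycles on an assumed single-issue, in-order core."""
    ready_at = {}  # register -> first cycle at which its new value can be read
    cycle = 0      # next free issue slot
    stalls = 0
    for _opcode, sources, dest in parse(listing):
        # Issue no earlier than the next slot and no earlier than every source.
        issue = max([cycle] + [ready_at.get(reg, 0) for reg in sources])
        stalls += issue - cycle
        ready_at[dest] = issue + latency
        cycle = issue + 1
    return stalls


if __name__ == "__main__":
    for name in ("ONE_BLOCK_TWO_ROUNDS", "TWO_BLOCKS_ONE_ROUND", "TWO_BLOCKS_TWO_ROUNDS"):
        print(name, count_stalls(globals()[name]))
```

With these assumptions the single-block sequence loses two cycles between rounds, and both two-block sequences issue back to back, which matches the stall placement shown in the listings.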
[PATCH 0/2] Sparc AES crypto opcode support.
This builds on top of the 7 patch series I sent the other day which laid the foundation for sparc crypto opcode support.

The first patch plugs in optimized versions of key expansion and AES_{decrypt,encrypt}().

The second patch is modelled on the AESNI support and explicitly optimizes ECB, CBC, CTR, OFB, and CFB modes. I'll do the remaining modes soon.

I've put this through a battery of tests, and in particular I hacked up a local copy of test/test_aesni (which doesn't seem to get run even on x86?) that uses the appropriate sparc environment variable to turn off crypto opcode usage. That script helped a lot during validation.

The 35GB/sec benchmark result in the second patch is not a typo :-)

Signed-off-by: David S. Miller da...@davemloft.net