Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread Andy Polyakov
 This builds on top of the 7 patch series I sent the other day which
 laid the foundation for sparc crypto opcode support.
 
 The first patch plugs in optimized versions of key expansion and
 AES_{decrypt,encrypt}()
 
 The second patch is modelled on the AESNI support and explicitly
 optimizes ECB, CBC, CTR, OFB, and CFB modes.  I'll do the remaining
 modes soon.

I feel that we need to take a step back and reiterate. Preprocessor
isn't mighty enough on Solaris and we have to come up with alternative
solution. I can (and am willing to :-) make a suggestion, but maybe not as
soon as you [or somebody else] might anticipate...

Meanwhile some side notes.

What is rationale behind choosing interleave factor of two for
parallelizable modes? Judging from aes-128 cbc encrypt benchmarks AES
round instruction latency is 4. If processor can pair together two
half-round instructions (I refer to fact that it takes two instructions
to perform single round), then optimal interleave factor should be 4. Do
you have performance metrics, specifically throughput, for instructions
in question? Did you attempt higher interleave factor?

Is ECB a must have? Are there critical applications? I mean it's
probably a lesser point to implement as many modes as possible...

Speaking of modes. I haven't examined the DES and Camellia submission,
but I'm going to push for sharing mode code between all submissions...

__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   majord...@openssl.org


Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 17:15:34 +0200

 What is rationale behind choosing interleave factor of two for
 parallelizable modes? Judging from aes-128 cbc encrypt benchmarks AES
 round instruction latency is 4. If processor can pair together two
 half-round instructions (I refer to fact that it takes two instructions
 to perform single round), then optimal interleave factor should be 4. Do
 you have performance metrics, specifically throughput, for instructions
 in question? Did you attempt higher interleave factor?

The AES round instruction latency is 3 cycles.

We don't have enough registers to unroll it by another factor, unless
we flush through the float registers reloading the KEY as we already
have to do for the 256-bit case with the 2 way unroll factor.

 Is ECB a must have? Are there critical applications? I mean it's
 probably a lesser point to implement as many modes as possible...

It costs very little to support it, and I intend to support all modes
explicitly with AES for completeness.


Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 17:15:34 +0200

 Preprocessor isn't mighty enough on Solaris and we have to come up
 with alternative solution.

Are you really sure Solaris's CPP can't do proper pasting?

Perhaps there is a c99 mode or similar option that isn't being passed
in CFLAGS that could be added?


Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread Andy Polyakov

 What is rationale behind choosing interleave factor of two for
 parallelizable modes? Judging from aes-128 cbc encrypt benchmarks AES
 round instruction latency is 4. If processor can pair together two
 half-round instructions (I refer to fact that it takes two instructions
 to perform single round), then optimal interleave factor should be 4. Do
 you have performance metrics, specifically throughput, for instructions
 in question? Did you attempt higher interleave factor?

 The AES round instruction latency is 3 cycles.

As mentioned, the result looks more like 4, so it's either 4, or
something holds it back (there might be room for improvement then), or I
estimated it wrong. But the question was whether the processor is capable
of scheduling two independent ones at the same time. If it is, then higher
interleave is more appropriate and would still outweigh losses from
spilling key material, and I reckon the difference wouldn't be nominal.
What would be absolutely best is to know how it would look in the next
generation, so that one can pick a future-safe factor. I mean a
higher-than-optimal interleave factor doesn't have as much negative
effect as a lower-than-optimal one.



 We don't have enough registers to unroll it by another factor,


aes01   %key0,%reg0,%reg1,%reg2
aes23   %key1,%reg0,%reg1,%reg1  ! 1, not 3
aes01   %key2,%reg2,%reg1,%reg0
aes23   %key4,%reg2,%reg1,%reg1

allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I get 
it wrong? Or would 3-register arrangement like above not work?
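
One way to cross-check the register arithmetic is to count in 64-bit
double-precision registers, of which SPARC V9 provides 32 (%f0..%f62).
The following is only a sketch: it assumes every round key stays resident
in float registers and ignores anything needed for the IV or temporaries.
The names regs_needed, fits and resident_keys are hypothetical and do not
come from the patch.

```python
DOUBLE_REGS = 32  # SPARC V9 double-precision registers %f0..%f62
ROUND_KEYS = {128: 11, 192: 13, 256: 15}  # AES round keys, incl. the initial one


def regs_needed(key_bits, interleave, regs_per_block, resident_keys=None):
    """64-bit registers needed for key material plus the blocks in flight.

    resident_keys (assumption) overrides how many round keys are held in
    float registers, e.g. if part of the schedule is reloaded or applied
    elsewhere.
    """
    keys = ROUND_KEYS[key_bits] if resident_keys is None else resident_keys
    return 2 * keys + interleave * regs_per_block  # each round key is 128 bits


def fits(key_bits, interleave, regs_per_block, resident_keys=None):
    """True if the budget stays within the 32 double-precision registers."""
    return regs_needed(key_bits, interleave, regs_per_block, resident_keys) <= DOUBLE_REGS


if __name__ == "__main__":
    for bits in (128, 192, 256):
        need = regs_needed(bits, interleave=4, regs_per_block=3)
        print(f"AES-{bits}, 4x interleave, 3 regs/block: {need} of {DOUBLE_REGS}")
```

Under these assumptions regs_needed(192, 4, 3) comes to 38, which exceeds
32; passing a smaller resident_keys models keeping part of the key
schedule outside the float registers.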



Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 21:00:18 +0200

 We don't have enough registers to unroll it by another factor,
 
   aes01   %key0,%reg0,%reg1,%reg2
   aes23   %key1,%reg0,%reg1,%reg1  ! 1, not 3
   aes01   %key2,%reg2,%reg1,%reg0
   aes23   %key4,%reg2,%reg1,%reg1
 
 allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I
 get it wrong? Or would 3-register arrangement like above not work?

Thanks for your suggestions, I'll look into this.


Re: [PATCH 0/2] Sparc AES crypto opcode support.

2012-09-28 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 28 Sep 2012 21:00:18 +0200

   aes01   %key0,%reg0,%reg1,%reg2
   aes23   %key1,%reg0,%reg1,%reg1  ! 1, not 3
   aes01   %key2,%reg2,%reg1,%reg0
   aes23   %key4,%reg2,%reg1,%reg1
 
 allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I
 get it wrong? Or would 3-register arrangement like above not work?

These instructions have a 3 cycle latency, for example:

aes_eround01  %f8,   %f0,  %f2,  %f4
aes_eround23  %f10,  %f0,  %f2,  %f6
[stall]
[stall]
aes_eround01  %f12,  %f4,  %f6,  %f0
aes_eround23  %f14,  %f4,  %f6,  %f2

Whereas, of course:

aes_eround01  %f8,   %f0,  %f2,  %f4
aes_eround23  %f10,  %f0,  %f2,  %f6
aes_eround01  %f8,  %f56, %f58, %f60
aes_eround23  %f10, %f56, %f58, %f62

executes without any stall cycles.  As does:

aes_eround01  %f8,   %f0,  %f2,  %f4
aes_eround23  %f10,  %f0,  %f2,  %f6
aes_eround01  %f8,  %f56, %f58, %f60
aes_eround23  %f10, %f56, %f58, %f62
aes_eround01  %f12,  %f4,  %f6,  %f0
aes_eround23  %f14,  %f4,  %f6,  %f2
aes_eround01  %f12, %f60, %f62,  %f56
aes_eround23  %f14, %f60, %f62,  %f58

which is why unrolling by a factor of 2 is optimal, at least
from a scheduling viewpoint.
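
The scheduling argument above can be reproduced with a toy model. The
sketch below assumes a strictly in-order pipeline in which both
half-round opcodes of a round read the outputs of the previous round for
the same block. The function name simulate and its parameters are made up
for illustration, and the latency and issue-width values fed to it are
the ones debated in this thread, not measurements.

```python
def simulate(latency, interleave, issue_width=1, rounds=10):
    """Toy in-order issue model; returns (total_cycles, stall_cycles)."""
    ready = [0] * interleave  # cycle at which each block's previous round is done
    cycle = 0                 # cycle currently being filled
    slots = 0                 # instructions already issued in `cycle`
    busy = set()              # cycles in which at least one instruction issued

    for _ in range(rounds):
        for block in range(interleave):
            done = 0
            for _half in range(2):        # eround01, then eround23
                if slots == issue_width:  # issue slots used up, move on
                    cycle, slots = cycle + 1, 0
                if ready[block] > cycle:  # operands not ready yet: stall
                    cycle, slots = ready[block], 0
                slots += 1
                busy.add(cycle)
                done = max(done, cycle + latency)
            ready[block] = done           # next round needs both halves

    total = cycle + 1
    return total, total - len(busy)


if __name__ == "__main__":
    # (3, 1): 3-cycle latency, single issue; (4, 2): the hypothetical
    # 4-cycle latency with paired half-round issue.
    for latency, width in ((3, 1), (4, 2)):
        for factor in (1, 2, 4):
            total, stalls = simulate(latency, factor, width)
            print(f"latency={latency} issue_width={width} "
                  f"interleave={factor}: {total} cycles, {stalls} stalls")
```

With latency 3 and single issue, two rounds of one block show the two
stall cycles from the first listing, and a 2-way interleave shows none.
With the hypothetical latency 4 and paired issue, the model only runs
stall-free from a 4-way interleave upward.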

The other issue is that fxor is really expensive.  Unlike the crypto
opcodes, it executes in the remote FPU and so has a minimum 12 cycle
latency, although it does pipeline.  I tried many experiments using
integer xor and movxtod instead of fxor but it turned out to be a
wash.

Another issue is that loads have a 4 cycle latency, but there really
isn't a whole lot of room to move those around in the loops.

As for register usage, it depends on whether we use float registers
for the IV.  Since movxtod/movdtox is relatively cheap (1 cycle and
pairs with other instructions), if we really needed to gain some float
register space back we could always use an integer register for the
IV; some of my code already does this.


[PATCH 0/2] Sparc AES crypto opcode support.

2012-09-21 Thread David Miller

This builds on top of the 7 patch series I sent the other day which
laid the foundation for sparc crypto opcode support.

The first patch plugs in optimized versions of key expansion and
AES_{decrypt,encrypt}()

The second patch is modelled on the AESNI support and explicitly
optimizes ECB, CBC, CTR, OFB, and CFB modes.  I'll do the remaining
modes soon.

I've put this through a battery of tests, and in particular I hacked
up a local copy of test/test_aesni (which doesn't seem to get run even
on x86?) that uses the appropriate sparc environment variable to turn
off crypto opcode usage.  That script helped a lot during validation.

The 35GB/sec benchmark result in the second patch is not a typo :-)

Signed-off-by: David S. Miller da...@davemloft.net