From: Andy Polyakov <ap...@openssl.org>
Date: Fri, 28 Sep 2012 21:00:18 +0200

>       aes01   %key0,%reg0,%reg1,%reg2
>       aes23   %key1,%reg0,%reg1,%reg1 <<< 1, not 3
>       aes01   %key2,%reg2,%reg1,%reg0
>       aes23   %key4,%reg2,%reg1,%reg1
> 
> allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I
> get it wrong? Or would 3-register arrangement like above not work?

These instructions have a 3-cycle latency, so a round that depends on
the previous one has to wait. For example:

        aes_eround01      %f8,   %f0,  %f2,  %f4
        aes_eround23      %f10,  %f0,  %f2,  %f6
        [stall]
        [stall]
        aes_eround01      %f12,  %f4,  %f6,  %f0
        aes_eround23      %f14,  %f4,  %f6,  %f2

Whereas, of course:

        aes_eround01      %f8,   %f0,  %f2,  %f4
        aes_eround23      %f10,  %f0,  %f2,  %f6
        aes_eround01      %f8,  %f56, %f58, %f60
        aes_eround23      %f10, %f56, %f58, %f62

executes without any stall cycles.  As does:

        aes_eround01      %f8,   %f0,  %f2,  %f4
        aes_eround23      %f10,  %f0,  %f2,  %f6
        aes_eround01      %f8,  %f56, %f58, %f60
        aes_eround23      %f10, %f56, %f58, %f62
        aes_eround01      %f12,  %f4,  %f6,  %f0
        aes_eround23      %f14,  %f4,  %f6,  %f2
        aes_eround01      %f12, %f60, %f62, %f56
        aes_eround23      %f14, %f60, %f62, %f58

which is why unrolling by a factor of 2 is optimal, at least
from a scheduling viewpoint.
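
To make the stall counts above concrete, here is a rough sketch of the
arithmetic. It assumes a simplified model that is not from the hardware
documentation: one instruction issues per cycle, in order, and a result
can be consumed 3 cycles after its producer issues. The count_stalls
helper and the tuple encoding of instructions are made up for
illustration.

```python
LATENCY = 3  # cycles from issue to usable result, as stated above

def count_stalls(program, latency=LATENCY):
    """Count stall cycles for an in-order, single-issue model (assumed).

    program is a list of (dest, sources) tuples in issue order.
    Registers never written by the program (keys, initial state)
    are treated as ready at cycle 0.
    """
    ready = {}   # register -> first cycle its new value can be read
    cycle = 0    # next free issue slot
    stalls = 0
    for dest, sources in program:
        earliest = max([ready.get(r, 0) for r in sources] + [cycle])
        stalls += earliest - cycle
        ready[dest] = earliest + latency
        cycle = earliest + 1
    return stalls

# One block, two dependent rounds (the first sequence above).
one_block = [
    ("f4", ["f8",  "f0", "f2"]),   # aes_eround01
    ("f6", ["f10", "f0", "f2"]),   # aes_eround23
    ("f0", ["f12", "f4", "f6"]),   # aes_eround01, needs %f4 and %f6
    ("f2", ["f14", "f4", "f6"]),   # aes_eround23
]

# Two interleaved blocks, two rounds each (the third sequence above).
two_blocks = [
    ("f4",  ["f8",  "f0",  "f2"]),
    ("f6",  ["f10", "f0",  "f2"]),
    ("f60", ["f8",  "f56", "f58"]),
    ("f62", ["f10", "f56", "f58"]),
    ("f0",  ["f12", "f4",  "f6"]),
    ("f2",  ["f14", "f4",  "f6"]),
    ("f56", ["f12", "f60", "f62"]),
    ("f58", ["f14", "f60", "f62"]),
]

print(count_stalls(one_block), count_stalls(two_blocks))
```

Under this model the single-block sequence loses two cycles per round
and the two-block interleave loses none, so a third or fourth block
would have no stalls left to hide.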

The other issue is that fxor is really expensive.  Unlike the crypto
opcodes, it executes in the remote FPU and so has a minimum 12-cycle
latency, although it does pipeline.  I tried many experiments using
integer xor and movxtod instead of fxor but it turned out to be a
wash.
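
As a back-of-the-envelope illustration of what "pipelines" buys with a
12-cycle fxor, the sketch below compares independent and dependent
fxors. It assumes one fxor can issue per cycle, which is an assumption
for the sake of the arithmetic and not a measured figure; the function
names are made up.

```python
FXOR_LATENCY = 12  # minimum latency in cycles, as stated above

def pipelined_cycles(n, latency=FXOR_LATENCY):
    """Cycles until the last of n independent fxors completes,
    assuming one can issue every cycle."""
    return latency + (n - 1) if n > 0 else 0

def serialized_cycles(n, latency=FXOR_LATENCY):
    """Cycles for n fxors where each one consumes the previous result."""
    return latency * n

print(pipelined_cycles(4), serialized_cycles(4))
```

Four independent fxors finish in 15 cycles under this model, against 48
when each waits on the one before it, so the cost that matters is the
latency on the dependent path, not the throughput.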

Another issue is that loads have a 4-cycle latency, but there really
isn't a whole lot of room to move those around in the loops.

As for register usage, it depends on whether we use float registers
for the IV.  Since movxtod/movdtox is relatively cheap (1 cycle, and it
pairs with other instructions), if we really needed to gain some float
register space back we could always use an integer register for the
IV; some of my code already does this.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
