From: Andy Polyakov <ap...@openssl.org>
Date: Fri, 28 Sep 2012 21:00:18 +0200

> aes01 %key0,%reg0,%reg1,%reg2
> aes23 %key1,%reg0,%reg1,%reg1 <<< 1, not 3
> aes01 %key2,%reg2,%reg1,%reg0
> aes23 %key4,%reg2,%reg1,%reg1
>
> allows for 4x interleave up to 192-bit, right? 3*4+13*4=64? Or did I
> get it wrong? Or would 3-register arrangement like above not work?

These instructions have a 3 cycle latency, for example:

        aes_eround01    %f8, %f0, %f2, %f4
        aes_eround23    %f10, %f0, %f2, %f6
        [stall]
        [stall]
        aes_eround01    %f12, %f4, %f6, %f0
        aes_eround23    %f14, %f4, %f6, %f2

Whereas, of course:

        aes_eround01    %f8, %f0, %f2, %f4
        aes_eround23    %f10, %f0, %f2, %f6
        aes_eround01    %f8, %f56, %f58, %f60
        aes_eround23    %f10, %f56, %f58, %f62

executes without any stall cycles. As does:

        aes_eround01    %f8, %f0, %f2, %f4
        aes_eround23    %f10, %f0, %f2, %f6
        aes_eround01    %f8, %f56, %f58, %f60
        aes_eround23    %f10, %f56, %f58, %f62
        aes_eround01    %f12, %f4, %f6, %f0
        aes_eround23    %f14, %f4, %f6, %f2
        aes_eround01    %f12, %f60, %f62, %f56
        aes_eround23    %f14, %f60, %f62, %f58

which is why unrolling by a factor of 2 is optimal, at least from a
scheduling viewpoint.

The other issue is that fxor is really expensive. Unlike the crypto
opcodes, fxor instructions execute in the remote FPU, so they have a
minimum 12 cycle latency, although they do pipeline. I tried many
experiments using integer xor and movxtod instead of fxor, but it
turned out to be a wash.

Another issue is that loads have a 4 cycle latency, but there really
isn't a whole lot of room to move those around in the loops.

As for register usage, it depends on whether we use float registers
for the IV. Since movxtod/movdtox is relatively cheap (1 cycle and
pairs with other instructions), if we really needed to gain some float
register space back we could always use an integer register for the
IV; some of my code already does this.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
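
The stall counts in the listings above can be reproduced with a small model. The following Python sketch is only an illustration, not a description of the real pipeline: it assumes a strictly in-order core that issues one instruction per cycle and makes each result available 3 cycles after issue. The function name count_stalls and the tuple encoding of instructions are hypothetical, chosen for this example.

```python
LATENCY = 3  # cycles from issue until the result can be consumed (from the text)

def count_stalls(program, latency=LATENCY):
    """Count stall cycles for a list of (dest, sources) instructions.

    Simplifying assumptions: in-order, one instruction issued per cycle,
    and only read-after-write dependencies are tracked.
    """
    ready = {}   # register name -> first cycle its value is available
    cycle = 0    # cycle in which the next instruction would like to issue
    stalls = 0
    for dest, sources in program:
        # Registers never written in this sequence (keys, inputs) are ready at 0.
        earliest = max((ready.get(reg, 0) for reg in sources), default=0)
        if earliest > cycle:
            stalls += earliest - cycle
            cycle = earliest
        ready[dest] = cycle + latency
        cycle += 1
    return stalls

# One block, two rounds: the second round waits on %f4 and %f6.
single = [
    ("f4", ("f8", "f0", "f2")),    # aes_eround01 %f8,  %f0, %f2, %f4
    ("f6", ("f10", "f0", "f2")),   # aes_eround23 %f10, %f0, %f2, %f6
    ("f0", ("f12", "f4", "f6")),   # aes_eround01 %f12, %f4, %f6, %f0
    ("f2", ("f14", "f4", "f6")),   # aes_eround23 %f14, %f4, %f6, %f2
]

# Two blocks interleaved: the second block fills the latency gap.
interleaved = [
    ("f4", ("f8", "f0", "f2")),
    ("f6", ("f10", "f0", "f2")),
    ("f60", ("f8", "f56", "f58")),
    ("f62", ("f10", "f56", "f58")),
    ("f0", ("f12", "f4", "f6")),
    ("f2", ("f14", "f4", "f6")),
    ("f56", ("f12", "f60", "f62")),
    ("f58", ("f14", "f60", "f62")),
]

print("single block stalls:", count_stalls(single))
print("2x interleaved stalls:", count_stalls(interleaved))
```

Under these assumptions the model reports two stall cycles for the single-block sequence and none for the 2x interleaved one, which agrees with the listings. It does not model the fxor or load latencies mentioned above.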