aes-586.pl module is committed to CVS now [see
http://cvs.openssl.org/rlog?f=openssl/crypto/aes/asm/aes-586.pl]. Take
"Special note about instruction choice" in commentary section for
consideration even for AMD64. Merry Christmas to everybody:-) A.

hmmm... i seem to have done better by switching back to scaling :)

H-m-m-m... It's not like I just wrote the note off the top of my head... I actually benchmarked a 9% improvement with off-by-2 shifts on the P4 workstation at *my* disposal... Two possibilities: 1) they've changed something between steppings and we have different steppings, 2) I benchmarked an intermediate version [most likely one prior to folding the reference to %esi in the last steps]... This is exactly the kind of thing I hate about x86: it's virtually impossible to figure out in advance how it turns out in the very end... Well, I suppose I have to beat a retreat:-) Which leaves open the question of why RC4_INT code was performing so poorly on P4...
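
For readers following along, here is a rough C-level sketch of the two indexing styles being compared. It is not taken from aes-586.pl. It assumes "scaling" means a base+index*4 effective address and "off-by-2 shifts" means shifting two bits short and masking with 0xFF<<2 so the result is already a byte offset. The table and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 256-entry lookup table (not the real AES_Te). */
static uint32_t table[256];

/* Fill the table with a recognisable pattern so lookups can be compared. */
static void fill_table(void)
{
    for (uint32_t i = 0; i < 256; i++)
        table[i] = i * 0x01010101u;
}

/* "Scaling" style: extract byte 1 of s and let the address computation
 * multiply the index by 4 (on x86, a base+index*4 effective address). */
static uint32_t lookup_scaled(uint32_t s)
{
    uint32_t idx = (s >> 8) & 0xFFu;
    return table[idx];
}

/* "Off-by-2 shift" style: shift two bits less and mask with 0xFF<<2,
 * which yields a ready-made byte offset, so no scaling is needed. */
static uint32_t lookup_prescaled(uint32_t s)
{
    uint32_t byte_off = (s >> 6) & (0xFFu << 2);
    return *(const uint32_t *)((const unsigned char *)table + byte_off);
}
```

Both functions return the same word for every input; which form runs faster is exactly the stepping-dependent question discussed above, and a C compiler is free to pick either addressing form regardless of how the source is written.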


with the patch below i'm getting the following throughput improvements for aes-128-cbc 8192B buffer:

                     patch delta

        p4-2            + 3.8%
I hardly get +2%, so most likely both 1) and 2) apply...
        p4-3            +11%
        p-m             + 8.8%
        k8              +12%
        efficeon        + 4.3%

... and on Pentium I get -10%:-( Of course one can argue that targeting an out-of-date implementation is nothing but ridiculous:-) PIII gains ~4%...


the code is 229 bytes smaller in $small_footprint=1 ... i didn't look to see how much smaller it is for the fully unrolled variety (i would assume 1145 bytes or so). unfortunately this space improvement is hidden by the alignment pain caused by the placement of AES_Te and AES_Td :) i suggest moving both those tables to the top of the module so that their 64 byte alignment is taken care of once only.

Placement of the tables is constrained by perlasm, at least as it is now. But I'm aware of the problem and will keep it in mind for the next time perlasm requires surgery.
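
To illustrate the "align once" suggestion outside of perlasm, here is a minimal C11 sketch. The struct, its member names and the 256-entry sizes are hypothetical placeholders, not the module's real layout. Because the first table's size is a multiple of 64 bytes, the second one lands on a 64-byte boundary without any extra padding.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical layout: both tables kept back to back, aligned once.
 * 256 * 4 = 1024 bytes per table, a multiple of 64. */
struct demo_tables {
    alignas(64) uint32_t Te[256];
    uint32_t Td[256];
};

static struct demo_tables demo;

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

If the tables were instead emitted separately between functions, each would need its own padding up to the next 64-byte boundary, which is the space cost described above.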


here's an updated comparison versus the gladman code -- this is in cycles/byte for 8192 byte buffer (smaller is better):

                       openssl w/patch
                        small   large   gladman

        p4-2            31.7    26.1      27.3
        p4-3            32.3    32.9      18.7
        p-m             23.8    23.3      16.9
        k8              21.8    21.5      18.1
        efficeon        25.1    22.6      17.8
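
For anyone wanting to reproduce figures in this unit, a small sketch of the arithmetic follows. It assumes you have a measured throughput in bytes per second and know the core clock; the function names are hypothetical, and any clock or throughput values you pass in are your own measurements, not numbers from this thread.

```c
/* Cycles spent per byte processed: clock rate divided by throughput. */
static double cycles_per_byte(double clock_hz, double bytes_per_sec)
{
    return clock_hz / bytes_per_sec;
}

/* How many times more cycles per byte implementation A needs than B
 * (values above 1.0 mean A is slower). */
static double relative_cost(double cpb_a, double cpb_b)
{
    return cpb_a / cpb_b;
}
```

Applied to the p4-3 row, relative_cost(32.9, 18.7) comes out at roughly 1.76.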

Once again I want to remind OpenSSL users that it's perfectly expected for OpenSSL code to perform slower than other assembler implementations. Unlike other implementations, aes-586.pl was explicitly designed for use in a shared-library context. This costs one extra register to be off-loaded to memory, which affects the performance. As most users are indirect gcc -fPIC users (there is no Linux vendor who would compile their libcrypto.so with Intel C), they should experience a +2x performance improvement over the previous version.


damn the p4 is a weird beast -- notice how the gladman code is better everywhere except p4-2 ... and p4-3 gladman is nearly twice as good as the
openssl code.

Yeah, the latter is really strange... They did fix shift and rotate in P4-3... PIC code must be getting bound to memory interface on P4, which is why OpenSSL gains nothing between P4-2 and P4-3...


i'm a bit disappointed with efficeon but i know what the problems are (efficeon lacks native bswap, so your "1%" estimation on the bswaps is more painful for efficeon, and the loop could be rotated differently).

Why? It exhibits approximately the same coefficient as P-M and K8, so that extra register spill can still stand for most of it and bswap for "1%"...
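
As a reminder of what the bswaps buy, here is a hedged C sketch of loading a big-endian 32-bit word, plus a shift-and-mask byte swap of the kind that has to stand in where no single-instruction bswap is available. The function names are hypothetical, and nothing here claims to describe how efficeon actually handles bswap.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored most significant first. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Reverse the byte order of a 32-bit word using only shifts and masks. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
```

On a little-endian machine, a plain 32-bit load followed by bswap32 gives the same value as load_be32, which is why the cost of the swap shows up once per input and output word rather than inside the rounds.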


fixing that is a more significant effort --

Not to mention that it can get constrained by possible FIPS certification: the code can end up carved in stone as big-endian.


o movzbl is 3 bytes shorter than and $imm:

... but slow on old Pentium, which is the actual reason why I pretty much avoided movz. I'll see if it's possible to avoid performance degradation on Pentium and commit whatever is appropriate...


o used a "gladman trick" on %edx in encode because it was easy
...
it's not easy to transform the other 3 registers in this way without major surgery around loop edges ... which will have to
wait for another rainy day.

Is there a reason to believe that it would *significantly* improve performance? I mean I see no point hunting "individual" percents... I realize it's fascinating, but let's keep trying to find the balance. Cheers. A.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]
Automated List Manager [EMAIL PROTECTED]