On Tue, 28 Dec 2004, Andy Polyakov wrote:

> > > aes-586.pl module is committed to CVS now [see
> > > http://cvs.openssl.org/rlog?f=openssl/crypto/aes/asm/aes-586.pl]. Take
> > > "Special note about instruction choice" in commentary section for
> > > consideration even for AMD64. Merry Christmas to everybody:-) A.
> >
> > hmmm... i seem to have done better by switching back to scaling :)
>
> H-m-m-m... It's not like I just wrote the note off the top of my head... I
> actually benchmarked a 9% improvement with off-by-2 shifts on the P4
> workstation at *my* disposal... Two possibilities: 1) they've changed
> something between steppings and we have different steppings, 2) I
> benchmarked an intermediate version [most likely prior to folding the
> reference to %esi in the last steps]... This is exactly the kind of thing
> I hate about x86: it's virtually impossible to figure out in advance how
> it turns out in the very end... Well, I suppose I have to beat the
> retreat:-) Which leaves open the question of why the RC4_INT code was
> performing so poorly on P4...
yeah i was meaning to go back and re-evaluate the RC4_INT case -- there's
one thing i know that's really specific to rc4 and isn't a factor in AES:
aliasing. iirc the rc4 loop has two table lookups and one table update, and
there's a 1 in 256 probability of each lookup aliasing with an earlier
update -- yielding about a 1 in 128 probability of a store-forwarding
violation, because the processor has speculated the loads ahead of the
store. p4 really hates that stuff -- and the newer the p4 the more it hates
it (i.e. the longer the pipeline, the larger the penalty for the
violation).

btw i might have a p4 like yours -- send me family/model/stepping... it's
eax from cpuid level 1, or in /proc/cpuinfo on linux. i know i've got a
model 0 here and i might have a model 1.

> > with the patch below i'm getting the following throughput improvements
> > for aes-128-cbc with an 8192 byte buffer:
> >
> >            patch delta
> >
> > p4-2       + 3.8%
>
> I hardly get +2%, so most likely both 1) and 2) apply...

hmmm... maybe i should review a few of the changes then -- there was some
point where i had +10% on my p4-2... but i traded it for improvements on
p4-3/p-m/k8 because those are going to be the common cores for the next few
years.

> > p4-3       +11%
> > p-m        + 8.8%
> > k8         +12%
> > efficeon   + 4.3%
>
> ... and on Pentium I get -10%:-( Of course one can argue that targeting an
> out-of-date implementation is nothing but ridiculous:-) PIII gains ~4%...

i've got lots of p3s, i was really hoping p-m would do the trick, i wasn't
sure they changed anything related to what i was doing.
> > here's an updated comparison versus the gladman code -- this is in
> > cycles/byte for an 8192 byte buffer (smaller is better):
> >
> >            openssl w/patch
> >            small   large   gladman
> >
> > p4-2       31.7    26.1    27.3
> > p4-3       32.3    32.9    18.7
> > p-m        23.8    23.3    16.9
> > k8         21.8    21.5    18.1
> > efficeon   25.1    22.6    17.8
>
> Once again I want to remind OpenSSL users that it's perfectly expected
> that OpenSSL code performs slower than other assembler implementations.
> Unlike other implementations, aes-586.pl was explicitly designed for use
> in a shared library context. This costs one extra register off-loaded to
> memory, which affects the performance. As most users are indirect
> gcc -fPIC users (there is no Linux vendor who would compile their
> libcrypto.so with Intel C), they should experience a +2x performance
> improvement over the previous version.

ya sorry i meant to include a larger disclaimer here... i meant to say
something along the lines of "large vs. gladman is the fairest comparison,
and gladman may be unattainable in PIC code, but that seems more of a
challenge than anything" :)

> > i'm a bit disappointed with efficeon but i know what the problems are
> > (efficeon lacks native bswap, so your "1%" estimation on the bswaps is
> > more painful for efficeon, and the loop could be rotated differently).
>
> Why? It exhibits approximately the same coefficient as P-M and K8, so that
> extra register spill can still stand for most of it and bswap for "1%"...

perf data shows efficeon spending at least 2% of its time in the rotates it
has to use to emulate bswap -- this could be an artifact of "openssl speed
aes-128-cbc" though. i.e. the heavier the weighting is towards 16 and 64
byte messages, the more bswap occurs.

> > fixing that is a more significant effort --
>
> Not to mention that it can get constrained by possible FIPS certification:
> the code can end up carved in stone as big-endian.
aha this i know nothing about -- where can i read up on the FIPS
certification restrictions / process?

> > o used a "gladman trick" on %edx in encode because it was easy ...
>
> Is there reason to believe that it would *significantly* improve
> performance? I mean I see no point hunting "individual" percents... I
> realize it's fascinating, but let's keep trying to find the balance.
> Cheers. A.

oh i dunno, it's mostly the intellectual exercise of trying to accomplish
what the fastest AES implementations accomplish, but in PIC code that is
balanced across a reasonable set of processors. going forward the most
common intel cores will be dual-core descendants of p-m, and amd cores will
be dual-core descendants of k8 ... by my data there's 27% and 16% to be had
respectively on those cores. but then there's something to be said for your
original code, because it was fairly accessible (for someone who knows AES)
before i started mucking it up :)

-dean

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                            [EMAIL PROTECTED]