On Tue, 28 Dec 2004, Andy Polyakov wrote:

> > > The aes-586.pl module is committed to CVS now [see
> > > http://cvs.openssl.org/rlog?f=openssl/crypto/aes/asm/aes-586.pl]. Take the
> > > "Special note about instruction choice" in the commentary section into
> > > consideration even for AMD64. Merry Christmas to everybody:-) A.
> > 
> > hmmm... i seem to have done better by switching back to scaling :)
> 
> H-m-m-m... It's not like I just wrote the note off the top of my head... I
> actually benchmarked a 9% improvement with off-by-2 shifts on the P4
> workstation at *my* disposal... Two possibilities: 1) they've changed
> something between steppings and we have different steppings, 2) I benchmarked
> an intermediate version [most likely prior to folding the reference to %esi
> in the last steps]... This is exactly the kind of thing I hate about x86:
> it's virtually impossible to figure out in advance how it will turn out in
> the end... Well, I suppose I have to beat the retreat:-) Which leaves the
> question of why the RC4_INT code was performing so poorly on P4 open...

yeah, i was meaning to go back and re-evaluate the RC4_INT case -- there's 
one thing i know of that's really specific to rc4 and isn't a factor in AES:  
aliasing.  iirc the rc4 loop has two table lookups and one table update, 
and there's a 1 in 256 probability of each lookup aliasing with an earlier 
update -- yielding roughly a 1 in 128 probability per iteration of a 
store-forwarding violation, because the processor has speculated the loads 
ahead of the store.

p4 really hates that stuff -- and the newer the p4, the more it hates it 
(i.e. the longer the pipeline, the larger the penalty for the violation).
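
to make that concrete, here's a from-memory C sketch of the rc4 inner loop
i'm describing -- not openssl's actual code, and glossing over the RC4_INT
width question entirely, just the memory traffic:

/* per iteration: lookups and updates into the same 256-entry state.
 * a lookup lands on a recently-updated slot with probability on the order
 * of 1/256 per lookup, i.e. roughly 1/128 per iteration; when the core has
 * speculated that load ahead of the store it has to replay, and the longer
 * the p4 pipeline the more that costs. */
#include <stddef.h>

void rc4_prga_sketch(unsigned char S[256], unsigned char *x,
                     unsigned char *y, unsigned char *buf, size_t len)
{
    unsigned char i = *x, j = *y;

    while (len--) {
        unsigned char a, b;

        i = (unsigned char)(i + 1);
        a = S[i];                             /* lookup */
        j = (unsigned char)(j + a);
        b = S[j];                             /* lookup -- may alias an earlier update */
        S[i] = b;                             /* update */
        S[j] = a;                             /* update */
        *buf++ ^= S[(unsigned char)(a + b)];  /* keystream byte */
    }
    *x = i;
    *y = j;
}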

btw i might have a p4 like yours -- send me family/model/stepping... it's 
in eax from cpuid level 1, or in /proc/cpuinfo on linux.  i know i've got a 
model 0 here and i might have a model 1.
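
in case it's handy, this is all i mean by "eax from cpuid level 1" -- a quick
sketch using gcc's cpuid.h, decoding per the usual intel layout:

/* minimal sketch: family/model/stepping live in eax of cpuid leaf 1.
 * bits 3:0 stepping, 7:4 model, 11:8 family, 19:16 ext. model,
 * 27:20 ext. family (the extended fields matter for family 0xf, i.e. p4). */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    unsigned int stepping, model, family;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                       /* no cpuid leaf 1 */

    stepping = eax & 0xf;
    model    = (eax >> 4) & 0xf;
    family   = (eax >> 8) & 0xf;

    if (family == 0xf)                  /* p4: extended family adds on */
        family += (eax >> 20) & 0xff;
    if (family == 0x6 || family >= 0xf) /* extended model extends the model */
        model |= ((eax >> 16) & 0xf) << 4;

    printf("family %u model %u stepping %u (raw eax 0x%08x)\n",
           family, model, stepping, eax);
    return 0;
}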


> > with the patch below i'm getting the following throughput improvements for
> > aes-128-cbc 8192B buffer:
> > 
> >                      patch delta
> > 
> >         p4-2            + 3.8%
> I hardly get +2%, so most likely both 1) and 2) apply...

hmmmm... maybe i should review a few of the changes then -- there was some 
point where i had +10% on my p4-2... but i traded it for improvements on 
p4-3/p-m/k8 because those are going to be the common cores for the next 
few years.

> >         p4-3            +11%
> >         p-m             + 8.8%
> >         k8              +12%
> >         efficeon        + 4.3%
> 
> ... and on Pentium I get -10%:-( Of course one can argue that targeting an
> out-of-date implementation is nothing but ridiculous:-) PIII gains ~4%...

i've got lots of p3s; i was really hoping p-m would do the trick, but i 
wasn't sure they'd changed anything related to what i was doing.



> > here's an updated comparison versus the gladman code -- this is in
> > cycles/byte for 8192 byte buffer (smaller is better):
> > 
> >                    openssl w/patch
> >                     small   large   gladman
> > 
> >     p4-2            31.7    26.1      27.3
> >     p4-3            32.3    32.9      18.7
> >     p-m             23.8    23.3      16.9
> >     k8              21.8    21.5      18.1
> >     efficeon        25.1    22.6      17.8
> 
> Once again I want to remind OpenSSL users that it's perfectly expected that
> OpenSSL code performs slower than other assembler implementations. Unlike
> those implementations, aes-586.pl was explicitly designed for use in a
> shared-library context. This costs one extra register, which has to be
> off-loaded to memory, and that affects the performance. As most users are
> indirect gcc -fPIC users (there is no Linux vendor who would compile their
> libcrypto.so with Intel C), they should experience a +2x performance
> improvement over the previous version.

ya sorry, i meant to include a larger disclaimer here... something along 
the lines of "large vs. gladman is the fairest comparison, and gladman may 
be unattainable in PIC code, but that seems more of a challenge than 
anything" :)
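
(for anyone following along: the shared-library cost Andy is describing is
just the usual 32-bit PIC tax -- a register, conventionally %ebx, gets tied
up as the GOT/PIC base, so reaching a lookup table takes an extra register
and an extra load.  a trivial illustration, nothing to do with aes-586.pl's
actual code -- "Te0_demo" is a made-up stand-in for one of the AES tables:)

/* compile once with "gcc -m32 -O2 -S lookup.c" and once with
 * "gcc -m32 -O2 -fPIC -S lookup.c" and compare: the non-PIC version can put
 * the table's absolute address straight into the memory operand, while the
 * PIC version first has to materialize a base register and go through it
 * (e.g. via the GOT) to reach the table. */
extern const unsigned int Te0_demo[256];

unsigned int lookup_demo(unsigned char i)
{
    return Te0_demo[i];
}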


> > i'm a bit disappointed with efficeon but i know what the problems are
> > (efficeon lacks native bswap, so your "1%" estimation on the bswaps is more
> > painful for efficeon, and the loop could be rotated differently).
> 
> Why? It exhibits approximately the same coefficient as P-M and K8, so that
> extra register spill can still account for most of it, and bswap for the "1%"...

perf data shows efficeon spending at least 2% of its time in the rotates 
it has to use to emulate bswap -- though this could be an artifact of 
"openssl speed aes-128-cbc", i.e. the heavier the weighting towards 16- 
and 64-byte messages, the more often bswap occurs.
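
(for reference, "the rotates" means synthesizing the byte swap by hand --
roughly the classic rol-by-8 / rol-by-16 / rol-by-8 sequence, i.e. something
equivalent to this C sketch, versus a single bswap instruction on p4/p-m/k8:)

#include <stdint.h>

/* swap the low halfword's bytes, rotate the word by 16, swap again --
 * what a bswap-less core ends up doing where the others use one bswap. */
uint32_t bswap32_by_hand(uint32_t x)
{
    x = (x & 0xffff0000u) | ((x & 0x000000ffu) << 8) | ((x & 0x0000ff00u) >> 8);
    x = (x << 16) | (x >> 16);
    x = (x & 0xffff0000u) | ((x & 0x000000ffu) << 8) | ((x & 0x0000ff00u) >> 8);
    return x;
}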


> > fixing that is a more significant effort --
> 
> Not to mention that it can get constrained by possible FIPS certification: the
> code can end up carved in stone as big-endian.

aha this i know nothing about -- where can i read up on FIPS certification 
restrictions / process?


> > o   used a "gladman trick" on %edx in encode because it was easy
...
> 
> Is there reason to believe that it would *significantly* improve performance?
> I mean, I see no point in hunting "individual" percents... I realize it's
> fascinating, but let's keep trying to find the balance. Cheers. A.

oh i dunno, it's mostly the intellectual exercise of trying to accomplish 
what the fastest AES implementations accomplish, but in PIC code that's 
balanced across a reasonable set of processors.

going forward the most common intel cores will be dual-core descendants of 
p-m, and amd cores will be dual-core descendants of k8 ...  by my data 
there's 27% and 16% to be had on those cores, respectively.

but then there's something to be said for your original code because it 
was fairly accessible (for someone who knows AES) before i started mucking 
it up :)

-dean
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           [EMAIL PROTECTED]
