On Fri, 5 Jan 2007, Andy Polyakov wrote:

> > there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid bit
> > to distinguish between two implementations of rc4... unfortunately this
> > fails to properly distinguish the cpus.  all dual core cpus (intel or amd)
> > report HT support even if they don't use symmetric-multithreading like some
> > p4 do.
> 
> So HT flag is no longer HyperThreading, but something else... Will look into
> it... There is another place HTT flag is checked and it's AES...

yeah HT flag now basically means "multi-threading or multi-core
package"... because when amd/intel went dual core they didn't want silly
license managers to charge for every core.
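
for anyone following along, here is a minimal sketch of what the flag does and does not tell you. the bit positions (CPUID leaf 1, EDX bit 28 for HTT, EBX bits 23:16 for the logical processor ID count per package) come from the intel/amd CPUID definition; the function names and the register values are made up for illustration, and this is not the test that rc4_skey.c performs.

```python
# Decode the HTT-related fields of CPUID leaf 1 from raw register values.
# The function names are hypothetical; the register values used in the
# demo at the bottom are invented, not read from real hardware.

HTT_BIT = 28  # CPUID.1:EDX[28]


def htt_flag(edx: int) -> bool:
    """Return True if the HTT bit is set in CPUID.1:EDX."""
    return bool((edx >> HTT_BIT) & 1)


def logical_ids_per_package(ebx: int) -> int:
    """Return CPUID.1:EBX[23:16], the logical processor ID count per package.

    The field is only meaningful when the HTT bit is set.
    """
    return (ebx >> 16) & 0xFF


def looks_multi_logical(edx: int, ebx: int) -> bool:
    """HTT set and more than one logical processor ID in the package.

    This cannot distinguish SMT siblings from separate cores, which is
    why the flag is a poor way to choose between two code paths.
    """
    return htt_flag(edx) and logical_ids_per_package(ebx) > 1


if __name__ == "__main__":
    edx = 1 << 28  # invented value: only the HTT bit set
    ebx = 2 << 16  # invented value: two logical processor IDs
    print(htt_flag(edx), logical_ids_per_package(ebx),
          looks_multi_logical(edx, ebx))
```

a p4 with hyperthreading and a dual-core part without it can both produce the same answer here, which is the whole problem.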

hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe
i should be looking at the cvs?  i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.
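
in case the unit conversion is useful: openssl speed prints throughput in thousands of bytes per second, so cycles per byte is just clock rate divided by bytes per second. a small sketch follows; the 2.4 GHz clock is a hypothetical value (the actual clock of the test machine is not stated here) and the function names are invented.

```python
# Convert between "openssl speed" throughput figures and cycles per byte.
# The clock frequency is an assumption supplied by the caller.


def cycles_per_byte(kbytes_per_sec: float, clock_hz: float) -> float:
    """Cycles spent per byte, given throughput in 1000s of bytes/second."""
    bytes_per_sec = kbytes_per_sec * 1000.0
    if bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return clock_hz / bytes_per_sec


def kbytes_per_sec(cpb: float, clock_hz: float) -> float:
    """Throughput in 1000s of bytes/second implied by a cycles/byte figure."""
    if cpb <= 0:
        raise ValueError("cycles per byte must be positive")
    return clock_hz / cpb / 1000.0


if __name__ == "__main__":
    assumed_clock_hz = 2.4e9  # hypothetical clock, not taken from this thread
    # Throughput that 17.5 cycles/byte would correspond to at that clock.
    print(f"{kbytes_per_sec(17.5, assumed_clock_hz):.2f}k")
```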


> > it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> > w.r.t. these two rc4 implementations.  here are the core2 results with the
> > stock code / HT test:
> > 
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
> > 
> > and with cpuid test disabled:
> > 
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
> > 
> > for the record, the core2 64-bit code is seriously underperforming the
> > 32-bit code...  here are the 32-bit results (with cpuid test enabled):
> > 
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
> 
> ... The key feature of the 32-bit code with the cpuid test is that the
> corresponding loop is not unrolled. Can you test the following in a *64-bit*
> build on Core2 hardware? Open rc4-x86_64.pl in a text editor and make the jump
> to .Lcloop1 at line 154 unconditional, i.e. replace jz with jmp. Then make,
> benchmark and report back. A.

small improvement...

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             174197.47k   182564.34k   184536.23k   185292.63k   186258.77k
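
to put numbers on "small improvement", here is a quick sketch that compares the four rc4 runs in this thread against the stock 64-bit build. the figures are copied from the tables in this message; the labels and helper name are mine.

```python
# Compare the rc4 "openssl speed" results quoted in this thread.
# Values are in 1000s of bytes per second, one per block size.

SIZES = (16, 64, 256, 1024, 8192)

RESULTS = {
    "64-bit stock (cpuid test)": (166799.58, 180552.87, 182437.93, 183381.67, 183206.87),
    "64-bit cpuid test disabled": (123361.30, 128102.17, 129876.57, 128787.22, 129419.95),
    "32-bit (cpuid test)": (254164.64, 279901.10, 279364.38, 283617.62, 276690.26),
    "64-bit jz -> jmp": (174197.47, 182564.34, 184536.23, 185292.63, 186258.77),
}

BASELINE = "64-bit stock (cpuid test)"


def percent_change(new: float, base: float) -> float:
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


if __name__ == "__main__":
    base = RESULTS[BASELINE]
    print(f"{'build':<28}" + "".join(f"{s:>9}B" for s in SIZES))
    for name, row in RESULTS.items():
        deltas = (percent_change(n, b) for n, b in zip(row, base))
        print(f"{name:<28}" + "".join(f"{d:>+9.1f}%" for d in deltas))
```

at 8192 bytes the jmp patch comes out under 2% ahead of stock, while the 32-bit build is roughly 50% ahead.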

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.
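
to make the alias point concrete, here is plain rc4 in python (the textbook algorithm, not a transcription of rc4-x86_64.pl). each step stores to S[j], where j depends on the data, and the next step loads S[i+1]. when j happens to equal i+1, that load has to see the store that was just made, and an unrolled loop has to cope with that somehow. this is my reading of the problem, so treat the helper that counts those cases as an illustration only; all names here are invented.

```python
# Textbook RC4, written so the store-to-S[j] / load-from-S[i+1] dependency
# is visible. Not derived from the OpenSSL assembler.
from itertools import islice


def _prga(key: bytes):
    """Yield (i, j, keystream_byte) for each RC4 output step."""
    if not key:
        raise ValueError("key must not be empty")
    S = list(range(256))
    j = 0
    for i in range(256):  # key schedule
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    i = j = 0
    while True:
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]  # store to S[j]; j is data-dependent
        yield i, j, S[(S[i] + S[j]) & 0xFF]


def rc4_keystream(key: bytes, n: int) -> bytes:
    """First n keystream bytes for key."""
    return bytes(k for _, _, k in islice(_prga(key), n))


def store_hits_next_load(key: bytes, n: int) -> int:
    """Count steps where the store index j equals the next load index i+1."""
    return sum(1 for i, j, _ in islice(_prga(key), n) if j == ((i + 1) & 0xFF))


if __name__ == "__main__":
    print(rc4_keystream(b"Key", 9).hex())
    print(store_hits_next_load(b"Key", 65536))
```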

oh and i tried using cmove with no luck either.

bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case 
and it's still not performing like it does in 32-bit... maybe i screwed up 
though.

-dean
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           [EMAIL PROTECTED]
