On Fri, 5 Jan 2007, Andy Polyakov wrote:

> > there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid bit
> > to distinguish between two implementations of rc4... unfortunately this
> > fails to properly distinguish the cpus.  all dual core cpus (intel or amd)
> > report HT support even if they don't use symmetric-multithreading like some
> > p4 do.
>
> So HT flag is no longer HyperThreading, but something else... Will look into
> it... There is another place HTT flag is checked and it's AES...
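For reference, the kind of check being discussed can be sketched in C as below. This is a minimal illustration, not the code in rc4_skey.c: the helper names and the two-variant enum are hypothetical, and the caller is assumed to supply the EDX value returned by CPUID leaf 1. The only fact relied on is that HTT is bit 28 of that register.

```c
#include <stdint.h>

/* HTT is bit 28 of EDX as returned by CPUID leaf 1. */
#define CPUID1_EDX_HTT (UINT32_C(1) << 28)

/* Hypothetical labels for the two rc4 code paths mentioned in the thread. */
enum rc4_variant {
    RC4_VARIANT_HTT_CLEAR = 0,
    RC4_VARIANT_HTT_SET   = 1
};

/* Hypothetical helper: returns 1 when the HTT bit is set in cpuid1_edx. */
static int htt_flag_set(uint32_t cpuid1_edx)
{
    return (cpuid1_edx & CPUID1_EDX_HTT) != 0;
}

/* Hypothetical selector: picks a variant from the HTT bit alone.
 * Assumption: the caller has already executed CPUID leaf 1 and passes EDX. */
static enum rc4_variant pick_rc4_variant(uint32_t cpuid1_edx)
{
    return htt_flag_set(cpuid1_edx) ? RC4_VARIANT_HTT_SET
                                    : RC4_VARIANT_HTT_CLEAR;
}
```

Since dual core parts report the bit as set whether or not they do multithreading within a core, a selector of this shape cannot tell such a cpu apart from a hyperthreaded p4, which is the failure described above.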

yeah HT flag now basically means "multi-threading or multi-core package"...
because when amd/intel went dual core they didn't want silly license managers
to charge for every core.

hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe i
should be looking at the cvs?  i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.

> > it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> > w.r.t. these two rc4 implementations.  here are the core2 results with the
> > stock code / HT test:
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
> >
> > and with cpuid test disabled:
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
> >
> > for the record, core2 64-bit code seriously underperforming the 32-bit
> > code... here's the 32-bit results (with cpuid test enabled):
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4             254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
> > ...
>
> The key feature in 32-bit code with cpuid test is that corresponding loop
> is not unrolled. Can you test following in *64-bit* build on Core2 hardware.
> Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at line 154
> unconditional, i.e. replace jz to jmp. make, benchmark and report back. A.

small improvement...

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             174197.47k   182564.34k   184536.23k   185292.63k   186258.77k

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems the
32-bit code pushes the alias problem onto the hardware.

oh and i tried using cmove with no luck either.  bizarre...
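
To make the alias remark concrete, here is a plain C sketch of the textbook rc4 key setup and byte loop. It is not the assembler in rc4-x86_64.pl, and the struct and function names are hypothetical. It only shows where a store can land on a table entry that the following iteration is about to load, which is one reading of why an unrolled loop needs alias handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state type for this sketch (not OpenSSL's RC4_KEY). */
typedef struct {
    uint8_t S[256];
    uint8_t i, j;
} rc4_state;

/* Textbook rc4 key schedule. Assumes keylen is between 1 and 256. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t keylen)
{
    unsigned k;
    uint8_t j = 0;

    for (k = 0; k < 256; k++)
        st->S[k] = (uint8_t)k;
    for (k = 0; k < 256; k++) {
        uint8_t t = st->S[k];
        j = (uint8_t)(j + t + key[k % keylen]);
        st->S[k] = st->S[j];        /* swap S[k] and S[j] */
        st->S[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* One-byte-per-iteration loop, not unrolled. in and out may be the same buffer. */
static void rc4_crypt(rc4_state *st, const uint8_t *in, uint8_t *out, size_t len)
{
    uint8_t i = st->i, j = st->j;
    size_t n;

    for (n = 0; n < len; n++) {
        uint8_t a, b;
        i = (uint8_t)(i + 1);
        a = st->S[i];               /* load S[i] */
        j = (uint8_t)(j + a);
        b = st->S[j];               /* load S[j] */
        st->S[i] = b;               /* store */
        st->S[j] = a;               /* store: when j == i+1 this writes the very
                                       entry the next iteration loads as S[i] */
        out[n] = in[n] ^ st->S[(uint8_t)(a + b)];
    }
    st->i = i;
    st->j = j;
}
```

An unrolled version that issues the next S[i] load before this store completes has to handle the j == i+1 case itself, whereas a loop of the shape above leaves the ordering to the hardware.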

i think i copied the 32-bit code into the 64-bit Lcloop1 case and it's still
not performing like it does in 32-bit... maybe i screwed up though.

-dean
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                                [EMAIL PROTECTED]