>> As already mentioned, programming for SSEn+1 is not a goal in itself;
>> all-round performance is. The only thing that can make me consider SSE2
>> at this point is performance numbers from Core2, which should be no
>> worse than, say, 16 cycles per byte, preferably with a 256B table. So
>> could you *please* find a way to run your benchmark on Core2? If you
>> really can't, as a last resort send a binary compiled *without* /MD to
>> me (zip it first).
> 
> To give you the best chance to test it on machines you are familiar with,
> I have built a small GhashBench.exe, which outputs results similar to the
> previous benchmarks.

Thanks. Two things. For me it printed 0.0 and 0.1 in the Ctr columns. I
didn't try to figure out why, but the MB/s columns appear sane, so I used
them to obtain cycles per byte. I was also running it in VMware, and the
values are not as monotonic. This is one of the reasons I picked just
the 20- and 512-block lines as representative of the 256B and 4KB tables.
They were not abnormal in comparison to the rest of the table, and it's
safe to assume that the corresponding table setup overhead is just a few
percent.

On 2.4GHz Core2 I get:

 Blks Bytes XMM Ctr I32 Ctr XMM MB/s I32 MB/s
 ---- ----- ------- ------- -------- --------
   20   320     0.0     0.0     70.8     48.0
  512  8192     0.0     0.0     88.6     69.9

Or 32 cycles per byte for the 256B table and 27 for the 4KB table.

On 2.0GHz AMD64 I get:

 Blks Bytes XMM Ctr I32 Ctr XMM MB/s I32 MB/s
 ---- ----- ------- ------- -------- --------
   20   320     0.1     0.1     63.1     44.5
  512  8192     0.0     0.1     83.2     58.6

Or virtually the same result as above: 32/24 cycles per byte...

I was expecting a bigger difference between Core2 and AMD64 than none,
because Core2's SSE2 *is* faster than AMD's...
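
For anyone who wants to redo the conversion, here is a minimal sketch of
how MB/s turns into cycles per byte. It assumes that "MB" means 10^6
bytes and that the nominal clock is the real one; both are assumptions,
and the function name is mine, not anything from GhashBench. The inputs
are the XMM MB/s values from the tables above, which appear to be the
column the quoted figures come from. Under these assumptions the results
land within a couple of cycles of the numbers given above.

```python
def cycles_per_byte(clock_hz, mb_per_s, bytes_per_mb=1_000_000):
    """Convert a throughput in MB/s into cycles per byte at a given clock.

    bytes_per_mb is an assumption: pass 1 << 20 if the benchmark
    reports binary megabytes instead of decimal ones.
    """
    return clock_hz / (mb_per_s * bytes_per_mb)


# XMM MB/s figures from the two tables above: (label, clock in Hz, MB/s).
runs = [
    ("Core2 2.4GHz,  20 blocks", 2.4e9, 70.8),
    ("Core2 2.4GHz, 512 blocks", 2.4e9, 88.6),
    ("AMD64 2.0GHz,  20 blocks", 2.0e9, 63.1),
    ("AMD64 2.0GHz, 512 blocks", 2.0e9, 83.2),
]

for label, hz, mbs in runs:
    print(f"{label}: {cycles_per_byte(hz, mbs):.1f} cycles/byte")
```

Since VMware timing is noisy anyway, the choice between decimal and
binary megabytes (about 5%) is within the error of the measurement.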

> I should note that the results include the whole GHASH processing, including
> the block loop, the xor between blocks, and an optional zero-padding of the
> last block if the input size is not a multiple of the block length.

My results are obtained as the difference between GCM and "vanilla"
counter mode, so they also cover the block loop and the xor between
blocks. They don't account for table setup or final calculations, *but*
they are obtained for 1KB input, or 64 blocks, so they shouldn't differ
from yours by more than a few percent.
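
To spell out the "difference" method: throughputs in MB/s cannot be
subtracted directly, so each one is first converted to cycles per byte
and only then subtracted. A minimal sketch follows. The function names
and the input numbers are hypothetical, chosen for illustration only;
they are not measurements. "MB" is again assumed to mean 10^6 bytes.

```python
def cycles_per_byte(clock_hz, mb_per_s, bytes_per_mb=1_000_000):
    """Convert a throughput in MB/s into cycles per byte."""
    return clock_hz / (mb_per_s * bytes_per_mb)


def ghash_cycles_per_byte(clock_hz, gcm_mb_per_s, ctr_mb_per_s):
    """Estimate GHASH cost as GCM cost minus plain counter-mode cost.

    Both throughputs must come from the same machine and input size,
    otherwise the subtraction is meaningless.
    """
    gcm = cycles_per_byte(clock_hz, gcm_mb_per_s)
    ctr = cycles_per_byte(clock_hz, ctr_mb_per_s)
    return gcm - ctr


# Hypothetical throughputs, for illustration only (not measurements).
clock_hz = 2.0e9
ctr_mb_per_s = 100.0  # counter mode alone
gcm_mb_per_s = 62.5   # counter mode plus GHASH

print(ghash_cycles_per_byte(clock_hz, gcm_mb_per_s, ctr_mb_per_s))
```

Whatever is common to both runs (block cipher, loop overhead of the
counter mode itself) cancels out, which is why only the GHASH part,
including its own block loop and xor, remains.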

On an additional note: the newly committed
http://cvs.openssl.org/rlog?f=openssl/crypto/modes/asm/ghash-x86_64.pl,
i.e. the 64-bit counterpart of ghash-x86.pl, was observed to process one
byte in 10.2 cycles on AMD64 and in 16.4 on Core2. Pure integer code,
256B table. I have no data for P4/EM64T, but the results won't be as
impressive, because its shifter is very sloooooooooooow.

Cheers. A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
