>> Could you also collect and submit results for say 512 blocks?
> 
>  512  8192   83378* 120096* 173951*    98.3*    68.2*    47.1*
> 1024 16384  168677* 243279* 349622*    97.1*    67.3*    46.9*

I.e. >20KB of code, which is larger than the I-cache, processes one
byte in 32.5 cycles with a 4KB table. The OpenSSL module is 1.5KB and
processes one byte in 33 cycles with a 256B table.
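
For reference, a minimal sketch of how a cycles-per-byte figure can be
derived from a row like the ones quoted above. It rests on assumptions
the thread does not state: that the third column is elapsed time in
nanoseconds, that the sixth column is throughput in MB/s (10^6 bytes),
and that the benchmark machine runs at roughly 3.2GHz. The clock value
is picked only because it reproduces the 32.5 figure. The function
names are hypothetical.

```c
/* Throughput in MB/s (1 MB = 10^6 bytes) from a byte count and an
 * elapsed time in nanoseconds. Treating the table's third column as
 * nanoseconds is an assumption. */
double mb_per_s(double bytes, double elapsed_ns)
{
    /* bytes per ns equals GB/s; multiply by 1000 to get MB/s */
    return bytes / elapsed_ns * 1000.0;
}

/* Cycles spent per byte of input at a given core clock.
 * clock_hz is supplied by the caller; 3.2e9 is only an assumed value
 * for the machine in question. */
double cycles_per_byte(double clock_hz, double throughput_mb_per_s)
{
    return clock_hz / (throughput_mb_per_s * 1e6);
}
```

With those assumptions, mb_per_s(8192, 83378) comes to about 98.3 and
cycles_per_byte(3.2e9, 98.3) to about 32.5, matching the first row.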

>> You should refer to your loops as SSE2 loops, not MMX. MMX is something
>> that can be executed on Pentium MMX. Note that when I said MMX I meant
>> MMX, not SSE or SSE2:-)
> 
> It was not clear to me that you really meant MMX and not XMM (aka SSEn). 
> Seriously, I don't think that it makes sense nowadays to implement 
> an optimized version of a 128-bit algorithm on 64-bit registers when 
> the machine also has eight (or 16 in x64) 128-bit registers.

As already mentioned, programming for SSEn+1 is not a goal in itself;
all-round performance is. The only thing that could make me consider
SSE2 at this point is performance numbers from Core2, which would have
to be no worse than, say, 16 cycles per byte, preferably with a 256B
table. So could you *please* find a way to run your benchmark on Core2?
If you really can't, then as a last resort send me a binary compiled
*without* /MD (zip it first). Cheers. A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
