> as I had mixed up two loop rounds in the 256 bytes table 
> mode, in both cases (256 and 4k) the loop has 15 rounds 
> (plus a reduced round at the end).

In other words, it's effectively the number of instructions per *byte*.
Fair enough. You still haven't said which compiler it is.

> So I can use one 
> reduction modulo table for both modes, saving a separate 
> lookup-xor on the second nibble. With this optimization 
> these are the current instruction counts:
> 
> MMX_4K_LOOP:    14 instr + jmp
> MMX_256B_LOOP:  26 instr + jmp

You should refer to your loops as SSE2 loops, not MMX. MMX is something
that can be executed on Pentium MMX. Note that when I said MMX I meant
MMX, not SSE or SSE2:-)
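
For readers following along, here is a minimal plain-C sketch of the kind of
4-bit table multiplication in GF(2^128) being discussed. It is neither the
SSE2 code quoted above nor my MMX code. It assumes the "256 bytes table" means
16 entries of 16 bytes each (nibble times H) and that the "reduction modulo
table" is the 16-entry table for the four bits shifted out per step. All
names (u128, mul_x, init_tables, step, gf_mul_4bit) are made up for
illustration:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t hi, lo; } u128;  /* hi = bytes 0..7, big-endian */

/* Load 16 big-endian bytes into a u128. */
static u128 load_be(const uint8_t b[16])
{
    u128 v = {0, 0};
    for (int i = 0; i < 8; i++) {
        v.hi = (v.hi << 8) | b[i];
        v.lo = (v.lo << 8) | b[i + 8];
    }
    return v;
}

/* Multiply by x: in GCM's bit order that is a right shift, folding the
 * bit that falls off back in as 0xE1 at the top byte. */
static u128 mul_x(u128 v)
{
    uint64_t carry = v.lo & 1;
    v.lo = (v.hi << 63) | (v.lo >> 1);
    v.hi = (v.hi >> 1) ^ (carry ? 0xE100000000000000ULL : 0);
    return v;
}

/* Bit-at-a-time reference multiply, only there to check the table version. */
static u128 gf_mul_bitwise(u128 x, u128 h)
{
    u128 z = {0, 0};
    for (int i = 0; i < 128; i++) {
        uint64_t word = (i < 64) ? x.hi : x.lo;
        if ((word >> (63 - (i & 63))) & 1) {
            z.hi ^= h.hi;
            z.lo ^= h.lo;
        }
        h = mul_x(h);
    }
    return z;
}

/* Build the 16-entry (256-byte) table of nibble*H and the 16-entry
 * reduction table for the four bits shifted out per step. */
static void init_tables(u128 htab[16], uint16_t rem[16], u128 h)
{
    memset(htab, 0, 16 * sizeof(u128));
    htab[8] = h;                       /* nibble bit 8 is the lowest power */
    htab[4] = mul_x(htab[8]);
    htab[2] = mul_x(htab[4]);
    htab[1] = mul_x(htab[2]);
    for (int i = 2; i < 16; i <<= 1)
        for (int j = 1; j < i; j++) {
            htab[i + j].hi = htab[i].hi ^ htab[j].hi;
            htab[i + j].lo = htab[i].lo ^ htab[j].lo;
        }
    for (int i = 0; i < 16; i++) {
        rem[i] = 0;
        for (int j = 0; j < 4; j++)
            if (i & (1 << j))
                rem[i] ^= (uint16_t)(0xE100 >> (3 - j));
    }
}

/* One step: multiply z by x^4 via the reduction table, then add nib*H. */
static u128 step(u128 z, unsigned nib, const u128 htab[16],
                 const uint16_t rem[16])
{
    unsigned r = (unsigned)(z.lo & 0xf);
    z.lo = (z.hi << 60) | (z.lo >> 4);
    z.hi = (z.hi >> 4) ^ ((uint64_t)rem[r] << 48);
    z.hi ^= htab[nib].hi;
    z.lo ^= htab[nib].lo;
    return z;
}

/* X*H with the 256-byte table: one reduced round for the last byte
 * (its first lookup needs no shift), then 15 full rounds. */
static u128 gf_mul_4bit(const uint8_t x[16], const u128 htab[16],
                        const uint16_t rem[16])
{
    u128 z = htab[x[15] & 0xf];
    z = step(z, x[15] >> 4, htab, rem);
    for (int i = 14; i >= 0; i--) {
        z = step(z, x[i] & 0xf, htab, rem);
        z = step(z, x[i] >> 4, htab, rem);
    }
    return z;
}
```

In this sketch the reduced round comes first rather than last, which is only
a matter of how the loop is arranged. The same rem table serves both nibble
steps of a round, which is roughly the sharing the quoted text describes.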

> I32_4K_LOOP:    30 instr + jmp
> I32_256B_LOOP:  63 instr + jmp

> Having this in the I32 too, the SSE improvement has dropped 
> slightly below 50%. In the table below there are the execution 
> times for MMX enrolled, un-enrolled and I32 and the resulting 
> data throughput in MB/s. The entries with an * are computed with 
> the 4k table, as there is a threshold in the source when to use 
> the bigger table.
> 
> As I do not have the latest equipment, these values are calculated 
> under Win32 on an Irwindale Xeon at 3.2 GHz using the win calls
> QueryPerformanceCounter() / QueryPerformanceFrequency().

I.e. 32-bit code running on a P4-based core (I choose to refer to Intel
CPU family 15 as P4).

> Blk Byte  MXe µs  MXu µs  I32 µs  MXe MB/s MXu MB/s I32 MB/s
> ¯¯¯ ¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯
>   0    0      56      55      81
>   1   16     268     320     538      59.7     50.0     29.7
>  21  336    4738    6439    8455*     70.9     52.2     39.7

My results (collected with the help of a simple benchmark committed in
http://cvs.openssl.org/chngview?cn=19406) don't account for pre-computed
table setup, but given that setup time is less than a single
multiplication, the figures above don't seem very impressive... Indeed,
if we take the 21-block line (the last one using the 256-byte table), the
SSE2 loop processes a single byte in 3.2GHz/70MBps=45 cycles... My code
delivers 33 cycles per byte... The situation may be reversed on Core2, as
SSE2 performance on P4 is far from impressive. Do you have any
opportunity to test there?
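
The arithmetic above, as a small C sketch. The helper names cycles_per_byte
and mbytes_per_sec are made up for illustration, and MB is assumed to mean
10^6 bytes:

```c
/* Cycles spent per byte, given clock rate in Hz and throughput in bytes/s. */
static double cycles_per_byte(double clock_hz, double bytes_per_sec)
{
    return clock_hz / bytes_per_sec;
}

/* Inverse: throughput in MB/s (MB = 10^6 bytes) for a given cycles/byte. */
static double mbytes_per_sec(double clock_hz, double cpb)
{
    return clock_hz / cpb / 1e6;
}
```

With the figures quoted above, cycles_per_byte(3.2e9, 70.9e6) comes out at
roughly 45, and mbytes_per_sec(3.2e9, 33.0) at roughly 97, so 33 cycles per
byte would correspond to about 97 MB/s at the same clock.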

I'll publish my code shortly. BTW, I didn't even try to unroll the MMX
loop, as I don't think it would give me more than 5%... I don't understand
why it makes such a big difference in your case. Never trust the
compiler:-) A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
