> as I had mixed up two loop rounds in the 256 bytes table
> mode, in both cases (256 and 4k) the loop has 15 rounds
> (plus a reduced round at the end).

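
For readers who want the loop structure spelled out, here is a minimal C sketch of a GHASH multiplication with a 256-byte table (16 entries of 16 bytes, indexed by nibble) and a single reduction table. It is an illustration only, not the code under discussion: all names (u128, htable, rem4, ghash_init_4bit, ghash_mul_4bit, ghash_mul_ref) are hypothetical, and it assumes the usual GCM bit ordering. It takes the short round first and then 15 full rounds of two nibbles each, which may differ from the ordering in the quoted loops. A bit-by-bit reference is included so the table version can be checked against it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit value: hi holds bytes 0..7, lo bytes 8..15 (big-endian). */
typedef struct { uint64_t hi, lo; } u128;

static u128 htable[16];     /* 16 entries x 16 bytes = 256 bytes */
static uint64_t rem4[16];   /* reduction values for a nibble shifted out of lo */

static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | p[i];
    return v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) { p[i] = (uint8_t)v; v >>= 8; }
}

/* Multiply by x: shift right one bit, fold a dropped bit back in with 0xE1||0^120. */
static u128 mul_x(u128 v)
{
    uint64_t carry = v.lo & 1;
    v.lo = (v.hi << 63) | (v.lo >> 1);
    v.hi = (v.hi >> 1) ^ (carry ? 0xE100000000000000ULL : 0);
    return v;
}

static void ghash_init_4bit(const uint8_t h[16])
{
    u128 v = { load_be64(h), load_be64(h + 8) };
    memset(htable, 0, sizeof htable);
    for (int i = 8; i >= 1; i >>= 1) {      /* entries 8,4,2,1 = H, H*x, H*x^2, H*x^3 */
        htable[i] = v;
        v = mul_x(v);
    }
    for (int i = 2; i <= 8; i <<= 1)        /* remaining entries by linearity */
        for (int j = 1; j < i; j++) {
            htable[i + j].hi = htable[i].hi ^ htable[j].hi;
            htable[i + j].lo = htable[i].lo ^ htable[j].lo;
        }
    for (int n = 0; n < 16; n++) {          /* bit k of the nibble is reduced, then shifted 3-k more times */
        rem4[n] = 0;
        for (int k = 0; k < 4; k++)
            if (n & (1 << k)) rem4[n] ^= 0xE100000000000000ULL >> (3 - k);
    }
}

/* One nibble step: Z = Z * x^4, reduced via rem4, then XOR the table entry. */
static void step(u128 *z, unsigned nibble)
{
    unsigned rem = (unsigned)(z->lo & 0xf);
    z->lo = (z->hi << 60) | (z->lo >> 4);
    z->hi = (z->hi >> 4) ^ rem4[rem];
    z->hi ^= htable[nibble].hi;
    z->lo ^= htable[nibble].lo;
}

/* x = x * H, using the tables prepared by ghash_init_4bit(). */
static void ghash_mul_4bit(uint8_t x[16])
{
    u128 z = htable[x[15] & 0xf];           /* short round: first nibble needs no shift */
    step(&z, x[15] >> 4);
    for (int i = 14; i >= 0; i--) {         /* 15 full rounds, two nibbles per round */
        step(&z, x[i] & 0xf);
        step(&z, x[i] >> 4);
    }
    store_be64(x, z.hi);
    store_be64(x + 8, z.lo);
}

/* Bit-by-bit reference multiplication, for checking only. */
static void ghash_mul_ref(uint8_t x[16], const uint8_t h[16])
{
    u128 z = { 0, 0 }, v = { load_be64(h), load_be64(h + 8) };
    for (int i = 0; i < 128; i++) {
        if ((x[i / 8] >> (7 - i % 8)) & 1) { z.hi ^= v.hi; z.lo ^= v.lo; }
        v = mul_x(v);
    }
    store_be64(x, z.hi);
    store_be64(x + 8, z.lo);
}
```

Since rem4 depends only on the nibble that is shifted out, the same table serves both nibbles of a byte, which is presumably what allows one reduction table to be shared between the modes.
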
In other words it's effectively the number of instructions per *byte*.
Fair enough. You didn't answer which compiler it is.

> So I can use one
> reduction modulo table for both modes, saving a separate
> lookup-xor on the second nibble. With this optimization
> these are the current instruction counts:
>
> MMX_4K_LOOP:   14 instr + jmp
> MMX_256B_LOOP: 26 instr + jmp

You should refer to your loops as SSE2 loops, not MMX. MMX is something
that can be executed on a Pentium MMX. Note that when I said MMX I meant
MMX, not SSE or SSE2:-)

> I32_4K_LOOP:   30 instr + jmp
> I32_256B_LOOP: 63 instr + jmp
>
> Having this in the I32 too, the SSE improvement has dropped
> slightly below 50%. In the table below there are the execution
> times for MMX enrolled, un-enrolled and I32 and the resulting
> data throughput in MB/s. The entries with an * are computed with
> the 4k table, as there is a threshold in the source when to use
> the bigger table.
>
> As I do not have the latest equipment, these values are calculated
> under Win32 on an Irwindale Xeon at 3,2 GHz using the win calls
> QueryPerformanceCounter() / QueryPerformanceFrequency().

I.e. 32-bit code running on a P4-based core (I choose to refer to Intel
CPU family 15 as P4).

> Blk Byte  MXe µs  MXu µs  I32 µs  MXe MB/s  MXu MB/s  I32 MB/s
> ¯¯¯ ¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯
>   0    0      56      55      81
>   1   16     268     320     538      59.7      50.0      29.7
>  21  336    4738    6439    8455*     70.9      52.2      39.7

My results (collected with the help of a simple benchmark committed in
http://cvs.openssl.org/chngview?cn=19406) don't account for pre-computed
table setup, but given that the setup time is less than a single
multiplication, the figures above don't seem very impressive... Indeed,
if we take the 21-block line (the last one with the 256-byte table), the
SSE2 loop processed a single byte in 3.2GHz/70MBps=45 cycles... My code
delivers 33 cycles per byte... The situation may be reversed on Core2,
as SSE2 performance on P4 is far from impressive. Don't you have any
opportunity to test on one?

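
The cycles-per-byte figure above is just the clock frequency divided by the throughput. A minimal sketch of that arithmetic follows; the function name cycles_per_byte is made up for illustration, and MB is assumed to mean 10^6 bytes so that it cancels against MHz.

```c
/* Hypothetical helper: cycles spent per byte processed.
 * clock_mhz       core clock in MHz (10^6 cycles per second)
 * mbytes_per_sec  throughput in MB/s, assuming 1 MB = 10^6 bytes
 * The two factors of 10^6 cancel, leaving cycles per byte. */
static double cycles_per_byte(double clock_mhz, double mbytes_per_sec)
{
    return clock_mhz / mbytes_per_sec;
}
```

For the 21-block line, cycles_per_byte(3200.0, 70.9) comes out at roughly 45.
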
I'll publish my code shortly. BTW, I didn't even try to unroll the MMX
loop; I don't think it would give me more than 5%... I don't understand
why the difference is so big in your case. Never trust the compiler:-)

A.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
