> Could you also collect and submit results for say 512 blocks?

With a small change in the unrolled version (bswap-load four source bytes at once and extract each with a >>= 8), the results have changed slightly. (Though the results were taken from a x1000 test loop, run 100 times with removal of the outliers, they still vary slightly between measurements.)
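For clarity, here is a minimal sketch of that change, not the actual patch; pbSrc and ghash_byte_round() are placeholder names, and _byteswap_ulong() is the MSVC intrinsic that compiles to a bswap:

    #include <stdlib.h>                   /* _byteswap_ulong() on MSVC */

    /* placeholder for one 8-bit table-lookup round of the real code */
    void ghash_byte_round(unsigned b);

    void absorb4(const unsigned char *pbSrc)
    {
        /* one 32-bit load plus one bswap instead of four byte loads */
        unsigned long w = _byteswap_ulong(*(const unsigned long *)pbSrc);
        int i;

        for (i = 0; i < 4; i++) {
            ghash_byte_round(w & 0xFF);   /* process the current byte */
            w >>= 8;                      /* expose the next source byte */
        }
    }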
Blks  Bytes   XMe µs   XMu µs   I32 µs  XMe MB/s  XMu MB/s  I32 MB/s
¯¯¯¯  ¯¯¯¯¯  ¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯
   0      0       50       55       70
   1     16      265      322      531      60.4      49.7      30.1
   2     32      481      609      968      66.5      52.5      33.1
   3     48      684      871     1455      70.2      55.1      33.0
   4     64      894     1242     1890      71.6      51.5      33.9
   5     80     1099     1543     2355      72.8      51.8      34.0
   6     96     1312     1827     2819      73.2      52.5      34.1
   7    112     1504     2105     3278      74.5      53.2      34.2
   8    128     1705     2363     3727      75.1      54.2      34.3
   9    144     1910     2625     4206      75.4      54.9      34.2
  10    160     2124     3034     4646      75.3      52.7      34.4
  11    176     2324     3161    5072*      75.7      55.7     34.7*
  12    192     2540     3663    5409*      75.6      52.4     35.5*
  13    208     2733     3893    5750*      76.1      53.4     36.2*
  14    224     2946     4151    6080*      76.0      54.0     36.8*
  15    240     3143     4531    6419*      76.4      53.0     37.4*
  16    256     3352     4938    6756*      76.4      51.8     37.9*
  17    272     3565     5224    7093*      76.3      52.1     38.3*
  18    288     3762     5454    7429*      76.6      52.8     38.8*
  19    304     3947     5730    7766*      77.0      53.1     39.1*
  20    320     4170     6053    8099*      76.7      52.9     39.5*
  21    336     4372     6329    8432*      76.9      53.1     39.8*
  22    352     4588     6616    8773*      76.7      53.2     40.1*
  23    368     4786     6843    9104*      76.9      53.8     40.4*
  24    384     4989     7026    9438*      77.0      54.7     40.7*
  25    400    5174*    6932*    9777*     77.3*     57.7*     40.9*
  26    416    5334*    7152*   10112*     78.0*     58.2*     41.1*
  27    432    5492*    7410*   10439*     78.7*     58.3*     41.4*
  28    448    5654*    7653*   10781*     79.2*     58.5*     41.6*
  29    464    5813*    7869*   11124*     79.8*     59.0*     41.7*
  30    480    5974*    8095*   11459*     80.3*     59.3*     41.9*
  31    496    6135*    8374*   11794*     80.8*     59.2*     42.1*
  32    512    6291*    8624*   12130*     81.4*     59.4*     42.2*
  64   1024   11407*   16054*   22868*     89.8*     63.8*     44.8*
 128   2048   21636*   31061*   44366*     94.7*     65.9*     46.2*
 256   4096   42108*   59644*   87391*     97.3*     68.7*     46.9*
 512   8192   83378*  120096*  173951*     98.3*     68.2*     47.1*
1024  16384  168677*  243279*  349622*     97.1*     67.3*     46.9*

> In other words it's effectively number of instructions per *byte*. Fair
> enough. You failed to answer which compiler is it.

It is Visual Studio 2008 with cl.exe version 15.00.30729.01.

> You should refer to your loops as SSE2 loops, not MMX. MMX is something
> that can be executed on Pentium MMX. Note that when I said MMX I meant
> MMX, not SSE or SSE2:-)

It was not clear to me that you really meant MMX and not XMM (aka SSEn). Seriously, I don't think it makes sense nowadays to implement an optimized version of a 128-bit algorithm on 64-bit registers when the machine also has eight (or 16 in x64) 128-bit registers. Even better, those eight registers can hold the eight bit-shifted results H[128], H[64], ..., H[2], H[1], from which the remaining 248 table entries can be composed with one xor and one movdqa per 16-byte result (see the sketch at the end of this mail). So the XMM registers perfectly meet the targets.

> I.e. 32-bit code running on P4-based core (I choose to refer to Intel
> CPU family 15 as P4).

My Irwindale Xeon seems to be very similar to that family-15 P4:
http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Irwindale.22_.28standard-voltage.2C_90_nm.29
http://en.wikipedia.org/wiki/List_of_Intel_Pentium_4_microprocessors#Prescott_2M_.2890.C2.A0nm.29
(The registered ECC DP interface might slow down the Xeon slightly.)

> I'll publish my code shortly. BTW, I didn't even try to unroll MMX loop,
> I don't think it will give me more than 5%... I don't understand why
> it's so big difference in your case.

Never trust the compiler :-) The difference is mostly a result of the compiler's instruction re-ordering: while it loads the next source operand bytes (which will index into the table next) and possibly takes a cache miss, it fills the load latency by completing the previous round. In the sample below I have added a column showing to which round each instruction belongs.
In the mixed-up mode there is always a minimum of three unaffected instructions between a move from memory into a register and the first use of that register. Besides that, the parallel ALUs have fewer idle cycles.

  round  instruction
  0..3   mov     EAX, DWORD PTR xmY[esp+4328]
  -1     movdqa  xmm3, xmm0
  -1     pinsrw  xmm3, edx, 7
  -1     pxor    xmm1, xmm3
  -1     pxor    xmm1, xmm2
  -1     pextrw  ecx, xmm1, 0
  -1     and     ecx, 000000FFH
  -1     movzx   EDX, WORD PTR pwRed[ecx*2]
  0..3   bswap   EAX
  0      movzx   ecx, al
  -1     movdqa  xmm2, xmm1
  -1     psrldq  xmm2, 1
  0      shl     ecx, 4
  0      movdqa  XMM1, XMMWORD PTR xmH[esp+ecx+4320]
  -1     movdqa  xmm3, xmm0
  -1     pinsrw  xmm3, EDX, 7
  -1     pxor    xmm2, xmm3
  0      pxor    xmm2, XMM1
  0      pextrw  edx, xmm2, 0
  0      and     edx, 000000FFH
  0      movzx   ECX, WORD PTR pwRed[edx*2]
  +1     shr     eax, 8
  +1     movzx   edx, al
  0      movdqa  xmm1, xmm2
  +1     shl     edx, 4
  +1     movdqa  XMM2, XMMWORD PTR xmH[esp+edx+4320]
  0      psrldq  xmm1, 1
  0      movdqa  xmm3, xmm0
  0      pinsrw  xmm3, ECX, 7
  0      pxor    xmm1, xmm3
  +1     pxor    xmm1, XMM2
  +2     shr     eax, 8
  0      movdqa  xmm2, xmm1
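Coming back to the XMM table construction mentioned above, here is a minimal sketch of the "one xor and one movdqa per entry" composition. It is not the code used for the measurements; xmH[] and Hbit[] are placeholder names, and Hbit[i] is assumed to already hold the single-bit multiple H[2^i] with GCM's bit-order and reduction details taken care of:

    #include <emmintrin.h>                      /* SSE2 intrinsics */

    /* Hbit[i] = H[2^i], the eight bit-shifted multiples of H.
     * Every remaining entry of the 256-entry table is then one pxor
     * of two already-known entries plus one movdqa to store it. */
    void build_xmH(__m128i xmH[256], const __m128i Hbit[8])
    {
        int i, j;

        xmH[0] = _mm_setzero_si128();           /* 0 * H */
        for (i = 0; i < 8; i++) {
            int hi = 1 << i;                    /* 1, 2, 4, ..., 128 */
            xmH[hi] = Hbit[i];                  /* the precomputed entry */
            for (j = 1; j < hi; j++)            /* entries hi+1 .. 2*hi-1 */
                xmH[hi + j] = _mm_xor_si128(Hbit[i], xmH[j]);
        }
    }

With the eight Hbit values kept in XMM registers, the inner xor never has to reload the single-bit multiples from memory while the table is being built.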