> Could you also collect and submit results for say 512 blocks?

With a small change in the unrolled version (bswap-load four
source bytes at once and extract each with a >>= 8), the results
have changed slightly. (Although the results were taken from a
x1000 test loop, run 100 times with the outliers removed,
they still vary slightly between measurements.)
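
A minimal C sketch of that byte extraction, in case it helps to see it
outside the compiler output. The name extract4 is made up for this
illustration, and the little-endian load is spelled out by hand so the
sketch does not depend on the host byte order:

```c
#include <stdint.h>
#include <stddef.h>

/* Byte-swap a 32-bit word (what the x86 bswap instruction does). */
static uint32_t bswap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) |
           ((w << 8) & 0x00FF0000u) | (w << 24);
}

/* Hypothetical helper: load four source bytes at once, then peel them
 * off one by one from the low end of the swapped word. */
static void extract4(const uint8_t *src, uint8_t out[4])
{
    /* Little-endian 32-bit load, as a plain x86 mov from memory gives. */
    uint32_t w = (uint32_t)src[0] | ((uint32_t)src[1] << 8) |
                 ((uint32_t)src[2] << 16) | ((uint32_t)src[3] << 24);
    size_t i;

    w = bswap32(w);
    for (i = 0; i < 4; i++) {
        out[i] = (uint8_t)(w & 0xFFu);  /* corresponds to movzx reg, al */
        w >>= 8;                        /* corresponds to shr eax, 8   */
    }
}
```

With this order of operations the bytes come out as src[3], src[2],
src[1], src[0], i.e. the byte at the highest address is used first.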

Blks Bytes  XMe µs  XMu µs  I32 µs XMe MB/s XMu MB/s I32 MB/s
¯¯¯¯ ¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯
   0     0      50      55      70
   1    16     265     322     531     60.4     49.7     30.1
   2    32     481     609     968     66.5     52.5     33.1
   3    48     684     871    1455     70.2     55.1     33.0
   4    64     894    1242    1890     71.6     51.5     33.9
   5    80    1099    1543    2355     72.8     51.8     34.0
   6    96    1312    1827    2819     73.2     52.5     34.1
   7   112    1504    2105    3278     74.5     53.2     34.2
   8   128    1705    2363    3727     75.1     54.2     34.3
   9   144    1910    2625    4206     75.4     54.9     34.2
  10   160    2124    3034    4646     75.3     52.7     34.4
  11   176    2324    3161    5072*    75.7     55.7     34.7*
  12   192    2540    3663    5409*    75.6     52.4     35.5*
  13   208    2733    3893    5750*    76.1     53.4     36.2*
  14   224    2946    4151    6080*    76.0     54.0     36.8*
  15   240    3143    4531    6419*    76.4     53.0     37.4*
  16   256    3352    4938    6756*    76.4     51.8     37.9*
  17   272    3565    5224    7093*    76.3     52.1     38.3*
  18   288    3762    5454    7429*    76.6     52.8     38.8*
  19   304    3947    5730    7766*    77.0     53.1     39.1*
  20   320    4170    6053    8099*    76.7     52.9     39.5*
  21   336    4372    6329    8432*    76.9     53.1     39.8*
  22   352    4588    6616    8773*    76.7     53.2     40.1*
  23   368    4786    6843    9104*    76.9     53.8     40.4*
  24   384    4989    7026    9438*    77.0     54.7     40.7*
  25   400    5174*   6932*   9777*    77.3*    57.7*    40.9*
  26   416    5334*   7152*  10112*    78.0*    58.2*    41.1*
  27   432    5492*   7410*  10439*    78.7*    58.3*    41.4*
  28   448    5654*   7653*  10781*    79.2*    58.5*    41.6*
  29   464    5813*   7869*  11124*    79.8*    59.0*    41.7*
  30   480    5974*   8095*  11459*    80.3*    59.3*    41.9*
  31   496    6135*   8374*  11794*    80.8*    59.2*    42.1*
  32   512    6291*   8624*  12130*    81.4*    59.4*    42.2*
  64  1024   11407*  16054*  22868*    89.8*    63.8*    44.8*
 128  2048   21636*  31061*  44366*    94.7*    65.9*    46.2*
 256  4096   42108*  59644*  87391*    97.3*    68.7*    46.9*
 512  8192   83378* 120096* 173951*    98.3*    68.2*    47.1*
1024 16384  168677* 243279* 349622*    97.1*    67.3*    46.9*
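
For reference, the MB/s columns appear to be consistent with the byte
count times the 1000 loop iterations divided by the measured time, with
no correction for the 0-block overhead. A small sketch of that
arithmetic (the function name is made up, and the formula is inferred
from the figures above, not taken from the test program):

```c
/* Hypothetical helper: throughput in MB/s (10^6 bytes per second) for
 * `bytes` bytes hashed `loops` times in `usec` microseconds in total.
 * Bytes per microsecond is numerically the same as MB/s. */
static double throughput_mb_s(unsigned bytes, unsigned loops, double usec)
{
    return (double)bytes * (double)loops / usec;
}
```

For example, throughput_mb_s(16, 1000, 265.0) gives about 60.4, which
matches the XMe entry in the 1-block row.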

> In other words it's effectively number of instructions per *byte*. Fair
> enough. You failed to answer which compiler is it.

It is Visual Studio 2008 with cl.exe version 15.00.30729.01.

> You should refer to your loops as SSE2 loops, not MMX. MMX is something
> that can be executed on Pentium MMX. Note that when I said MMX I meant
> MMX, not SSE or SSE2:-)

It was not clear to me that you really meant MMX and not XMM (aka SSEn). 
Seriously, I don't think that it makes sense nowadays to implement 
an optimized version of a 128-bit algorithm on 64-bit registers when 
the machine also has eight (or 16 in x64) 128-bit registers. 

What is more, these eight registers can hold the 8 bit-shifted results
H[128],H[64],..H[2],H[1], so that each of the remaining 248 table entries
can be composed with one xor and one movdqa per 16-byte result. So the
XMM registers fit the purpose perfectly.
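
A rough C sketch of that table fill, assuming the eight single-bit
entries are already computed and relying on the table being linear over
xor. The u128 struct merely stands in for an XMM register, and the names
fill_table and base are invented for this illustration:

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;  /* stand-in for one XMM register */

static u128 xor128(u128 a, u128 b)
{
    u128 r;
    r.hi = a.hi ^ b.hi;
    r.lo = a.lo ^ b.lo;
    return r;
}

/* base[k] is the precomputed entry for index 1 << k,
 * i.e. H[1], H[2], ..., H[64], H[128]. */
static void fill_table(u128 table[256], const u128 base[8])
{
    unsigned i;

    table[0].hi = 0;
    table[0].lo = 0;
    for (i = 1; i < 256; i++) {
        unsigned low = i & (0u - i);  /* lowest set bit of i */
        unsigned k = 0;

        while ((1u << k) != low)
            k++;
        /* (i ^ low) < i, so that entry has already been written:
         * one xor and one 16-byte store per entry. */
        table[i] = xor128(table[i ^ low], base[k]);
    }
}
```

In the SSE2 version the eight base values would stay in registers, so
the xor takes both operands without an extra load of the base value.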

> I.e. 32-bit code running on P4-based core (I choose to refer to Intel
> CPU family 15 as P4).

My Irwindale seems to be very similar to what you call P4 (family 15).
(The registered ECC DP interface might slow down the Xeon slightly.)

> I'll publish my code shortly. BTW, I didn't even try to unroll MMX loop,
> I don't think it will give me more than 5%... I don't understand why
> it's so big difference in your case. Never trust the compiler:-)

This is mostly a result of the compiler's instruction reordering:
while it loads the next source operand bytes (which are then used to
index into the table), leading to a cache miss, it fills the cache
load time by completing the previous round. In the sample below I have
added a column showing which round each instruction belongs to. In the
interleaved code there are always at least three independent
instructions between a load from memory into a register and the use of
that register. Besides that, the parallel ALUs are idle less often.

0..3    mov     EAX,  DWORD PTR xmY[esp+4328]
-1      movdqa  xmm3, xmm0
-1      pinsrw  xmm3, edx, 7
-1      pxor    xmm1, xmm3
-1      pxor    xmm1, xmm2
-1      pextrw  ecx,  xmm1, 0
-1      and     ecx,  000000FFH
-1      movzx   EDX,  WORD PTR pwRed[ecx*2]
0..3    bswap   EAX
0       movzx   ecx,  al
-1      movdqa  xmm2, xmm1
-1      psrldq  xmm2, 1
0       shl     ecx,  4
0       movdqa  XMM1, XMMWORD PTR xmH[esp+ecx+4320]
-1      movdqa  xmm3, xmm0
-1      pinsrw  xmm3, EDX, 7
-1      pxor    xmm2, xmm3
0       pxor    xmm2, XMM1
0       pextrw  edx,  xmm2, 0
0       and     edx,  000000FFH
0       movzx   ECX,  WORD PTR pwRed[edx*2]
+1      shr     eax,  8
+1      movzx   edx,  al
0       movdqa  xmm1, xmm2
+1      shl     edx,  4
+1      movdqa  XMM2, XMMWORD PTR xmH[esp+edx+4320]
0       psrldq  xmm1, 1
0       movdqa  xmm3, xmm0
0       pinsrw  xmm3, ECX, 7
0       pxor    xmm1, xmm3
+1      pxor    xmm1, XMM2
+2      shr     eax,  8
0       movdqa  xmm2, xmm1
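
The interleaving in the listing above can be mimicked in C with a toy
round function. This is only an illustration of the scheduling idea, not
the actual GHASH code: the 64-bit state, the tables tab and red, and
both function names are made up, and an optimizing compiler is free to
reorder either variant anyway.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward form: each table load is consumed immediately. */
static uint64_t hash_plain(const uint64_t tab[256], const uint16_t red[256],
                           const uint8_t *src, size_t n)
{
    uint64_t z = 0;
    size_t i;

    for (i = 0; i < n; i++)
        z = (z >> 8) ^ ((uint64_t)red[z & 0xFF] << 48) ^ tab[src[i]];
    return z;
}

/* Interleaved form: the table load for round i+1 is issued before
 * round i is finished, so the load latency overlaps with useful work. */
static uint64_t hash_pipelined(const uint64_t tab[256], const uint16_t red[256],
                               const uint8_t *src, size_t n)
{
    uint64_t z = 0, cur, next;
    size_t i;

    if (n == 0)
        return z;
    next = tab[src[0]];                 /* load for round 0 */
    for (i = 0; i < n; i++) {
        cur = next;
        if (i + 1 < n)
            next = tab[src[i + 1]];     /* start the next round's load early */
        z = (z >> 8) ^ ((uint64_t)red[z & 0xFF] << 48) ^ cur;
    }
    return z;
}
```

Both functions compute the same value; only the position of the table
load relative to its use differs.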

OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
