On Fri, 1 Apr 2005, Matthew Dillon wrote:

:>    The use of the XMM registers is a cpu optimization.  Modern CPUs,
:>    especially AMD Athlon and Opterons, are more efficient with 128 bit
:>    moves than with 64 bit moves.   I experimented with all sorts of
:>    configurations, including the use of special data caching instructions,
:>    but they had so many special cases and degenerate conditions that
:>    I found that simply using straight XMM instructions, reading as big
:>    a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)

This is in 25112.PDF section 5.16 ("Interleave Loads and Stores", with 128 bits of loads followed by 128 bits of stores).
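For concreteness, the pattern that section recommends (and the 2-pairs-of-movq variant described below) is roughly the following.  This is only a userland C sketch with a made-up function name, not the code that was actually benchmarked; the real thing would be asm so the scheduling isn't left to the compiler:

%%%
/*
 * Sketch of the "interleave loads and stores" pattern: 128 bits of loads
 * (2 x 64 bits, through integer registers) followed by 128 bits of stores.
 * Assumes non-overlapping buffers and len a multiple of 16.
 */
#include <stddef.h>
#include <stdint.h>

static void
copy_interleaved(uint64_t *dst, const uint64_t *src, size_t len)
{
        size_t i;

        for (i = 0; i < len / sizeof(uint64_t); i += 2) {
                uint64_t a = src[i];            /* 128 bits of loads... */
                uint64_t b = src[i + 1];

                dst[i] = a;                     /* ...then 128 bits of stores */
                dst[i + 1] = b;
        }
}
%%%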

:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).

I forgot many of my earlier conclusions when I wrote the above.  The speeds between 7.4GB/sec and 12.9GB/sec for the fully (L1) cached case are almost irrelevant.  They basically just tell how well we have used the instruction bandwidth.  Plain movsq uses it better and gets 15.9GB/sec.  I believe 15.9GB/sec is from saturating the L1 cache.  The CPU is an Athlon64 and its clock frequency is 1994 MHz, and I think the max L1 cache bandwidth is a 16-byte load plus store per cycle (an 8-byte load and an 8-byte store), i.e., 8 bytes copied per cycle; 8*1994*10^6 bytes/sec is 15.95GB/sec (disk manufacturers' GBs).

Plain movsq is best here for many other cases too...
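(By "plain movsq" I mean essentially a rep movsq loop.  Roughly the following, as a userland sketch with GCC-style inline asm and a made-up name, not the kernel's real bcopy:)

%%%
/*
 * Userland sketch of a plain movsq copy.  Assumes non-overlapping
 * buffers and len a multiple of 8.
 */
#include <stddef.h>

static void
copy_movsq(void *dst, const void *src, size_t len)
{
        size_t cnt = len / 8;

        __asm __volatile("rep; movsq"
            : "+D" (dst), "+S" (src), "+c" (cnt)
            :
            : "memory");
}
%%%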

:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

   Yah, I'm pretty sure.  I tested the fully cached (L1), partially
   cached (L2), and the fully uncached cases.   I don't have a logic

By the partially cached case, I meant the case where some of the source and/or target addresses are in the L1 or L2 cache, but you don't really know the chance that they are there (or should be there after the copy), so you can only guess the best strategy.

   analyzer but what I think is happening is that the cpu's write buffer
   is messing around with the reads and causing extra RAS cycles to occur.
   I also tested using various combinations of movdqa, movntdq, and
   prefetchnta.

Somehow I'm only seeing small variations from different strategies now, with all tests done in userland on an Athlon64 system (and on athlonXP systems for reference). Using XMM or MMX can be twice as fast on the AthlonXPs, but movsq is absolutely the fastest in many cases on the Athlon64, and is < 5% slower than the fastest in all cases (except for the fully uncached case since it can't do nontemporal stores), so it is the best general method.

...
   I also think there might be some odd instruction pipeline effects
   that skew the results when only one or two instructions are between
   the load into an %xmm register and the store from the same register.
   I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
   work the best.

I'm getting only small variations from different load/store patterns.
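(The pattern being described above looks roughly like the following, written with SSE2 intrinsics for readability.  It's only a sketch with a made-up name: the compiler allocates the registers, so it may not keep all 8 XMM registers live the way a hand-written asm loop does.)

%%%
/*
 * Sketch of a copy loop unrolled through 8 XMM registers: 8 aligned
 * 16-byte loads, then 8 aligned 16-byte stores.  Assumes 16-byte
 * aligned, non-overlapping buffers and len a multiple of 128.
 */
#include <emmintrin.h>
#include <stddef.h>

static void
copy_xmm8(void *dst, const void *src, size_t len)
{
        __m128i *d = dst;
        const __m128i *s = src;
        __m128i x0, x1, x2, x3, x4, x5, x6, x7;

        for (; len >= 128; len -= 128, s += 8, d += 8) {
                x0 = _mm_load_si128(s + 0);
                x1 = _mm_load_si128(s + 1);
                x2 = _mm_load_si128(s + 2);
                x3 = _mm_load_si128(s + 3);
                x4 = _mm_load_si128(s + 4);
                x5 = _mm_load_si128(s + 5);
                x6 = _mm_load_si128(s + 6);
                x7 = _mm_load_si128(s + 7);
                _mm_store_si128(d + 0, x0);
                _mm_store_si128(d + 1, x1);
                _mm_store_si128(d + 2, x2);
                _mm_store_si128(d + 3, x3);
                _mm_store_si128(d + 4, x4);
                _mm_store_si128(d + 5, x5);
                _mm_store_si128(d + 6, x6);
                _mm_store_si128(d + 7, x7);
        }
}
%%%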


   Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One of the first Athlon 64's, so it has a 1MB L2 cache).

My test system is very similar:

%%%
CPU: AMD Athlon(tm) 64 Processor 3400+ (1994.33-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8
  
Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
L1 2MB data TLB: 8 entries, fully associative
L1 2MB instruction TLB: 8 entries, fully associative
L1 4KB data TLB: 32 entries, fully associative
L1 4KB instruction TLB: 32 entries, fully associative
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 2MB unified TLB: 0 entries, disabled/not present
L2 4KB data TLB: 512 entries, 4-way associative
L2 4KB instruction TLB: 512 entries, 4-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative
%%%

   The prefetchnta I have commented out seemed to improve performance,
   but it requires 3dNOW and I didn't want to NOT have an MMX copy mode
   for cpu's with MMX but without 3dNOW.  Prefetching less than 128 bytes
   did not help, and prefetching greater than 128 bytes (e.g. 256(%esi))
   seemed to cause extra RAS cycles.  It was unbelievably finicky, not at
   all what I expected.

Prefetching is showing some very good effects here, but there are MD complications:

- the Athlon[32] optimization manual says that block prefetch is sometimes better than prefetchnta, and gives examples.  The reason is that you can schedule the block prefetch.
- alc@ and/or the Athlon64 optimization manual say that prefetchnta now works better.
- testing shows that prefetchnta does work better on my Athlon64 in some cases, but in the partially cached case (source in the L2 cache) it reduces the bandwidth by almost a factor of 2:

%%%
copyH: 2562223788 B/s ( 390253 us) (778523741 tsc) (movntps)
copyI: 1269129646 B/s ( 787875 us) (1571812294 tsc) (movntps with prefetchnta)
copyJ: 2513196704 B/s ( 397866 us) (793703852 tsc) (movntps with block prefetch)
copyN: 2562020272 B/s ( 390284 us) (778737276 tsc) (movntq)
copyO: 1279569209 B/s ( 781447 us) (1559037466 tsc) (movntq with prefetchnta)
copyP: 2561869298 B/s ( 390307 us) (778732346 tsc) (movntq with block prefetch)
%%%

The machine has PC2700 memory so we can hope for a copy bandwidth of
nearly 2.7GB/sec for repeatedly copying a buffer of size 160K as the
benchmark does, since the buffer should stay in the L2 cache.  We
actually get 2.5+GB/sec here and for all bzero benchmarks using movnt*,
but when we use prefetchnta we get about half this, and not much more
than for the fully uncached case (1.2GB/sec).
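(For reference, the copyI/copyO style of loop is roughly the following, written with SSE intrinsics; the names and the exact 128-byte prefetch distance are only illustrative, not necessarily what the benchmark program uses:)

%%%
/*
 * Sketch of a nontemporal (movntps) copy with prefetchnta issued some
 * distance ahead of the loads.  Assumes 16-byte aligned, non-overlapping
 * buffers and len a multiple of 64.
 */
#include <xmmintrin.h>
#include <stddef.h>

static void
copy_movntps_nta(float *dst, const float *src, size_t len)
{
        size_t i;
        __m128 x0, x1, x2, x3;

        for (i = 0; i < len / sizeof(float); i += 16) {
                _mm_prefetch((const char *)&src[i] + 128, _MM_HINT_NTA);
                x0 = _mm_load_ps(&src[i + 0]);
                x1 = _mm_load_ps(&src[i + 4]);
                x2 = _mm_load_ps(&src[i + 8]);
                x3 = _mm_load_ps(&src[i + 12]);
                _mm_stream_ps(&dst[i + 0], x0);
                _mm_stream_ps(&dst[i + 4], x1);
                _mm_stream_ps(&dst[i + 8], x2);
                _mm_stream_ps(&dst[i + 12], x3);
        }
        _mm_sfence();           /* order the nontemporal stores */
}
%%%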

The corresponding speeds for the fully uncached case (copying 1600K) are:

%%%
copyH: 1061395711 B/s ( 941613 us) (1879293692 tsc) (movntps)
copyI: 1246904647 B/s ( 801524 us) (1599118394 tsc) (movntps with prefetchnta)
copyJ: 1227740822 B/s ( 814035 us) (1624787631 tsc) (movntps with block prefetch)
copyN: 1049642023 B/s ( 952157 us) (1900292204 tsc) (movntq)
copyO: 1247088242 B/s ( 801406 us) (1598888249 tsc) (movntq with prefetchnta)
copyP: 1226714585 B/s ( 814716 us) (1625985669 tsc) (movntq with block prefetch)
%%%

For the fully uncached case, the speeds for simple copying methods are all
about 0.64GB/sec on this machine, and sophisticated methods that don't use
nontemporal writes only improve this to 0.68GB/sec.
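(The "block prefetch" variants are in the spirit of the technique in the Athlon optimization manual: touch each cache line of the next chunk of the source with an ordinary load to pull it into the cache, then copy that chunk with nontemporal stores.  A rough sketch only, with an arbitrary 4K block size and made-up names, not the benchmarked code:)

%%%
/*
 * Sketch of a block-prefetch copy.  Assumes 16-byte aligned,
 * non-overlapping buffers and len a multiple of BLOCK.
 */
#include <xmmintrin.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK   4096            /* arbitrary, not a tuned value */
#define LINE    64              /* cache line size on this CPU */

static void
copy_block_prefetch(void *dst, const void *src, size_t len)
{
        const char *s = src;
        char *d = dst;
        volatile uint64_t sink;
        size_t b, i;

        for (b = 0; b < len; b += BLOCK) {
                /* Pass 1: one read per cache line pulls the block in. */
                for (i = 0; i < BLOCK; i += LINE)
                        sink = *(const uint64_t *)(s + b + i);
                /* Pass 2: copy the now-cached block with movntps. */
                for (i = 0; i < BLOCK; i += 16) {
                        __m128 x = _mm_load_ps((const float *)(s + b + i));
                        _mm_stream_ps((float *)(d + b + i), x);
                }
        }
        _mm_sfence();           /* order the nontemporal stores */
}
%%%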

Bruce