Re: svn commit: r252032 - head/sys/amd64/include

Bruce Evans Sun, 23 Jun 2013 01:35:10 -0700

On Sun, 23 Jun 2013, Konstantin Belousov wrote:

On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote:

On Sat, 22 Jun 2013, I wrote:

...
Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                               4-7 cycles (1)
simple increment preceded by feature test:      5-8 cycles (1)
simple 32-bit increment:                        4-7 cycles (2)
correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
inlined critical section:                       8.5 cycles (4)
better inlined critical section:                7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles (6)
"improve" previous to allow immediate operand:  5+ cycles
correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:           8-9 cycles (8)
-current method (cmpxchg8b):                    18 cycles


corei7 (freefall) has about the same timing as Athlon64, but core2
(ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.

You only tested 32-bit, right?  Note that core2-class machines have
at least a one-cycle penalty for decoding any instruction with a REX prefix.


Yes, since the 64-bit case works more or less correctly.  I tested the
32-bit binary on 64-bit systems.
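
For illustration only, here is a rough C sketch of why the 32-bit case
needs more care than the 64-bit one.  The names counter64,
counter64_add32 and counter64_read are hypothetical and are not taken
from the tests summarized above.  The sketch keeps a 64-bit counter as
two 32-bit halves and propagates the carry by hand, which is roughly
what an addl/adcl pair does on i386.  It is not atomic: an interrupt
between the two stores can see a half-updated value, which is the
problem that critical sections and cmpxchg8b address.

```c
#include <stdint.h>

/* Hypothetical 64-bit counter stored as two 32-bit words. */
struct counter64 {
	uint32_t lo;
	uint32_t hi;
};

/* Add an unsigned 32-bit increment and carry into the high word. */
static void
counter64_add32(struct counter64 *c, uint32_t inc)
{
	uint32_t old = c->lo;

	c->lo = old + inc;	/* wraps modulo 2^32 */
	if (c->lo < old)	/* wrapped, so there was a carry */
		c->hi++;
}

/* Combine the halves; only safe when no update is in progress. */
static uint64_t
counter64_read(const struct counter64 *c)
{
	return (((uint64_t)c->hi << 32) | c->lo);
}
```

Starting from lo = 0xffffffff and hi = 0, one call to
counter64_add32(&c, 1) leaves lo = 0 and hi = 1.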

(4) The critical section method is quite fast when inlined.
(5) The critical section method is even faster when optimized.  This is
   what should be used if you don't want the complications for the
   daemon.


Oops, I forgot that critical sections are much slower in -current than
in my version.  They probably take 20-40 cycles for the best case, and
can't easily be tested in userland since they disable interrupts in
hardware.  My versions disable interrupts in software.

Critical sections do not disable interrupts.  Only a thread-local
counter is incremented.  Leaving the section could be complicated,
though.


Yes, as I noticed later, it was only an old version of FreeBSD that disabled
interrupts.

The critical section method (or disabling interrupts, which is probably
faster) only works for the non-SMP case, but old CPUs that don't have
cmpxchg8b probably don't support SMP.
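
A minimal userland model of the software critical section described
above, assuming a single CPU.  The struct thread_model and its fields
critnest, owepreempt and preemptions are hypothetical names for this
sketch, not the kernel's actual definitions.  Entering only bumps a
per-thread nesting count.  Leaving is the more complicated half,
because a preemption that was deferred while inside the section has to
be taken when the outermost section ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread state for this model. */
struct thread_model {
	int	critnest;	/* critical section nesting depth */
	bool	owepreempt;	/* a preemption was deferred */
	int	preemptions;	/* deferred preemptions taken at exit */
};

static void
crit_enter(struct thread_model *td)
{
	td->critnest++;		/* no interrupt disabling, only a count */
}

static void
crit_exit(struct thread_model *td)
{
	assert(td->critnest > 0);
	if (--td->critnest == 0 && td->owepreempt) {
		td->owepreempt = false;
		td->preemptions++;	/* a kernel would switch threads here */
	}
}

/* 64-bit increment protected by the critical section. */
static void
counter_add(struct thread_model *td, uint64_t *c, uint64_t inc)
{
	crit_enter(td);
	*c += inc;
	crit_exit(td);
}
```

In this model the common path costs one increment and one decrement of
critnest plus a test of owepreempt around the counter update.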

Further tests confirm that incl and incq are pipelined normally on at
least corei7 and core2.  In the loop test, freefall can do 4 independent
addq's to memory faster than it can do 1 :-).  It can do 6 independent
addq's to memory in the same time that it can do 1.  After that, the
loop overhead prevents getting the complete bandwidth of the memory
system.  However, 6 addq's to the same memory location take a little
more than 6 times longer than 1.  Multiple increments of the same counter
one after the other are probably rare, but the counter API makes it harder
to coalesce them if they occur, and the implementation using independent
asms ensures that the compiler cannot coalesce them.
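
A rough C sketch of the three cases discussed above.  The function
names are hypothetical, and volatile is used here only as a stand-in
for the independent asms: it forces each increment to be a separate
memory access, although the compiler may still emit load/add/store
sequences instead of a single add to memory.  The plain version shows
what the compiler is allowed to coalesce when nothing prevents it.

```c
#include <stdint.h>

/* Six increments per iteration, each to a different location. */
static void
inc_independent(volatile uint64_t *c, long iters)
{
	for (long i = 0; i < iters; i++) {
		c[0]++; c[1]++; c[2]++;
		c[3]++; c[4]++; c[5]++;
	}
}

/* Six increments per iteration, all to the same location.  Each one
 * must wait for the previous store, and none can be merged. */
static void
inc_same(volatile uint64_t *c, long iters)
{
	for (long i = 0; i < iters; i++) {
		(*c)++; (*c)++; (*c)++;
		(*c)++; (*c)++; (*c)++;
	}
}

/* Same arithmetic without volatile: the compiler may fold the whole
 * loop into a single addition of 6 * iters. */
static uint64_t
inc_plain(uint64_t c, long iters)
{
	for (long i = 0; i < iters; i++) {
		c++; c++; c++;
		c++; c++; c++;
	}
	return (c);
}
```

Timing inc_independent against inc_same is one way to see the
difference between throughput and same-location latency, with the
caveat that the results depend on the CPU and the compiler.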


I think that naive looping on Core i7+ to measure instruction latencies
simply does not work.  Starting with Nehalem, the hardware provides a loop
detector for the instruction queue inside the decoder, which essentially
makes short loops execute from their decoded micro-op form.  We never use
counters in tight loops.


The test results show it working perfectly in the test environment.
Any loop detection just does a better job of running all the loop
instructions in parallel with the instructions being timed.  But even
old CPUs can run all the loop instructions in parallel with high-latency
instructions like add-to-memory.  Also, add-to-memory instructions have
strict ordering requirements on x86.  The order must be load-modify-store.
That's for storing to a single location.  This prevents the loop running
in parallel with itself, so its latency timing works especially simply.

However, the test environment isn't normal.  It is important to understand
that some of the time is latency that doesn't matter in the normal
environment.  But in the complicated versions with cmpxchg8b's or critical
sections, there are lots of extra instructions and ordering requirements
whose latencies can't be hidden.

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"

Re: svn commit: r252032 - head/sys/amd64/include
