On Fri, 6 Jul 2018, John Baldwin wrote:

> On 7/6/18 8:52 AM, Rodney W. Grimes wrote:
>> ...
>> Trivial to fix this with
>> +#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
>
> This is not worth it.  Note that we already use LOCK always in userland
> which is probably far more prevalent than the use in modules.
>
> Previously atomics in modules were _function calls_ just to avoid the LOCK.
> Having the LOCK prefix present even on UP is probably far more efficient
> than a function call.

No, the lock prefix is less efficient.
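
For context, the condition quoted above only decides whether a "lock; "
string gets pasted in front of the instruction in the asm template.  A
minimal sketch of that selection, assuming a hypothetical macro name
SKETCH_LOCK_PREFIX and helper incl_template() (neither is claimed to be
the name used in atomic.h), with KLD_UP_MODULES taken as-is from the
quoted proposal:

```c
#include <stdio.h>

/*
 * Condition copied from the quoted proposal.  SKETCH_LOCK_PREFIX is a
 * hypothetical name used only for this illustration.
 */
#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || \
    !defined(KLD_UP_MODULES)
#define	SKETCH_LOCK_PREFIX	"lock; "
#else
#define	SKETCH_LOCK_PREFIX	""
#endif

/* Returns the asm template an atomic increment would be built from. */
static const char *
incl_template(void)
{
	/* Adjacent string literals are concatenated at compile time. */
	return (SKETCH_LOCK_PREFIX "incl %0");
}

#ifdef SKETCH_MAIN	/* hypothetical guard, so the helper can be reused */
int
main(void)
{
	puts(incl_template());
	return (0);
}
#endif
```

Built in userland with -DSKETCH_MAIN and nothing else defined, _KERNEL is
undefined, so the condition is true and the locked form is selected.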

IIRC, on very old systems (~PPro), lock prefixes cost 20 cycles in the UP
case.  On AthlonXP, they cost about 19 cycles, but function calls (written
in C) only cost about 6 cycles.  This depends on pipelining, and my
test is perhaps too simple since it uses a loop where the pipelining
works especially well (it executes 2 or 3 function calls in parallel).

Actual timing on AthlonXP UP:
- asm loop: 2 cycles/iteration
- "incl mem" in asm loop: 5.85 cycles (but with less alignment, only 3.25
  cycles)
- "lock; incl mem" in asm loop: 18.9 cycles
- function call in C loop to C function doing "incl mem" in asm: 8.35 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 24.95
  cycles.

Newer CPUs have better pipelining.  On Haswell, this gives the strange
behaviour that the function call written in C is slightly faster than
inline code written in asm:

Actual timing on Haswell SMP:
- asm loop: 1.16 cycles/iteration
- "incl mem" in asm loop: 6.95 cycles
- "lock; incl mem" in asm loop: 19.00 cycles
- function call in C loop to C function doing "incl mem" in asm: 6 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 26.00
  cycles.
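
Subtracting the matching unlocked row from each locked row gives a rough
per-operation cost for the prefix.  A small sketch of that arithmetic,
using only the numbers listed above.  The struct and function names are
made up for the illustration, and treating the differences as "the cost
of the prefix" is an assumption, since with pipelining the rows are not
strictly additive:

```c
#include <stdio.h>

/* One set of measurements, in cycles per iteration, copied from the lists. */
struct timing {
	const char *cpu;
	double inline_plain;	/* "incl mem" in asm loop */
	double inline_locked;	/* "lock; incl mem" in asm loop */
	double call_plain;	/* C call to function doing "incl mem" */
	double call_locked;	/* C call to function doing "lock; incl mem" */
};

static const struct timing timings[] = {
	{ "AthlonXP UP", 5.85, 18.90, 8.35, 24.95 },
	{ "Haswell SMP", 6.95, 19.00, 6.00, 26.00 },
};

/* Extra cycles when the lock prefix is added to the inline version. */
static double
lock_cost_inline(const struct timing *t)
{
	return (t->inline_locked - t->inline_plain);
}

/* Extra cycles when the lock prefix is added to the called version. */
static double
lock_cost_call(const struct timing *t)
{
	return (t->call_locked - t->call_plain);
}

/* Extra cycles for calling a function instead of inlining (no lock). */
static double
call_overhead(const struct timing *t)
{
	return (t->call_plain - t->inline_plain);
}

#ifdef SKETCH_MAIN	/* hypothetical guard, so the helpers can be reused */
int
main(void)
{
	for (size_t i = 0; i < sizeof(timings) / sizeof(timings[0]); i++)
		printf("%s: lock %+.2f inline, %+.2f via call; call %+.2f\n",
		    timings[i].cpu, lock_cost_inline(&timings[i]),
		    lock_cost_call(&timings[i]), call_overhead(&timings[i]));
	return (0);
}
#endif
```

On these numbers the prefix adds about 12 to 13 cycles inline and about
17 to 20 cycles via the call, while the call itself adds 2.5 cycles on
AthlonXP and comes out slightly negative on Haswell.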

The C code with the function call executes:

loop:
        call    incl
        incl:
                pushl   %ebp
                movl    %esp,%ebp
                [lock;] incl mem
                leave
                ret
        incl    %ebx
        cmpl    $4080000000-1,%ebx
        jbe     loop

I didn't even compile with -fomit-frame-pointer or try clang, which would
do excessive unrolling.  Keeping the frame pointer takes 3 extra
instructions in incl, but these take no extra time.
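
A userland approximation of the two function-call rows might look like
the sketch below.  It is not the test program used for the numbers above:
it assumes C11 atomics in place of hand-written asm (on x86 the atomic
add is typically compiled to a lock-prefixed instruction), it reports
CPU nanoseconds per call from clock() rather than cycles, and all of the
names in it are made up for the example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

static volatile unsigned int plain_counter;
static atomic_uint locked_counter;

/* noinline is a GCC/clang extension; it keeps a real call per iteration. */
static void __attribute__((noinline))
inc_plain(void)
{
	plain_counter++;		/* non-atomic read-modify-write */
}

static void __attribute__((noinline))
inc_locked(void)
{
	/* Atomic add; relaxed ordering keeps extra fences out of the test. */
	atomic_fetch_add_explicit(&locked_counter, 1, memory_order_relaxed);
}

/* Average CPU nanoseconds per call of fn over n calls (n > 0). */
static double
time_calls(void (*fn)(void), unsigned long n)
{
	clock_t start = clock();

	/* The compiler may or may not turn this into a direct call. */
	for (unsigned long i = 0; i < n; i++)
		fn();
	return ((double)(clock() - start) / CLOCKS_PER_SEC * 1e9 / (double)n);
}

#ifdef SKETCH_MAIN	/* hypothetical guard, so the helpers can be reused */
int
main(void)
{
	const unsigned long n = 100000000UL;

	printf("plain:  %.2f ns/call\n", time_calls(inc_plain, n));
	printf("locked: %.2f ns/call\n", time_calls(inc_locked, n));
	assert(plain_counter == n && atomic_load(&locked_counter) == n);
	return (0);
}
#endif
```

Compile with optimization and -DSKETCH_MAIN to run it.  Converting to
cycles needs the CPU clock frequency, and the loop has the same
limitation as the original test: nothing else competes for the pipeline.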

In non-benchmark use, there would be more args for the function call and
the scheduling would be very different, so the timing might be very
different.  I expect the function call would be insignificantly slower
except in micro-benchmarks.

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
