On Fri, 6 Jul 2018, John Baldwin wrote:
> On 7/6/18 8:52 AM, Rodney W. Grimes wrote:
>> ...
>> Trivial to fix this with
>> +#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
>
> This is not worth it.  Note that we already use LOCK always in userland
> which is probably far more prevalent than the use in modules.
> Previously atomics in modules were _function calls_ just to avoid the LOCK.
> Having the LOCK prefix present even on UP is probably far more efficient
> than a function call.
No, the lock prefix is less efficient.
IIRC, on very old systems (~PPro), lock prefixes cost 20 cycles in the UP
case. On AthlonXP, they cost about 19 cycles, but function calls (written
in C) only cost about 6 cycles. This depends on pipelining, and my
test is perhaps too simple since it uses a loop where the pipelining
works especially well (it executes 2 or 3 function calls in parallel).
Actual timing on AthlonXP UP:
- asm loop: 2 cycles/iteration
- "incl mem" in asm loop: 5.85 cycles (but with less alignment, only 3.25
cycles)
- "lock; incl mem" in asm loop: 18.9 cycles
- function call in C loop to C function doing "incl mem" in asm: 8.35 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 24.95
cycles.
Newer CPUs have better pipelining. On Haswell, this gives the strange
behaviour that the function call written in C is slightly faster than
inline code written in asm:
Actual timing on Haswell SMP:
- asm loop: 1.16 cycles/iteration
- "incl mem" in asm loop: 6.95 cycles
- "lock; incl mem" in asm loop: 19.00 cycles
- function call in C loop to C function doing "incl mem" in asm: 6 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 26.00
cycles.
The C code with the function call executes:
loop:
        call    incl
incl:
        pushl   %ebp
        movl    %esp,%ebp
        [lock;] incl mem
        leave
        ret
        incl    %ebx
        cmpl    $4080000000-1,%ebx
        jbe     loop
I didn't even compile with -fomit-frame-pointer or try clang, which would do
excessive unrolling.  The frame pointer takes 3 extra instructions in
incl, but these take no extra time.
In non-benchmark use, there would be more args for the function call and
the scheduling would be very different, so the timing might be very
different.  I expect the function call would be insignificantly slower
except in micro-benchmarks.
Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list