On Mon, Feb 23, 2015 at 09:43:40AM -0800, Andi Kleen wrote:
> On Mon, Feb 23, 2015 at 06:04:36PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 05:38:55PM -0800, Andi Kleen wrote:
> > 
> > > This patch moves the MSR functions out of line. A MSR access is typically
> > > 40-100 cycles or even slower, a call is a few cycles at best, so the
> > > additional function call is not really significant.
> > 
> > If I look at the below PDF a CALL+PUSH EBP+MOV RSP,RBP+ ... +POP+RET
> > ends up being 5+1.5+0.5+ .. + 1.5+8 = 16.5 + .. cycles.
> 
> You cannot just add up the latency cycles. The CPU runs all of this 
> in parallel. 
> 
> Latency cycles would only be interesting if these instructions were
> on the critical path for computing the result, which they are not. 
> 
> It should be a few cycles overhead.

I thought that since CALL touches RSP, PUSH touches RSP, MOV RSP,RBP
(obviously) touches RSP, POP touches RSP and, well, RET does too, there
were strong dependencies between the instructions and there would be
little room to parallelize things.

I'm glad you so patiently educated me on the wonders of modern
architectures and how they can indeed do all this in parallel.

Still, I wondered, so I ran me a little test. Note that I used a
serializing instruction (LOCK XCHG) because WRMSR is too.

I see a ~14 cycle per-iteration difference between the inline and
noinline versions.

If I substitute the LOCK XCHG with XADD, the difference drops to about
1.5 cycles, so clearly there is some magic happening, but serializing
instructions wreck it.

Can anybody explain how such RSP dependencies get magicked away?

---

root@ivb-ep:~# cat call.c

#define __always_inline         inline __attribute__((always_inline))
#define  noinline                       __attribute__((noinline))

static int
#ifdef FOO
noinline
#else
__always_inline
#endif
xchg(int *ptr, int val)
{
        asm volatile ("LOCK xchgl %0, %1\n"
                        : "+r" (val), "+m" (*(ptr))
                        : : "memory", "cc");
        return val;
}

void main(void)
{
        int val = 0, old;

        for (int i = 0; i < 1000000000; i++)
                old = xchg(&val, i);
}

root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -DFOO -o call call.c
root@ivb-ep:~# objdump -D call | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
  4003e0:       55                      push   %rbp
  4003e1:       48 89 e5                mov    %rsp,%rbp
  4003e4:       53                      push   %rbx
  4003e5:       31 db                   xor    %ebx,%ebx
  4003e7:       48 83 ec 18             sub    $0x18,%rsp
  4003eb:       c7 45 e0 00 00 00 00    movl   $0x0,-0x20(%rbp)
  4003f2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  4003f8:       48 8d 7d e0             lea    -0x20(%rbp),%rdi
  4003fc:       89 de                   mov    %ebx,%esi
  4003fe:       83 c3 01                add    $0x1,%ebx
  400401:       e8 fa 00 00 00          callq  400500 <xchg>
  400406:       81 fb 00 ca 9a 3b       cmp    $0x3b9aca00,%ebx
  40040c:       75 ea                   jne    4003f8 <main+0x18>
  40040e:       48 83 c4 18             add    $0x18,%rsp
  400412:       5b                      pop    %rbx
  400413:       5d                      pop    %rbp
  400414:       c3                      retq   

0000000000400500 <xchg>:
  400500:       55                      push   %rbp
  400501:       89 f0                   mov    %esi,%eax
  400503:       48 89 e5                mov    %rsp,%rbp
  400506:       f0 87 07                lock xchg %eax,(%rdi)
  400509:       5d                      pop    %rbp
  40050a:       c3                      retq   
  40050b:       90                      nop
  40050c:       90                      nop
  40050d:       90                      nop
  40050e:       90                      nop
  40050f:       90                      nop

root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -o call-inline call.c
root@ivb-ep:~# objdump -D call-inline | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
  4003e0:       55                      push   %rbp
  4003e1:       31 c0                   xor    %eax,%eax
  4003e3:       48 89 e5                mov    %rsp,%rbp
  4003e6:       c7 45 f0 00 00 00 00    movl   $0x0,-0x10(%rbp)
  4003ed:       0f 1f 00                nopl   (%rax)
  4003f0:       89 c2                   mov    %eax,%edx
  4003f2:       f0 87 55 f0             lock xchg %edx,-0x10(%rbp)
  4003f6:       83 c0 01                add    $0x1,%eax
  4003f9:       3d 00 ca 9a 3b          cmp    $0x3b9aca00,%eax
  4003fe:       75 f0                   jne    4003f0 <main+0x10>
  400400:       5d                      pop    %rbp
  400401:       c3                      retq   

root@ivb-ep:~# perf stat -e "cycles:u" ./call

 Performance counter stats for './call':

    36,309,274,162      cycles:u                 

      10.561819310 seconds time elapsed

root@ivb-ep:~# perf stat -e "cycles:u" ./call-inline 

 Performance counter stats for './call-inline':

    22,004,045,745      cycles:u                 

       6.498271508 seconds time elapsed


