Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

2018-10-11 Thread kbuild test robot
Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on v4.19-rc7 next-20181011]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Eric-Dumazet/x86-tsc-use-real-seqcount_latch-in-cyc2ns_read_begin/20181012-065625
config: x86_64-randconfig-x000-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   arch/x86/kernel/tsc.c: In function 'cyc2ns_read_begin':
>> arch/x86/kernel/tsc.c:69:33: warning: passing argument 1 of 
>> 'raw_read_seqcount_latch' discards 'const' qualifier from pointer target 
>> type [-Wdiscarded-qualifiers]
  seq = raw_read_seqcount_latch(&c2n->seq);
^
   In file included from include/linux/time.h:6:0,
from include/linux/ktime.h:24,
from include/linux/timer.h:6,
from include/linux/workqueue.h:9,
from include/linux/rhashtable-types.h:15,
from include/linux/ipc.h:7,
from include/uapi/linux/sem.h:5,
from include/linux/sem.h:5,
from include/linux/sched.h:15,
from arch/x86/kernel/tsc.c:4:
   include/linux/seqlock.h:279:19: note: expected 'seqcount_t * {aka struct 
seqcount *}' but argument is of type 'const seqcount_t * {aka const struct 
seqcount *}'
static inline int raw_read_seqcount_latch(seqcount_t *s)
  ^~~

vim +69 arch/x86/kernel/tsc.c

    59
    60  void cyc2ns_read_begin(struct cyc2ns_data *data)
    61  {
    62          const struct cyc2ns *c2n;
    63          int seq;
    64
    65          preempt_disable_notrace();
    66
    67          c2n = this_cpu_ptr(&cyc2ns);
    68          do {
  > 69                  seq = raw_read_seqcount_latch(&c2n->seq);
    70                  *data = c2n->data[seq & 1];
    71          } while (read_seqcount_retry(&c2n->seq, seq));
    72  }
    73
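
The mismatch is easy to reproduce outside the kernel tree. Here is a
minimal standalone sketch (seqcount_t, the accessor and the struct are
stubs for illustration, not the kernel definitions); gcc warns on it
with -Wdiscarded-qualifiers, which is enabled by default for C:

/* sketch.c - build with: gcc -c sketch.c */
typedef struct seqcount { unsigned sequence; } seqcount_t;

static inline int raw_read_seqcount_latch(seqcount_t *s)
{
        return s->sequence;     /* stub; the real one uses READ_ONCE() */
}

struct c2n_stub { seqcount_t seq; };

int demo(const struct c2n_stub *c2n)    /* const, as in the patch */
{
        /* warning: passing argument 1 of 'raw_read_seqcount_latch'
         * discards 'const' qualifier from pointer target type */
        return raw_read_seqcount_latch(&c2n->seq);
}

Dropping the const from the local pointer, or having the accessor take
a const seqcount_t *, would silence it.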

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

2018-10-11 Thread Peter Zijlstra
On Thu, Oct 11, 2018 at 08:00:42AM -0700, Eric Dumazet wrote:
> Yes, but the code size is bigger (I have looked at the disassembly)
> 
> All these %gs plus offset add up

> Total length : 0xA7 bytes

effective length: 0x78 bytes

> 02a0 <native_sched_clock>:
>  2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
>  2a5: 48 83 e4 f0           and    $0xfffffffffffffff0,%rsp
>  2a9: 41 ff 72 f8           pushq  -0x8(%r10)
>  2ad: 55                    push   %rbp
>  2ae: 48 89 e5              mov    %rsp,%rbp
>  2b1: 41 52                 push   %r10

   2b3: 66 66 66 66 90  k8_nop5_atomic

>  2b8: 0f 31                 rdtsc
>  2ba: 48 c1 e2 20           shl    $0x20,%rdx
>  2be: 48 09 c2              or     %rax,%rdx
>  2c1: 49 89 d2              mov    %rdx,%r10
>  2c4: 49 c7 c1 00 00 00 00  mov    $0x0,%r9
>   2c7: R_X86_64_32S .data..percpu..shared_aligned
>  2cb: 65 8b 05 00 00 00 00  mov    %gs:0x0(%rip),%eax        # 2d2
>   2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
>  2d2: 89 c1                 mov    %eax,%ecx
>  2d4: 83 e1 01              and    $0x1,%ecx
>  2d7: 48 c1 e1 04           shl    $0x4,%rcx
>  2db: 4c 01 c9              add    %r9,%rcx
>  2de: 65 48 8b 79 08        mov    %gs:0x8(%rcx),%rdi
>  2e3: 65 8b 31              mov    %gs:(%rcx),%esi
>  2e6: 65 8b 49 04           mov    %gs:0x4(%rcx),%ecx
>  2ea: 65 44 8b 05 00 00 00  mov    %gs:0x0(%rip),%r8d        # 2f2
>  2f1: 00
>   2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
>  2f2: 44 39 c0              cmp    %r8d,%eax
>  2f5: 75 d4                 jne    2cb
>  2f7: 89 f6                 mov    %esi,%esi
>  2f9: 48 89 f0              mov    %rsi,%rax
>  2fc: 49 f7 e2              mul    %r10
>  2ff: 48 0f ad d0           shrd   %cl,%rdx,%rax
>  303: 48 d3 ea              shr    %cl,%rdx
>  306: f6 c1 40              test   $0x40,%cl
>  309: 48 0f 45 c2           cmovne %rdx,%rax
>  30d: 48 01 f8              add    %rdi,%rax
>  310: 41 5a                 pop    %r10
>  312: 5d                    pop    %rbp
>  313: 49 8d 62 f8           lea    -0x8(%r10),%rsp
>  317: c3                    retq
> 
> New version :
> 
> Total length = 0x91 bytes

effective: 0x71

> 
> 02a0 <native_sched_clock>:
>  2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
>  2a5: 48 83 e4 f0           and    $0xfffffffffffffff0,%rsp
>  2a9: 41 ff 72 f8           pushq  -0x8(%r10)
>  2ad: 55                    push   %rbp
>  2ae: 48 89 e5              mov    %rsp,%rbp
>  2b1: 41 52                 push   %r10

   2b3: 66 66 66 66 90  k8_nop5_atomic

>  2b8: 0f 31                 rdtsc
>  2ba: 48 c1 e2 20           shl    $0x20,%rdx
>  2be: 48 09 c2              or     %rax,%rdx
>  2c1: 49 89 d1              mov    %rdx,%r9
>  2c4: 49 c7 c0 00 00 00 00  mov    $0x0,%r8
>   2c7: R_X86_64_32S .data..percpu..shared_aligned
>  2cb: 65 4c 03 05 00 00 00  add    %gs:0x0(%rip),%r8        # 2d3
>  2d2: 00
>   2cf: R_X86_64_PC32 this_cpu_off-0x4
>  2d3: 41 8b 40 20           mov    0x20(%r8),%eax
>  2d7: 89 c6                 mov    %eax,%esi
>  2d9: 83 e6 01              and    $0x1,%esi
>  2dc: 48 c1 e6 04           shl    $0x4,%rsi
>  2e0: 4c 01 c6              add    %r8,%rsi
>  2e3: 8b 3e                 mov    (%rsi),%edi
>  2e5: 8b 4e 04              mov    0x4(%rsi),%ecx
>  2e8: 48 8b 76 08           mov    0x8(%rsi),%rsi
>  2ec: 41 3b 40 20           cmp    0x20(%r8),%eax
>  2f0: 75 e1                 jne    2d3
>  2f2: 48 89 f8              mov    %rdi,%rax
>  2f5: 49 f7 e1              mul    %r9
>  2f8: 48 0f ad d0           shrd   %cl,%rdx,%rax
>  2fc: 48 d3 ea              shr    %cl,%rdx
>  2ff: f6 c1 40              test   $0x40,%cl
>  302: 48 0f 45 c2           cmovne %rdx,%rax
>  306: 48 01 f0              add    %rsi,%rax
>  309: 41 5a                 pop    %r10
>  30b: 5d                    pop    %rbp
>  30c: 49 8d 62 f8           lea    -0x8(%r10),%rsp
>  310: c3                    retq

Ah, right you are. But my version only touches the one cacheline,
whereas yours will do that extra cpu offset load, which might or might
not be hot.

Difficult..

You have some weird stack setup though.. mine doesn't have that:

$ objdump -dr defconfig-build/arch/x86/kernel/tsc.o | awk '/<native_sched_clock>:$/ {p=1} /^$/ {p=0} {if (p) print $0}'
0b00 <native_sched_clock>:
 b00:   55                      push   %rbp
 b01:   48 89 e5                mov    %rsp,%rbp
 b04:   66 66 66 66 90          k8_nop5_atomic
 b09:   0f 31                   rdtsc
 b0b:   48 c1 e2 20             shl    $0x20,%rdx
 b0f:   48 09 c2                or     %rax,%rdx
 b12:   49 89 d2                mov    %rdx,%r10
 b15:   49 c7 c1 00 00 00 00    mov    $0x0,%r9
 b18: R_X86_64_32S

Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

2018-10-11 Thread Eric Dumazet
On Thu, Oct 11, 2018 at 12:31 AM Peter Zijlstra wrote:
>
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
>
> Does not for me, that is, the resulting asm is actually larger

[resent in non html]

I counted the number of bytes of text, and the code size after my
patch really is smaller.

%gs prefixes are one byte, but the 32-bit offsets add up.

Prior version (with the loop reinstated, thanks to one smp_rmb()):

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..d4ad30bbaa0cc8bf81da0272829da26f229734bf 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -71,7 +71,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
 		data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
 		data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
 		data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
+		smp_rmb();
 	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }
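
Why that one-line smp_rmb() brings the loop back: on x86, smp_rmb()
reduces to barrier(), a compiler-level memory clobber, so gcc can no
longer assume the re-read of cyc2ns.seq.sequence yields the value it
already holds in a register. A userspace sketch of the same shape
(barrier() and the globals are stand-ins, not the kernel definitions):

#define barrier() asm volatile("" ::: "memory")

static unsigned int sequence;           /* stands in for cyc2ns.seq.sequence */
static unsigned long long payload;      /* stands in for the cyc2ns data */

unsigned long long read_stable(void)
{
        unsigned int seq;
        unsigned long long val;

        do {
                seq = sequence;
                val = payload;
                barrier();      /* without this clobber, gcc may fold the
                                 * two reads of 'sequence' into one and
                                 * delete the retry loop entirely */
        } while (seq != sequence);
        return val;
}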


Total length : 0xA7 bytes

02a0 <native_sched_clock>:
 2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
 2a5: 48 83 e4 f0           and    $0xfffffffffffffff0,%rsp
 2a9: 41 ff 72 f8           pushq  -0x8(%r10)
 2ad: 55                    push   %rbp
 2ae: 48 89 e5              mov    %rsp,%rbp
 2b1: 41 52                 push   %r10
 2b3: e9 60 00 00 00        jmpq   318
 2b8: 0f 31                 rdtsc
 2ba: 48 c1 e2 20           shl    $0x20,%rdx
 2be: 48 09 c2              or     %rax,%rdx
 2c1: 49 89 d2              mov    %rdx,%r10
 2c4: 49 c7 c1 00 00 00 00  mov    $0x0,%r9
  2c7: R_X86_64_32S .data..percpu..shared_aligned
 2cb: 65 8b 05 00 00 00 00  mov    %gs:0x0(%rip),%eax        # 2d2
  2ce: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
 2d2: 89 c1                 mov    %eax,%ecx
 2d4: 83 e1 01              and    $0x1,%ecx
 2d7: 48 c1 e1 04           shl    $0x4,%rcx
 2db: 4c 01 c9              add    %r9,%rcx
 2de: 65 48 8b 79 08        mov    %gs:0x8(%rcx),%rdi
 2e3: 65 8b 31              mov    %gs:(%rcx),%esi
 2e6: 65 8b 49 04           mov    %gs:0x4(%rcx),%ecx
 2ea: 65 44 8b 05 00 00 00  mov    %gs:0x0(%rip),%r8d        # 2f2
 2f1: 00
  2ee: R_X86_64_PC32 .data..percpu..shared_aligned+0x1c
 2f2: 44 39 c0              cmp    %r8d,%eax
 2f5: 75 d4                 jne    2cb
 2f7: 89 f6                 mov    %esi,%esi
 2f9: 48 89 f0              mov    %rsi,%rax
 2fc: 49 f7 e2              mul    %r10
 2ff: 48 0f ad d0           shrd   %cl,%rdx,%rax
 303: 48 d3 ea              shr    %cl,%rdx
 306: f6 c1 40              test   $0x40,%cl
 309: 48 0f 45 c2           cmovne %rdx,%rax
 30d: 48 01 f8              add    %rdi,%rax
 310: 41 5a                 pop    %r10
 312: 5d                    pop    %rbp
 313: 49 8d 62 f8           lea    -0x8(%r10),%rsp
 317: c3                    retq
 318: 48 69 05 00 00 00 00  imul   $0x3d0900,0x0(%rip),%rax        # 323
 31f: 00 09 3d 00
  31b: R_X86_64_PC32 jiffies_64-0x8
 323: 41 5a                 pop    %r10
 325: 48 ba 00 b8 64 d9 45  movabs $0xffc2f745d964b800,%rdx
 32c: f7 c2 ff
 32f: 5d                    pop    %rbp
 330: 49 8d 62 f8           lea    -0x8(%r10),%rsp
 334: 48 01 d0              add    %rdx,%rax
 337: c3                    retq

New version (my cleanup patch)

Total length = 0x91 bytes

02a0 <native_sched_clock>:
 2a0: 4c 8d 54 24 08        lea    0x8(%rsp),%r10
 2a5: 48 83 e4 f0           and    $0xfffffffffffffff0,%rsp
 2a9: 41 ff 72 f8           pushq  -0x8(%r10)
 2ad: 55                    push   %rbp
 2ae: 48 89 e5              mov    %rsp,%rbp
 2b1: 41 52                 push   %r10
 2b3: e9 59 00 00 00        jmpq   311
 2b8: 0f 31                 rdtsc
 2ba: 48 c1 e2 20           shl    $0x20,%rdx
 2be: 48 09 c2              or     %rax,%rdx
 2c1: 49 89 d1              mov    %rdx,%r9
 2c4: 49 c7 c0 00 00 00 00  mov    $0x0,%r8
  2c7: R_X86_64_32S .data..percpu..shared_aligned
 2cb: 65 4c 03 05 00 00 00  add    %gs:0x0(%rip),%r8        # 2d3
 2d2: 00
  2cf: R_X86_64_PC32 this_cpu_off-0x4
 2d3: 41 8b 40 20           mov    0x20(%r8),%eax
 2d7: 89 c6                 mov    %eax,%esi
 2d9: 83 e6 01              and    $0x1,%esi
 2dc: 48 c1 e6 04           shl    $0x4,%rsi
 2e0: 4c 01 c6              add    %r8,%rsi
 2e3: 8b 3e                 mov    (%rsi),%edi
 2e5: 8b 4e 04              mov    0x4(%rsi),%ecx
 2e8: 48 8b 76 08           mov

Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

2018-10-11 Thread Peter Zijlstra
On Thu, Oct 11, 2018 at 09:31:33AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> > 
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> > 
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
> 
> Does not for me, that is, the resulting asm is actually larger
> 
> You're quite right the loop went missing; no idea wth that compiler is
> smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
> think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
> will return the same value. But this_cpu_read() is an asm() statement,
> it _should_ not assume such.
> 
> We assume that this_cpu_read() implies READ_ONCE() in a number of
> locations, this really should not happen.
> 
> The reason it was written using this_cpu_read() is so that it can use
> %gs: prefixed instructions and avoid ever loading that percpu offset and
> doing manual address computation.
> 
> Let me prod at this with a sharp stick.

OK, so apart from the inlining being crap, which is fixed by:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6490f618e096..638491062fea 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -57,7 +57,7 @@ struct cyc2ns {
 
 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
 
-void cyc2ns_read_begin(struct cyc2ns_data *data)
+void __always_inline cyc2ns_read_begin(struct cyc2ns_data *data)
 {
int seq, idx;
 
@@ -74,7 +74,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }
 
-void cyc2ns_read_end(void)
+void __always_inline cyc2ns_read_end(void)
 {
preempt_enable_notrace();
 }
@@ -103,7 +103,7 @@ void cyc2ns_read_end(void)
  *  -johns...@us.ibm.com "math is hard, lets go shopping!"
  */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
struct cyc2ns_data data;
unsigned long long ns;


That gets us:

native_sched_clock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,

         ... jump label ...

        rdtsc
        salq    $32, %rdx       #, tmp110
        orq     %rax, %rdx      # low, tmp110
        movl    %gs:cyc2ns+32(%rip),%ecx        # cyc2ns.seq.sequence, pfo_ret__
        andl    $1, %ecx        #, idx
        salq    $4, %rcx        #, tmp116
        movl    %gs:cyc2ns(%rcx),%esi   # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
        movl    %esi, %esi      # pfo_ret__, pfo_ret__
        movq    %rsi, %rax      # pfo_ret__, tmp133
        mulq    %rdx    # _23
        movq    %gs:cyc2ns+8(%rcx),%rdi # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
        addq    $cyc2ns, %rcx   #, tmp117
        movl    %gs:4(%rcx),%ecx        # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
        shrdq   %rdx, %rax      # pfo_ret__,, tmp134
        shrq    %cl, %rdx       # pfo_ret__,
        testb   $64, %cl        #, pfo_ret__
        cmovne  %rdx, %rax      #,, tmp134
        addq    %rdi, %rax      # pfo_ret__,
        popq    %rbp    #
        ret


If we then fix the percpu mess, with:

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index e9202a0de8f0..1a19d11cfbbd 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,22 +185,22 @@ do {                                            \
 	typeof(var) pfo_ret__;                          \
 	switch (sizeof(var)) {                          \
 	case 1:                                         \
-		asm(op "b "__percpu_arg(1)",%0"         \
+		asm volatile(op "b "__percpu_arg(1)",%0"\
 		    : "=q" (pfo_ret__)                  \
 		    : "m" (var));                       \
 		break;                                  \
 	case 2:                                         \
-		asm(op "w "__percpu_arg(1)",%0"         \
+		asm volatile(op "w "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)                  \
 		    : "m" (var));                       \
 		break;                                  \
 	case 4:                                         \
-		asm(op "l "__percpu_arg(1)",%0"         \
+		asm volatile(op "l "__percpu_arg(1)",%0"\
 		    : "=r" (pfo_ret__)                  \
 		    : "m" (var));                       \
 		break;                                  \
 	case 8:                                         \
-		asm(op "q "__percpu_arg(1)",%0"         \
+
Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()

2018-10-11 Thread Peter Zijlstra
On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> While looking at native_sched_clock() disassembly I had
> the surprise to see the compiler (gcc 7.3 here) had
> optimized out the loop, meaning the code is broken.
> 
> Using the documented and approved API not only fixes the bug,
> it also makes the code more readable.
> 
> Replacing five this_cpu_read() by one this_cpu_ptr() makes
> the generated code smaller.

Does not for me, that is, the resulting asm is actually larger

You're quite right the loop went missing; no idea wth that compiler is
smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
will return the same value. But this_cpu_read() is an asm() statement,
it _should_ not assume such.

We assume that this_cpu_read() implies READ_ONCE() in a number of
locations, this really should not happen.

The reason it was written using this_cpu_read() is so that it can use
%gs: prefixed instructions and avoid ever loading that percpu offset and
doing manual address computation.

Let me prod at this with a sharp stick.
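
The assumption is permitted, though: gcc's documentation allows a
non-volatile asm with outputs to be treated as a pure function of its
inputs, so two executions with the same "m" operand can be merged. A
small stand-in (plain userspace C, not the real this_cpu_read())
showing both forms:

static unsigned int seq_mem;

/* like this_cpu_read() today: no volatile, so gcc may CSE two calls
 * on the same memory operand into a single load */
static inline unsigned int read_seq(void)
{
        unsigned int v;
        asm("movl %1, %0" : "=r" (v) : "m" (seq_mem));
        return v;
}

/* with volatile, every call really performs the load again */
static inline unsigned int read_seq_volatile(void)
{
        unsigned int v;
        asm volatile("movl %1, %0" : "=r" (v) : "m" (seq_mem));
        return v;
}

unsigned int retry_demo(void)
{
        unsigned int s;
        do {
                s = read_seq();  /* using read_seq(), gcc may prove
                                  * s == read_seq() below and drop the
                                  * loop, as seen in tsc.c */
        } while (s != read_seq());
        return s;
}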

