Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on v4.19-rc7 next-20181011]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/x86-tsc-use-real-seqcount_latch-in-cyc2ns_read_begin/20181012-065625
config: x86_64-randconfig-x000-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All warnings (new ones prefixed by >>):

   arch/x86/kernel/tsc.c: In function 'cyc2ns_read_begin':
>> arch/x86/kernel/tsc.c:69:33: warning: passing argument 1 of 'raw_read_seqcount_latch' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
      seq = raw_read_seqcount_latch(&c2n->seq);
                                    ^
   In file included from include/linux/time.h:6:0,
                    from include/linux/ktime.h:24,
                    from include/linux/timer.h:6,
                    from include/linux/workqueue.h:9,
                    from include/linux/rhashtable-types.h:15,
                    from include/linux/ipc.h:7,
                    from include/uapi/linux/sem.h:5,
                    from include/linux/sem.h:5,
                    from include/linux/sched.h:15,
                    from arch/x86/kernel/tsc.c:4:
   include/linux/seqlock.h:279:19: note: expected 'seqcount_t * {aka struct seqcount *}' but argument is of type 'const seqcount_t * {aka const struct seqcount *}'
    static inline int raw_read_seqcount_latch(seqcount_t *s)
                      ^~~

vim +69 arch/x86/kernel/tsc.c

    59
    60  void cyc2ns_read_begin(struct cyc2ns_data *data)
    61  {
    62          const struct cyc2ns *c2n;
    63          int seq;
    64
    65          preempt_disable_notrace();
    66
    67          c2n = this_cpu_ptr(&cyc2ns);
    68          do {
  > 69                  seq = raw_read_seqcount_latch(&c2n->seq);
    70                  *data = c2n->data[seq & 1];
    71          } while (read_seqcount_retry(&c2n->seq, seq));
    72  }
    73

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
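The diagnostic is easy to reproduce outside the kernel. A reduced userspace sketch (stand-in types of my own, not the kernel's; gcc -Wall -c repro.c emits the same -Wdiscarded-qualifiers warning):

/* repro.c: minimal stand-ins for seqcount_t and struct cyc2ns, just to
 * show the qualifier mismatch the robot reports. */
typedef struct seqcount { unsigned sequence; } seqcount_t;

struct cyc2ns { seqcount_t seq; };

/* The read API is declared with a plain, non-const pointer... */
static inline unsigned raw_read_seqcount_latch(seqcount_t *s)
{
        return *(volatile unsigned *)&s->sequence;
}

unsigned read_begin(struct cyc2ns *pcpu)
{
        /* ...but the local is const, so &c2n->seq has type
         * 'const seqcount_t *' and gcc warns that the qualifier is
         * discarded.  Dropping the const from c2n (or constifying the
         * read-side API) silences it. */
        const struct cyc2ns *c2n = pcpu;

        return raw_read_seqcount_latch(&c2n->seq);
}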
Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
On Thu, Oct 11, 2018 at 08:00:42AM -0700, Eric Dumazet wrote:
> Yes, but the code size is bigger (I have looked at the disassembly)
>
> All these %gs plus offset add up
>
> Total length : 0xA7 bytes, effective length : 0x78 bytes
>
> 02a0 <native_sched_clock>:
>  2a0:   4c 8d 54 24 08          lea    0x8(%rsp),%r10
>  2a5:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>  2a9:   41 ff 72 f8             pushq  -0x8(%r10)
>  2ad:   55                      push   %rbp
>  2ae:   48 89 e5                mov    %rsp,%rbp
>  2b1:   41 52                   push   %r10
>  2b3:   66 66 66 66 90          k8_nop5_atomic
>  2b8:   0f 31                   rdtsc
>  2ba:   48 c1 e2 20             shl    $0x20,%rdx
>  2be:   48 09 c2                or     %rax,%rdx
>  2c1:   49 89 d2                mov    %rdx,%r10
>  2c4:   49 c7 c1 00 00 00 00    mov    $0x0,%r9
>                         2c7: R_X86_64_32S       .data..percpu..shared_aligned
>  2cb:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # 2d2 <native_sched_clock+0x32>
>                         2ce: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
>  2d2:   89 c1                   mov    %eax,%ecx
>  2d4:   83 e1 01                and    $0x1,%ecx
>  2d7:   48 c1 e1 04             shl    $0x4,%rcx
>  2db:   4c 01 c9                add    %r9,%rcx
>  2de:   65 48 8b 79 08          mov    %gs:0x8(%rcx),%rdi
>  2e3:   65 8b 31                mov    %gs:(%rcx),%esi
>  2e6:   65 8b 49 04             mov    %gs:0x4(%rcx),%ecx
>  2ea:   65 44 8b 05 00 00 00    mov    %gs:0x0(%rip),%r8d        # 2f2 <native_sched_clock+0x52>
>  2f1:   00
>                         2ee: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
>  2f2:   44 39 c0                cmp    %r8d,%eax
>  2f5:   75 d4                   jne    2cb <native_sched_clock+0x2b>
>  2f7:   89 f6                   mov    %esi,%esi
>  2f9:   48 89 f0                mov    %rsi,%rax
>  2fc:   49 f7 e2                mul    %r10
>  2ff:   48 0f ad d0             shrd   %cl,%rdx,%rax
>  303:   48 d3 ea                shr    %cl,%rdx
>  306:   f6 c1 40                test   $0x40,%cl
>  309:   48 0f 45 c2             cmovne %rdx,%rax
>  30d:   48 01 f8                add    %rdi,%rax
>  310:   41 5a                   pop    %r10
>  312:   5d                      pop    %rbp
>  313:   49 8d 62 f8             lea    -0x8(%r10),%rsp
>  317:   c3                      retq
>
> New version :
>
> Total length = 0x91 bytes, effective : 0x71
>
> 02a0 <native_sched_clock>:
>  2a0:   4c 8d 54 24 08          lea    0x8(%rsp),%r10
>  2a5:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>  2a9:   41 ff 72 f8             pushq  -0x8(%r10)
>  2ad:   55                      push   %rbp
>  2ae:   48 89 e5                mov    %rsp,%rbp
>  2b1:   41 52                   push   %r10
>  2b3:   66 66 66 66 90          k8_nop5_atomic
>  2b8:   0f 31                   rdtsc
>  2ba:   48 c1 e2 20             shl    $0x20,%rdx
>  2be:   48 09 c2                or     %rax,%rdx
>  2c1:   49 89 d1                mov    %rdx,%r9
>  2c4:   49 c7 c0 00 00 00 00    mov    $0x0,%r8
>                         2c7: R_X86_64_32S       .data..percpu..shared_aligned
>  2cb:   65 4c 03 05 00 00 00    add    %gs:0x0(%rip),%r8        # 2d3 <native_sched_clock+0x33>
>  2d2:   00
>                         2cf: R_X86_64_PC32      this_cpu_off-0x4
>  2d3:   41 8b 40 20             mov    0x20(%r8),%eax
>  2d7:   89 c6                   mov    %eax,%esi
>  2d9:   83 e6 01                and    $0x1,%esi
>  2dc:   48 c1 e6 04             shl    $0x4,%rsi
>  2e0:   4c 01 c6                add    %r8,%rsi
>  2e3:   8b 3e                   mov    (%rsi),%edi
>  2e5:   8b 4e 04                mov    0x4(%rsi),%ecx
>  2e8:   48 8b 76 08             mov    0x8(%rsi),%rsi
>  2ec:   41 3b 40 20             cmp    0x20(%r8),%eax
>  2f0:   75 e1                   jne    2d3 <native_sched_clock+0x33>
>  2f2:   48 89 f8                mov    %rdi,%rax
>  2f5:   49 f7 e1                mul    %r9
>  2f8:   48 0f ad d0             shrd   %cl,%rdx,%rax
>  2fc:   48 d3 ea                shr    %cl,%rdx
>  2ff:   f6 c1 40                test   $0x40,%cl
>  302:   48 0f 45 c2             cmovne %rdx,%rax
>  306:   48 01 f0                add    %rsi,%rax
>  309:   41 5a                   pop    %r10
>  30b:   5d                      pop    %rbp
>  30c:   49 8d 62 f8             lea    -0x8(%r10),%rsp
>  310:   c3                      retq

Ah, right you are. But my version only touches the one cacheline,
whereas yours will do that extra cpu offset load, which might or might
not be hot. Difficult..

You have some weird stack setup though.. mine doesn't have that:

$ objdump -dr defconfig-build/arch/x86/kernel/tsc.o | awk '/<native_sched_clock>:$/ {p=1} /^$/ {p=0} {if (p) print $0}'

0b00 <native_sched_clock>:
 b00:   55                      push   %rbp
 b01:   48 89 e5                mov    %rsp,%rbp
 b04:   66 66 66 66 90          k8_nop5_atomic
 b09:   0f 31                   rdtsc
 b0b:   48 c1 e2 20             shl    $0x20,%rdx
 b0f:   48 09 c2                or     %rax,%rdx
 b12:   49 89 d2                mov    %rdx,%r10
 b15:   49 c7 c1 00 00 00 00    mov    $0x0,%r9
                        b18: R_X86_64_32S
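To tie the listings back to source, the two shapes being compared compile from roughly this C (abbreviated from the variants posted elsewhere in this thread; a kernel-context sketch, not a standalone program):

/* Shape A, per-field this_cpu_read(), with the smp_rmb() that
 * reinstates the loop: every access is a single %gs-prefixed
 * instruction carrying a 32-bit offset, no per-CPU base pointer is
 * ever materialized, and everything touched sits in one cacheline. */
do {
        seq = this_cpu_read(cyc2ns.seq.sequence);
        idx = seq & 1;
        data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
        data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
        data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
        smp_rmb();
} while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));

/* Shape B, the patch's single this_cpu_ptr(): the 'add %gs:...,%r8' in
 * the second listing is the extra this_cpu_off load mentioned above;
 * after it, every field access is a short register-relative
 * instruction. */
c2n = this_cpu_ptr(&cyc2ns);
do {
        seq = raw_read_seqcount_latch(&c2n->seq);
        *data = c2n->data[seq & 1];
} while (read_seqcount_retry(&c2n->seq, seq));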
Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
On Thu, Oct 11, 2018 at 12:31 AM Peter Zijlstra wrote:
>
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
>
> Does not for me, that is, the resulting asm is actually larger

[resent in non html]

I counted the number of bytes of text, and really, after my patch, the
code size is smaller. %gs prefixes are one byte, but the 32-bit offsets
add up.

Prior version (with reinstated loop, thanks to one smp_rmb()):

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b52bd2b6cdb443ba0c89d78aaa52b02b82a10b6e..d4ad30bbaa0cc8bf81da0272829da26f229734bf 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -71,7 +71,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
                data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
                data->cyc2ns_mul    = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
                data->cyc2ns_shift  = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
-
+               smp_rmb();
        } while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }

Total length : 0xA7 bytes

02a0 <native_sched_clock>:
 2a0:   4c 8d 54 24 08          lea    0x8(%rsp),%r10
 2a5:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
 2a9:   41 ff 72 f8             pushq  -0x8(%r10)
 2ad:   55                      push   %rbp
 2ae:   48 89 e5                mov    %rsp,%rbp
 2b1:   41 52                   push   %r10
 2b3:   e9 60 00 00 00          jmpq   318 <native_sched_clock+0x78>
 2b8:   0f 31                   rdtsc
 2ba:   48 c1 e2 20             shl    $0x20,%rdx
 2be:   48 09 c2                or     %rax,%rdx
 2c1:   49 89 d2                mov    %rdx,%r10
 2c4:   49 c7 c1 00 00 00 00    mov    $0x0,%r9
                        2c7: R_X86_64_32S       .data..percpu..shared_aligned
 2cb:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # 2d2 <native_sched_clock+0x32>
                        2ce: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
 2d2:   89 c1                   mov    %eax,%ecx
 2d4:   83 e1 01                and    $0x1,%ecx
 2d7:   48 c1 e1 04             shl    $0x4,%rcx
 2db:   4c 01 c9                add    %r9,%rcx
 2de:   65 48 8b 79 08          mov    %gs:0x8(%rcx),%rdi
 2e3:   65 8b 31                mov    %gs:(%rcx),%esi
 2e6:   65 8b 49 04             mov    %gs:0x4(%rcx),%ecx
 2ea:   65 44 8b 05 00 00 00    mov    %gs:0x0(%rip),%r8d        # 2f2 <native_sched_clock+0x52>
 2f1:   00
                        2ee: R_X86_64_PC32      .data..percpu..shared_aligned+0x1c
 2f2:   44 39 c0                cmp    %r8d,%eax
 2f5:   75 d4                   jne    2cb <native_sched_clock+0x2b>
 2f7:   89 f6                   mov    %esi,%esi
 2f9:   48 89 f0                mov    %rsi,%rax
 2fc:   49 f7 e2                mul    %r10
 2ff:   48 0f ad d0             shrd   %cl,%rdx,%rax
 303:   48 d3 ea                shr    %cl,%rdx
 306:   f6 c1 40                test   $0x40,%cl
 309:   48 0f 45 c2             cmovne %rdx,%rax
 30d:   48 01 f8                add    %rdi,%rax
 310:   41 5a                   pop    %r10
 312:   5d                      pop    %rbp
 313:   49 8d 62 f8             lea    -0x8(%r10),%rsp
 317:   c3                      retq
 318:   48 69 05 00 00 00 00    imul   $0x3d0900,0x0(%rip),%rax        # 323 <native_sched_clock+0x83>
 31f:   00 09 3d 00
                        31b: R_X86_64_PC32      jiffies_64-0x8
 323:   41 5a                   pop    %r10
 325:   48 ba 00 b8 64 d9 45    movabs $0xffc2f745d964b800,%rdx
 32c:   f7 c2 ff
 32f:   5d                      pop    %rbp
 330:   49 8d 62 f8             lea    -0x8(%r10),%rsp
 334:   48 01 d0                add    %rdx,%rax
 337:   c3                      retq

New version (my cleanup patch):

Total length = 0x91 bytes

02a0 <native_sched_clock>:
 2a0:   4c 8d 54 24 08          lea    0x8(%rsp),%r10
 2a5:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
 2a9:   41 ff 72 f8             pushq  -0x8(%r10)
 2ad:   55                      push   %rbp
 2ae:   48 89 e5                mov    %rsp,%rbp
 2b1:   41 52                   push   %r10
 2b3:   e9 59 00 00 00          jmpq   311 <native_sched_clock+0x71>
 2b8:   0f 31                   rdtsc
 2ba:   48 c1 e2 20             shl    $0x20,%rdx
 2be:   48 09 c2                or     %rax,%rdx
 2c1:   49 89 d1                mov    %rdx,%r9
 2c4:   49 c7 c0 00 00 00 00    mov    $0x0,%r8
                        2c7: R_X86_64_32S       .data..percpu..shared_aligned
 2cb:   65 4c 03 05 00 00 00    add    %gs:0x0(%rip),%r8        # 2d3 <native_sched_clock+0x33>
 2d2:   00
                        2cf: R_X86_64_PC32      this_cpu_off-0x4
 2d3:   41 8b 40 20             mov    0x20(%r8),%eax
 2d7:   89 c6                   mov    %eax,%esi
 2d9:   83 e6 01                and    $0x1,%esi
 2dc:   48 c1 e6 04             shl    $0x4,%rsi
 2e0:   4c 01 c6                add    %r8,%rsi
 2e3:   8b 3e                   mov    (%rsi),%edi
 2e5:   8b 4e 04                mov    0x4(%rsi),%ecx
 2e8:   48 8b 76 08             mov    0x8(%rsi),%rsi
 2ec:   41 3b 40 20             cmp    0x20(%r8),%eax
 2f0:   75 e1                   jne    2d3 <native_sched_clock+0x33>
 2f2:   48 89 f8                mov    %rdi,%rax
 2f5:   49 f7 e1                mul    %r9
 2f8:   48 0f ad d0             shrd   %cl,%rdx,%rax
 2fc:   48 d3 ea                shr    %cl,%rdx
 2ff:   f6 c1 40                test   $0x40,%cl
 302:   48 0f 45 c2             cmovne %rdx,%rax
 306:   48 01 f0                add    %rsi,%rax
 309:   41 5a                   pop    %r10
 30b:   5d                      pop    %rbp
 30c:   49 8d 62 f8             lea    -0x8(%r10),%rsp
 310:   c3                      retq
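For readers reconstructing why the one-line smp_rmb() reinstates the loop: it orders the data loads against the second sequence load and, being a compiler barrier, forces the sequence to actually be re-read. A userspace model of that read side (my sketch, not kernel code; barrier() stands in for smp_rmb(), which on x86 compiles to exactly this):

/* seqread.c: userspace model of the latch read loop. */
#define barrier()       __asm__ __volatile__("" ::: "memory")
#define READ_ONCE_U(x)  (*(volatile unsigned *)&(x))

struct cyc2ns_data {
        unsigned mul;
        unsigned shift;
        unsigned long long offset;
};

static unsigned sequence;               /* bumped by a hypothetical writer */
static struct cyc2ns_data latch[2];     /* writer updates latch[!(seq & 1)] */

struct cyc2ns_data read_latch(void)
{
        struct cyc2ns_data d;
        unsigned seq;

        do {
                seq = READ_ONCE_U(sequence);
                d = latch[seq & 1];     /* copy the currently stable half */
                barrier();              /* order the loads and force the
                                         * sequence re-load below */
        } while (seq != READ_ONCE_U(sequence));

        return d;
}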
Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
On Thu, Oct 11, 2018 at 09:31:33AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> > While looking at native_sched_clock() disassembly I had
> > the surprise to see the compiler (gcc 7.3 here) had
> > optimized out the loop, meaning the code is broken.
> >
> > Using the documented and approved API not only fixes the bug,
> > it also makes the code more readable.
> >
> > Replacing five this_cpu_read() by one this_cpu_ptr() makes
> > the generated code smaller.
>
> Does not for me, that is, the resulting asm is actually larger
>
> You're quite right the loop went missing; no idea wth that compiler is
> smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
> think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
> will return the same value. But this_cpu_read() is an asm() statement,
> it _should_ not assume such.
>
> We assume that this_cpu_read() implies READ_ONCE() in a number of
> locations, this really should not happen.
>
> The reason it was written using this_cpu_read() is so that it can use
> %gs: prefixed instructions and avoid ever loading that percpu offset and
> doing manual address computation.
>
> Let me prod at this with a sharp stick.

OK, so apart from the inlining being crap, which is fixed by:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6490f618e096..638491062fea 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -57,7 +57,7 @@ struct cyc2ns {

 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);

-void cyc2ns_read_begin(struct cyc2ns_data *data)
+void __always_inline cyc2ns_read_begin(struct cyc2ns_data *data)
 {
        int seq, idx;

@@ -74,7 +74,7 @@ void cyc2ns_read_begin(struct cyc2ns_data *data)
        } while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence)));
 }

-void cyc2ns_read_end(void)
+void __always_inline cyc2ns_read_end(void)
 {
        preempt_enable_notrace();
 }
@@ -103,7 +103,7 @@ void cyc2ns_read_end(void)
  *                      -johns...@us.ibm.com "math is hard, lets go shopping!"
  */

-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
        struct cyc2ns_data data;
        unsigned long long ns;

That gets us:

native_sched_clock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,

        ... jump label ...

        rdtsc
        salq    $32, %rdx       #, tmp110
        orq     %rax, %rdx      # low, tmp110
        movl    %gs:cyc2ns+32(%rip),%ecx        # cyc2ns.seq.sequence, pfo_ret__
        andl    $1, %ecx        #, idx
        salq    $4, %rcx        #, tmp116
        movl    %gs:cyc2ns(%rcx),%esi   # cyc2ns.data[idx_14].cyc2ns_mul, pfo_ret__
        movl    %esi, %esi      # pfo_ret__, pfo_ret__
        movq    %rsi, %rax      # pfo_ret__, tmp133
        mulq    %rdx    # _23
        movq    %gs:cyc2ns+8(%rcx),%rdi # cyc2ns.data[idx_14].cyc2ns_offset, pfo_ret__
        addq    $cyc2ns, %rcx   #, tmp117
        movl    %gs:4(%rcx),%ecx        # cyc2ns.data[idx_14].cyc2ns_shift, pfo_ret__
        shrdq   %rdx, %rax      # pfo_ret__,, tmp134
        shrq    %cl, %rdx       # pfo_ret__,
        testb   $64, %cl        #, pfo_ret__
        cmovne  %rdx, %rax      #,, tmp134
        addq    %rdi, %rax      # pfo_ret__,
        popq    %rbp    #
        ret

If we then fix the percpu mess, with:

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index e9202a0de8f0..1a19d11cfbbd 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,22 +185,22 @@ do {                                                    \
        typeof(var) pfo_ret__;                          \
        switch (sizeof(var)) {                          \
        case 1:                                         \
-               asm(op "b "__percpu_arg(1)",%0"         \
+               asm volatile(op "b "__percpu_arg(1)",%0"\
                    : "=q" (pfo_ret__)                  \
                    : "m" (var));                       \
                break;                                  \
        case 2:                                         \
-               asm(op "w "__percpu_arg(1)",%0"         \
+               asm volatile(op "w "__percpu_arg(1)",%0"\
                    : "=r" (pfo_ret__)                  \
                    : "m" (var));                       \
                break;                                  \
        case 4:                                         \
-               asm(op "l "__percpu_arg(1)",%0"         \
+               asm volatile(op "l "__percpu_arg(1)",%0"\
                    : "=r" (pfo_ret__)                  \
                    : "m" (var));                       \
                break;                                  \
        case 8:                                         \
-               asm(op "q "__percpu_arg(1)",%0"         \
+
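The percpu.h hunk is the part aimed at the reported miscompile. A reduced userspace sketch of the gcc behaviour involved (my own illustration, not the kernel macro): an asm() that has outputs and lacks volatile is treated as a pure function of its inputs, so gcc may combine two instances with identical operands, which is precisely how the two sequence reads became one and the loop vanished. Build with gcc -O2 -S and compare the two loops:

/* asm_cse.c: shows gcc merging identical non-volatile asm statements. */
static int seq;

static inline int read_plain(void)
{
        int ret;
        /* no volatile: gcc may CSE two of these with the same input */
        __asm__("movl %1, %0" : "=r" (ret) : "m" (seq));
        return ret;
}

static inline int read_volatile(void)
{
        int ret;
        /* volatile: each instance must be emitted where it stands */
        __asm__ __volatile__("movl %1, %0" : "=r" (ret) : "m" (seq));
        return ret;
}

int wait_plain(void)
{
        int s;
        do {
                s = read_plain();
        } while (s != read_plain());    /* reads may fold; loop can vanish */
        return s;
}

int wait_volatile(void)
{
        int s;
        do {
                s = read_volatile();
        } while (s != read_volatile()); /* both loads survive each pass */
        return s;
}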
Re: [PATCH] x86/tsc: use real seqcount_latch in cyc2ns_read_begin()
On Wed, Oct 10, 2018 at 05:33:36PM -0700, Eric Dumazet wrote:
> While looking at native_sched_clock() disassembly I had
> the surprise to see the compiler (gcc 7.3 here) had
> optimized out the loop, meaning the code is broken.
>
> Using the documented and approved API not only fixes the bug,
> it also makes the code more readable.
>
> Replacing five this_cpu_read() by one this_cpu_ptr() makes
> the generated code smaller.

Does not for me, that is, the resulting asm is actually larger.

You're quite right the loop went missing; no idea wth that compiler is
smoking (gcc-8.2 for me). In order to eliminate that loop it needs to
think that two consecutive loads of this_cpu_read(cyc2ns.seq.sequence)
will return the same value. But this_cpu_read() is an asm() statement,
it _should_ not assume such.

We assume that this_cpu_read() implies READ_ONCE() in a number of
locations, this really should not happen.

The reason it was written using this_cpu_read() is so that it can use
%gs: prefixed instructions and avoid ever loading that percpu offset and
doing manual address computation.

Let me prod at this with a sharp stick.
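As a concrete picture of the READ_ONCE() expectation mentioned above (a userspace sketch of my own, not kernel code): READ_ONCE() is at heart a volatile access, and that volatility is what forbids the compiler from proving a retry loop dead. gcc -O2 -S shows the difference between the two readers:

/* read_once.c: plain vs volatile loads in a seqcount-style retry loop. */
#define READ_ONCE_INT(x)        (*(volatile int *)&(x))

int sequence;
int payload;

/* Plain loads: nothing tells gcc that 'sequence' can change underneath
 * us, so it may read it once, fold the comparison, and delete the
 * retry loop entirely. */
int reader_plain(void)
{
        int seq, data;
        do {
                seq = sequence;
                data = payload;
        } while (seq != sequence);
        return data;
}

/* Volatile loads: each READ_ONCE_INT() is a fresh load, so the loop
 * survives (real SMP code still needs ordering barriers on top). */
int reader_once(void)
{
        int seq, data;
        do {
                seq = READ_ONCE_INT(sequence);
                data = payload;
        } while (seq != READ_ONCE_INT(sequence));
        return data;
}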