Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan wrote: > > This is Broadwell Xeon E5-2620 v4. > Which is somewhat strange indeed because it should be modern enough. Yeah, odd. Here's the benchmark I used: #define SIZE 4068 int main(int argc, char **argv) { int i; unsigned char buffer[SIZE], *p; for (i = 0; i < 100; i++) asm volatile( "1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b" : [mem] "=r" (p) : [zero] "i" (0l), [eight] "i" (8l), "0" (buffer), [count] "r" (SIZE/8)); } where you can change that "i" for [zero] and [eight] to be "r" to get the register version. I just timed it, because I'm lazy and perf seemed to be overkill. It might be some very specific loop buffer issue or something. Or maybe my benchmark above is broken, I didn't really verify that the end result was any good (I just did an objdump to verify the asm code superficially). Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 4:20 PM Alexey Dobriyan wrote: > > This is Broadwell Xeon E5-2620 v4. > Which is somewhat strange indeed because it should be modern enough. Yeah, odd. Here's the benchmark I used: #define SIZE 4068 int main(int argc, char **argv) { int i; unsigned char buffer[SIZE], *p; for (i = 0; i < 100; i++) asm volatile( "1: movq %[zero],(%[mem]); addq %[eight],%[mem]; decl %[count]; jne 1b" : [mem] "=r" (p) : [zero] "i" (0l), [eight] "i" (8l), "0" (buffer), [count] "r" (SIZE/8)); } where you can change that "i" for [zero] and [eight] to be "r" to get the register version. I just timed it, because I'm lazy and perf seemed to be overkill. It might be some very specific loop buffer issue or something. Or maybe my benchmark above is broken, I didn't really verify that the end result was any good (I just did an objdump to verify the asm code superficially). Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 04:04:37PM -0700, Linus Torvalds wrote: > On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds > wrote: > > > > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > > > > > On my potato performance increase is 33%, sheesh. > > > And CPU starts doing 3 instructions per cycle vs 2. > > > > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it > > makes absolutely no difference whether the values are integers or in > > registers. > > In fact, looking at Agner Fog's instruction lists, I don't see any CPU > where it would make a difference, except for the P4 (where the > immediate looks like it's a bad idea because it's an extra uop, but it > might pack fine and not be noticeable). > > But maybe I'm missing something subtle. What CPU, out of morbid interest? This is Broadwell Xeon E5-2620 v4. Which is somewhat strange indeed because it should be modern enough.
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 04:04:37PM -0700, Linus Torvalds wrote: > On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds > wrote: > > > > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > > > > > On my potato performance increase is 33%, sheesh. > > > And CPU starts doing 3 instructions per cycle vs 2. > > > > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it > > makes absolutely no difference whether the values are integers or in > > registers. > > In fact, looking at Agner Fog's instruction lists, I don't see any CPU > where it would make a difference, except for the P4 (where the > immediate looks like it's a bad idea because it's an extra uop, but it > might pack fine and not be noticeable). > > But maybe I'm missing something subtle. What CPU, out of morbid interest? This is Broadwell Xeon E5-2620 v4. Which is somewhat strange indeed because it should be modern enough.
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds wrote: > > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > > > On my potato performance increase is 33%, sheesh. > > And CPU starts doing 3 instructions per cycle vs 2. > > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it > makes absolutely no difference whether the values are integers or in > registers. In fact, looking at Agner Fog's instruction lists, I don't see any CPU where it would make a difference, except for the P4 (where the immediate looks like it's a bad idea because it's an extra uop, but it might pack fine and not be noticeable). But maybe I'm missing something subtle. What CPU, out of morbid interest? Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 4:01 PM Linus Torvalds wrote: > > On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > > > On my potato performance increase is 33%, sheesh. > > And CPU starts doing 3 instructions per cycle vs 2. > > Whee. That's a shockingly big difference. On my CPU (i7-6700K) it > makes absolutely no difference whether the values are integers or in > registers. In fact, looking at Agner Fog's instruction lists, I don't see any CPU where it would make a difference, except for the P4 (where the immediate looks like it's a bad idea because it's an extra uop, but it might pack fine and not be noticeable). But maybe I'm missing something subtle. What CPU, out of morbid interest? Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > On my potato performance increase is 33%, sheesh. > And CPU starts doing 3 instructions per cycle vs 2. Whee. That's a shockingly big difference. On my CPU (i7-6700K) it makes absolutely no difference whether the values are integers or in registers. Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 3:41 PM Alexey Dobriyan wrote: > > On my potato performance increase is 33%, sheesh. > And CPU starts doing 3 instructions per cycle vs 2. Whee. That's a shockingly big difference. On my CPU (i7-6700K) it makes absolutely no difference whether the values are integers or in registers. Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 10:32:55AM -0700, Linus Torvalds wrote: > On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan wrote: > > > > Tested? :^) I had P4 maybe ~15(?) years ago. > > Did you EVEN test it on what you have today? > > Do you have any numbers at all, in other words? > > Micro-optimizations need numbers. Otherwise they aren't > micro-optimizations, they are just "change code randomly". On my potato performance increase is 33%, sheesh. And CPU starts doing 3 instructions per cycle vs 2. benchmark is "clear_user(p + 4096 - 4068, 4068)" 4068 comes from booting Debian 8 with printk. f0(4068) (old clear_user) $ taskset -c 15 perf stat -r 16 ./a.out Performance counter stats for './a.out' (16 runs): 2033.189084 task-clock (msec) #1.000 CPUs utilized ( +- 0.41% ) 2 context-switches #0.001 K/sec ( +- 11.11% ) 0 cpu-migrations#0.000 K/sec 46 page-faults #0.023 K/sec ( +- 0.91% ) 4,268,425,486 cycles#2.099 GHz ( +- 0.41% ) 8,672,326,256 instructions #2.03 insn per cycle ( +- 0.00% ) 2,169,900,710 branches # 1067.240 M/sec ( +- 0.00% ) 4,226,258 branch-misses #0.19% of all branches ( +- 0.01% ) 2.033700109 seconds time elapsed ( +- 0.41% ) f1(4068) (new clear_user) $ taskset -c 15 perf stat -r 16 ./a.out Performance counter stats for './a.out' (16 runs): 1345.149992 task-clock (msec) #1.000 CPUs utilized ( +- 0.01% ) 2 context-switches #0.002 K/sec ( +- 8.35% ) 0 cpu-migrations#0.000 K/sec 46 page-faults #0.034 K/sec ( +- 0.82% ) 2,823,965,728 cycles#2.099 GHz ( +- 0.01% ) 8,661,733,733 instructions #3.07 insn per cycle ( +- 0.00% ) 2,169,437,410 branches # 1612.785 M/sec ( +- 0.00% ) 4,216,469 branch-misses #0.19% of all branches ( +- 0.01% ) 1.345375114 seconds time elapsed ( +- 0.01% ) - CFLAGS = -Wall -fno-strict-aliasing -fno-common -fshort-wchar -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -funit-at-a-time -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 -fno-stack-protector -fomit-frame-pointer -fno-var-tracking-assignments -g -femit-struct-debug-baseonly -fno-var-tracking -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack 0780 : 780: movrax,rsi 783: movrcx,rsi 786: xoredx,edx 788: andeax,0x7 78b: shrrcx,0x3 78f: movesi,0x8 794: test rcx,rcx 797: je 7a3 799: movQWORD PTR [rdi],rdx 79c: addrdi,rsi 79f: dececx 7a1: jne799 7a3: movrcx,rax 7a6: test ecx,ecx 7a8: je 7b3 7aa: movBYTE PTR [rdi],dl 7ac: incrdi 7af: dececx 7b1: jne7aa 7b3: movrax,rcx 7b6: ret 07c0 : 7c0: movrax,rsi 7c3: shrrsi,0x3 7c7: andeax,0x7 7ca: movrcx,rsi 7cd: test rcx,rcx 7d0: je 7e1 7d2: movQWORD PTR [rdi],0x0 7d9: addrdi,0x8 7dd: dececx 7df: jne7d2 7e1: movrcx,rax 7e4: test ecx,ecx 7e6: je 7f2 7e8: movBYTE PTR [rdi],0x0 7eb: incrdi 7ee: dececx 7f0: jne7e8 7f2: movrax,rcx 7f5: ret
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 10:32:55AM -0700, Linus Torvalds wrote: > On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan wrote: > > > > Tested? :^) I had P4 maybe ~15(?) years ago. > > Did you EVEN test it on what you have today? > > Do you have any numbers at all, in other words? > > Micro-optimizations need numbers. Otherwise they aren't > micro-optimizations, they are just "change code randomly". On my potato performance increase is 33%, sheesh. And CPU starts doing 3 instructions per cycle vs 2. benchmark is "clear_user(p + 4096 - 4068, 4068)" 4068 comes from booting Debian 8 with printk. f0(4068) (old clear_user) $ taskset -c 15 perf stat -r 16 ./a.out Performance counter stats for './a.out' (16 runs): 2033.189084 task-clock (msec) #1.000 CPUs utilized ( +- 0.41% ) 2 context-switches #0.001 K/sec ( +- 11.11% ) 0 cpu-migrations#0.000 K/sec 46 page-faults #0.023 K/sec ( +- 0.91% ) 4,268,425,486 cycles#2.099 GHz ( +- 0.41% ) 8,672,326,256 instructions #2.03 insn per cycle ( +- 0.00% ) 2,169,900,710 branches # 1067.240 M/sec ( +- 0.00% ) 4,226,258 branch-misses #0.19% of all branches ( +- 0.01% ) 2.033700109 seconds time elapsed ( +- 0.41% ) f1(4068) (new clear_user) $ taskset -c 15 perf stat -r 16 ./a.out Performance counter stats for './a.out' (16 runs): 1345.149992 task-clock (msec) #1.000 CPUs utilized ( +- 0.01% ) 2 context-switches #0.002 K/sec ( +- 8.35% ) 0 cpu-migrations#0.000 K/sec 46 page-faults #0.034 K/sec ( +- 0.82% ) 2,823,965,728 cycles#2.099 GHz ( +- 0.01% ) 8,661,733,733 instructions #3.07 insn per cycle ( +- 0.00% ) 2,169,437,410 branches # 1612.785 M/sec ( +- 0.00% ) 4,216,469 branch-misses #0.19% of all branches ( +- 0.01% ) 1.345375114 seconds time elapsed ( +- 0.01% ) - CFLAGS = -Wall -fno-strict-aliasing -fno-common -fshort-wchar -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -funit-at-a-time -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 -fno-stack-protector -fomit-frame-pointer -fno-var-tracking-assignments -g -femit-struct-debug-baseonly -fno-var-tracking -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack 0780 : 780: movrax,rsi 783: movrcx,rsi 786: xoredx,edx 788: andeax,0x7 78b: shrrcx,0x3 78f: movesi,0x8 794: test rcx,rcx 797: je 7a3 799: movQWORD PTR [rdi],rdx 79c: addrdi,rsi 79f: dececx 7a1: jne799 7a3: movrcx,rax 7a6: test ecx,ecx 7a8: je 7b3 7aa: movBYTE PTR [rdi],dl 7ac: incrdi 7af: dececx 7b1: jne7aa 7b3: movrax,rcx 7b6: ret 07c0 : 7c0: movrax,rsi 7c3: shrrsi,0x3 7c7: andeax,0x7 7ca: movrcx,rsi 7cd: test rcx,rcx 7d0: je 7e1 7d2: movQWORD PTR [rdi],0x0 7d9: addrdi,0x8 7dd: dececx 7df: jne7d2 7e1: movrcx,rax 7e4: test ecx,ecx 7e6: je 7f2 7e8: movBYTE PTR [rdi],0x0 7eb: incrdi 7ee: dececx 7f0: jne7e8 7f2: movrax,rcx 7f5: ret
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan wrote: > > Tested? :^) I had P4 maybe ~15(?) years ago. Did you EVEN test it on what you have today? Do you have any numbers at all, in other words? Micro-optimizations need numbers. Otherwise they aren't micro-optimizations, they are just "change code randomly". Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 10:22 AM Alexey Dobriyan wrote: > > Tested? :^) I had P4 maybe ~15(?) years ago. Did you EVEN test it on what you have today? Do you have any numbers at all, in other words? Micro-optimizations need numbers. Otherwise they aren't micro-optimizations, they are just "change code randomly". Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 05:05:14PM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar wrote: > > > > > > - __clear_user() micro-optimization (Alexey Dobriyan) > > > > Was this actually tested? > > I'm not sure - Alexey? > > > I think one reason people avoided the constant was that on some > > microarchitecture it ended up being a separate uop just for the > > constant generation, because it wouldn't fit in a single uop. > Ok, fair point and agreed - if Alexey sends some measurements to back the > change > I'll keep this, otherwise queue up a revert. Tested? :^) I had P4 maybe ~15(?) years ago. godbolt.org earliest compiler is 4.1.2 and it generates "movb [r32], imm8" with "-m32 -O2 -march=pentium4" for simple memset-style loop if it counts for something. Actually I think __clear_user should be rewritten in C with assembly. It's biggest user is probably ELF loader and those partial page .bss clears should be noticeable.
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 05, 2018 at 05:05:14PM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar wrote: > > > > > > - __clear_user() micro-optimization (Alexey Dobriyan) > > > > Was this actually tested? > > I'm not sure - Alexey? > > > I think one reason people avoided the constant was that on some > > microarchitecture it ended up being a separate uop just for the > > constant generation, because it wouldn't fit in a single uop. > Ok, fair point and agreed - if Alexey sends some measurements to back the > change > I'll keep this, otherwise queue up a revert. Tested? :^) I had P4 maybe ~15(?) years ago. godbolt.org earliest compiler is 4.1.2 and it generates "movb [r32], imm8" with "-m32 -O2 -march=pentium4" for simple memset-style loop if it counts for something. Actually I think __clear_user should be rewritten in C with assembly. It's biggest user is probably ELF loader and those partial page .bss clears should be noticeable.
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 8:05 AM Ingo Molnar wrote: > > Ok, fair point and agreed - if Alexey sends some measurements to back the > change > I'll keep this, otherwise queue up a revert. I don't think it needs to be reverted, it's not like it's likely to hurt on any modern CPU's. The issues I talked about are fairly historical - barely even 64-bit cpus - and I'm not sure an extra uop to carry a constant around even matters in that code sequence. It was more a generic issue - any micro-optimization should be based on numbers (and there should be some numbers in the commit message), not on "this should be faster". Because while intuitively immediates _should_ be faster than registers, that's simply not always "obviously true". It _may_ be true. But numbers talk. Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
On Tue, Jun 5, 2018 at 8:05 AM Ingo Molnar wrote: > > Ok, fair point and agreed - if Alexey sends some measurements to back the > change > I'll keep this, otherwise queue up a revert. I don't think it needs to be reverted, it's not like it's likely to hurt on any modern CPU's. The issues I talked about are fairly historical - barely even 64-bit cpus - and I'm not sure an extra uop to carry a constant around even matters in that code sequence. It was more a generic issue - any micro-optimization should be based on numbers (and there should be some numbers in the commit message), not on "this should be faster". Because while intuitively immediates _should_ be faster than registers, that's simply not always "obviously true". It _may_ be true. But numbers talk. Linus
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
* Linus Torvalds wrote: > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar wrote: > > > > - __clear_user() micro-optimization (Alexey Dobriyan) > > Was this actually tested? I'm not sure - Alexey? > I think one reason people avoided the constant was that on some > microarchitecture it ended up being a separate uop just for the > constant generation, because it wouldn't fit in a single uop. > > I'm pretty sure that used to be the case for P4, for example. > > Afaik there have also been issues with decoding instructions that have > both an immediate and a memory offset. > > I suspect none of this is an issue on modern cores, but there really > at least historically were cases where > >mov %reg,mem > > was better than > >mov $imm,mem > > if %reg already had the right value, so it's not at all 100% obvious > that the micro-optimization really _optimizes_ anything. > > Any time people do this, they should add numbers. Ok, fair point and agreed - if Alexey sends some measurements to back the change I'll keep this, otherwise queue up a revert. Thanks, Ingo
Re: x86/asm: __clear_user() micro-optimization (was: "Re: [GIT PULL] x86/asm changes for v4.18")
* Linus Torvalds wrote: > On Mon, Jun 4, 2018 at 5:21 AM Ingo Molnar wrote: > > > > - __clear_user() micro-optimization (Alexey Dobriyan) > > Was this actually tested? I'm not sure - Alexey? > I think one reason people avoided the constant was that on some > microarchitecture it ended up being a separate uop just for the > constant generation, because it wouldn't fit in a single uop. > > I'm pretty sure that used to be the case for P4, for example. > > Afaik there have also been issues with decoding instructions that have > both an immediate and a memory offset. > > I suspect none of this is an issue on modern cores, but there really > at least historically were cases where > >mov %reg,mem > > was better than > >mov $imm,mem > > if %reg already had the right value, so it's not at all 100% obvious > that the micro-optimization really _optimizes_ anything. > > Any time people do this, they should add numbers. Ok, fair point and agreed - if Alexey sends some measurements to back the change I'll keep this, otherwise queue up a revert. Thanks, Ingo