[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #8 from Adam Warner <adam at consulting dot net.nz> 2011-03-04 10:51:01 UTC ---
Jakub, I fail to see how your conclusion not to do this is supported by the facts. There are:

(a) six global register variables (though the same effect can be observed with one global register variable and -ffixed-rbx -ffixed-r12 -ffixed-r13 -ffixed-r14 -ffixed-r15)
(b) six function arguments
(c) one stack pointer

Therefore three registers remain free: %rax, %r10 and %r11. Only one free register is required to generate the optimal code. GCC 4.5 can do this. GCC 4.6 can't.

The fact that GCC outputs the assembly sequence mov %rdi,%r10; mov %r10,%rdi is evidence of a bizarre cascade of bugs. Even rudimentary peephole optimisation could elide that assembly sequence.

Are you able to explain why GCC outputs save/restore code for a register that is never modified? %rdi remains unmodified.

This has nothing to do with "a compiler has much more limited choices in generating close to optimal code". The compiler has the choice to use %rax, %r10 or %r11 to store the address to jump to without spilling. There is no register pressure in this example. One register is required. Three are available.
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-03-04 11:22:51 UTC ---
You are talking about this single testcase; I'm talking in general. GCC on x86_64 is tuned for a medium-sized general purpose register file, and if you suddenly turn it into a very limited general purpose register file, you can get non-optimal code. Such bug reports are definitely much lower priority than what you get with the common case where no global register vars are used, or at most one or two.

The weird saving/restoring of %rdi into/from %r10 is because the RA chose to use %rdi for a temporary used in incrementing of REG7 and loading the next pointer from it, while postreload managed to remove all needs for such a temporary register, it is too late for the save/restore code not to be emitted.
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #10 from Adam Warner <adam at consulting dot net.nz> 2011-03-05 02:01:04 UTC ---
Jakub,

Thanks for the explanation:

[The weird saving/restoring of %rdi into/from %r10 is because the RA chose to use %rdi for a temporary used in incrementing of REG7 and loading the next pointer from it, while postreload managed to remove all needs for such a temporary register, it is too late for the save/restore code not to be emitted.]

I've replaced the memory lookup and REG7 increment with equivalent inline assembly to help clarify this explanation. With one remaining source code variable (next of type fn_t) and everything else opaque assembly, the code generation is worse.

#include <stdint.h>

/* Six caller-saved registers as input arguments */
#define CALLER_SAVED uint64_t REG0, uint64_t REG1, uint64_t REG2, \
                     uint64_t REG3, uint64_t REG4, uint64_t REG5

typedef void (*fn_t)(CALLER_SAVED);

/* Six callee-saved registers as global register variables */
register uint64_t REG6 __asm__("rbx");
register fn_t *REG7 __asm__("rbp");
register uint64_t REG8 __asm__("r12");
register uint64_t REG9 __asm__("r13");
register uint64_t REG10 __asm__("r14");
register uint64_t REG11 __asm__("r15");

/* Free general purpose registers are RSP, RAX, R10 and R11 */

void optimal_code_generation(CALLER_SAVED) {
  fn_t next=REG7[1];
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied(CALLER_SAVED) {
  fn_t next=REG7[1];
  ++REG7;
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied_alt(CALLER_SAVED) {
  fn_t next=REG7[1];
  __asm__("add $8, %0" : "+r" (REG7));
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied_alt2(CALLER_SAVED) {
  fn_t next;
  __asm__("mov 0x8(%[from]), %[to]" : [to] "=a" (next) : [from] "r" (REG7));
  __asm__("add $8, %0" : "+r" (REG7));
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

int main() {
  return 0;
}

$ gcc-4.6 -O3 unmodified_ordinary_register_is_copied_with_pure_asm.c && objdump -d -m i386:x86-64 a.out|less

00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
  4004a6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4004ad:       00 00 00

00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       49 89 fa                mov    %rdi,%r10
  4004b3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b7:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax

00000000004004c0 <unmodified_input_arg_is_copied_alt>:
  4004c0:       49 89 fa                mov    %rdi,%r10
  4004c3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004c7:       4c 89 d7                mov    %r10,%rdi
  4004ca:       48 83 c5 08             add    $0x8,%rbp
  4004ce:       ff e0                   jmpq   *%rax

00000000004004d0 <unmodified_input_arg_is_copied_alt2>:
  4004d0:       49 89 fa                mov    %rdi,%r10
  4004d3:       48 89 f7                mov    %rsi,%rdi
  4004d6:       48 89 d6                mov    %rdx,%rsi
  4004d9:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004dd:       48 89 f2                mov    %rsi,%rdx
  4004e0:       48 89 fe                mov    %rdi,%rsi
  4004e3:       4c 89 d7                mov    %r10,%rdi
  4004e6:       48 83 c5 08             add    $0x8,%rbp
  4004ea:       ff e0                   jmpq   *%rax

unmodified_input_arg_is_copied_alt2() specifies a variable next of type fn_t.

The first assembly statement

  __asm__("mov 0x8(%[from]), %[to]" : [to] "=a" (next) : [from] "r" (REG7));

directly translates to mov 0x8(%rbp),%rax. Note the use of the "=a" machine constraint to force use of the free %rax register.

The second assembly statement

  __asm__("add $8, %0" : "+r" (REG7));

directly translates to add $0x8,%rbp. This is in-place register mutation, which does not require a temporary for incrementing.

While I suspected I might be able to work around the spurious saving/restoring of unmodified registers with inline assembly, the results are far worse.

mov %rdi,%r10; mov %rsi,%rdi; mov %rdx,%rsi is maximally serialized. One cannot move %rdx into %rsi until %rsi is moved into %rdi. But one cannot move %rsi into %rdi until %rdi is moved into %r10. Restoring the unmodified registers is also maximally serialized.
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #6 from Adam Warner <adam at consulting dot net.nz> 2011-03-04 07:22:47 UTC ---
Below is a very simple test case of an ordinary input argument to a function being:

(a) copied to a spare register
(b) copied back from a spare register

when the input argument is:

(a) never modified; and
(b) an ordinary register (not a global register variable)

unmodified_ordinary_register_is_copied.c:

#include <stdint.h>

/* Six caller-saved registers as input arguments */
#define CALLER_SAVED uint64_t REG0, uint64_t REG1, uint64_t REG2, \
                     uint64_t REG3, uint64_t REG4, uint64_t REG5

typedef void (*fn_t)(CALLER_SAVED);

/* Six callee-saved registers as global register variables */
register uint64_t REG6 __asm__("rbx");
register fn_t *REG7 __asm__("rbp");
register uint64_t REG8 __asm__("r12");
register uint64_t REG9 __asm__("r13");
register uint64_t REG10 __asm__("r14");
register uint64_t REG11 __asm__("r15");

/* Free general purpose registers are RSP, RAX, R10 and R11 */

void optimal_code_generation(CALLER_SAVED) {
  fn_t next=REG7[1];
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied(CALLER_SAVED) {
  fn_t next=REG7[1];
  ++REG7;
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

int main() {
  return 0;
}

gcc-4.5 generates optimal code for both functions:

$ gcc-4.5 -O3 unmodified_ordinary_register_is_copied.c && objdump -d -m i386:x86-64 a.out|less
...
00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
...
00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b4:       48 83 c5 08             add    $0x8,%rbp
  4004b8:       ff e0                   jmpq   *%rax
...

Compare with GCC 4.6:

$ gcc-4.6 --version
gcc-4.6 (Debian 4.6-20110227-1) 4.6.0 20110227 (experimental) [trunk revision 170543]
...

$ gcc-4.6 -O3 unmodified_ordinary_register_is_copied.c && objdump -d -m i386:x86-64 a.out|less
...
00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
...
00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       49 89 fa                mov    %rdi,%r10
  4004b3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b7:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax
...

According to the Linux x86-64 ABI, %rdi holds the first argument passed to a function. For some reason this is being copied to %r10 before being copied back from %r10 to %rdi. At no stage is %rdi modified.

(Minor aside: lea 0x8(%rbp),%rbp has also replaced add $0x8,%rbp. My Intel Core 2 hardware can execute a maximum of one LEA instruction per clock cycle compared to three ADD instructions per clock cycle. If I add -march=core2 -mtune=core2 the code generation becomes:

00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b4:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004b8:       49 89 fa                mov    %rdi,%r10
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax
)

This bizarre register copying goes away if I comment out one of the six global register variables (i.e. five callee-saved global register variables instead of six). For some reason GCC 4.6 cannot generate sensible code with %rsp, %rax, %r10 and %r11 available, but can generate sensible code when an additional register (%rbx, %r12, %r13, %r14 or %r15) is available.
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-03-04 07:46:11 UTC ---
Using 6 global register variables is clearly self-inflicted pain, even on x86_64, because if you take 6 registers away and another 6 registers are used for parameter passing, you make the target very limited on number of registers and the compiler has much more limited choices in generating close to optimal code. Just don't do this.
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--- Comment #4 from pinskia at gcc dot gnu dot org 2010-09-12 14:11 ---
> This is caused by revision 160124:

Not really, it is a noreturn function so the behavior is correct for our policy of allowing a more correct backtrace for noreturn functions.

The testcase is an incorrect one based on size and not really that interesting anymore with respect to global register variables.

--

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--- Comment #5 from adam at consulting dot net dot nz 2010-09-13 00:24 ---
Andrew Pinski wrote:
> > This is caused by revision 160124:
> Not really, it is a noreturn function so the behavior is correct for our
> policy of allowing a more correct backtrace for noreturn functions.

I'm not sure what you're trying to say here, Andrew. Are you trying to justify -O3 generating slower code to simplify debugging?

> The testcase is an incorrect one based on size

If you mean zero-extension of 32-bit function pointers, this is the x86-64 small code model. If you mean that you don't care that the testcase increased in size without further benchmarking, then empirical analysis is actually unnecessary. The generated assembly is clearly worse.

> and not really that interesting anymore with respect to global register
> variables.

It's another example of global register variables being copied for no good reason whatsoever. RAX is free and the obvious translation of uint32_t next = Iptr[1]; to x86-64 assembly is mov eax,DWORD PTR [rbp+0x4]; (Intel syntax, where RBP is the global register variable). Generating mov rax,rbp; mov eax,DWORD PTR [rax+0x4]; is just dumb.

I've been experimenting with optimal forms of virtual machine dispatch for a long time and what you have is a fragment of a very fast direct threaded interpreter. So fast in fact that a type-safe countdown will execute at 5 cycles per iteration on Intel Core 2:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

#define FUNC(x) ((inst_t) (uint64_t) x)
#define INST(x) ((uint32_t) (uint64_t) x)

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    FUNC(next)(types, a, b);
  } else dec_helper(types, a, b);
}

__attribute__ ((noinline)) void if_not_equal_jump_back_1_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void if_not_equal_jump_back_1(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFFFF) == 0x0101)) {
    if (LIKELY(a != b)) {
      uint32_t next = Iptr[-1];
      --Iptr;
      FUNC(next)(types, a, b);
    } else {
      uint32_t next = Iptr[1];
      ++Iptr;
      FUNC(next)(types, a, b);
    }
  } else if_not_equal_jump_back_1_helper(types, a, b);
}

void unconditional_exit(uint64_t types, uint64_t a, uint64_t b) {
  exit(0);
}

__attribute__ ((noinline, noclone)) void execute(uint32_t *code, uint64_t types, uint64_t a, uint64_t b) {
  Iptr = code;
  FUNC(code[0])(types, a, b);
}

int main() {
  uint32_t code[]={INST(dec),
                   INST(if_not_equal_jump_back_1),
                   INST(unconditional_exit)};
  execute(code + 1, 0x0101, 3000000000, 0);
  return 0;
}

$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m5.007s
user    0m4.996s
sys     0m0.004s

CPU is 3GHz. Code execution starts at the second instruction (if_not_equal_jump_back_1). a==3000000000 of type==1 is not equal to b==0 of type==1 (the two type comparisons are performed in parallel in one cycle without masking, since one can compare the low 8, 16 or 32 bits of a 64-bit register without masking and the two types are packed into the low 16 bits of the types register). As a!=b the code jumps back to the dec instruction, which performs another type check that a is of type==1 before decrementing a and jumping to if_not_equal_jump_back_1. This continues until a==0 and program exit occurs.

While the generated assembly of GCC snapshot speaks for itself, here's some empirical evidence of its inferiority:

$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m10.014s
user    0m10.009s
sys     0m0.000s

GCC snapshot has doubled the execution time of this virtual machine example (compared to gcc-4.3, gcc-4.4 and gcc-4.5).

--

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--- Comment #2 from adam at consulting dot net dot nz 2010-09-11 11:15 ---
GCC snapshot has regressed compared to gcc-4.5:

#include <assert.h>
#include <stdint.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    ((inst_t) (uint64_t) next)(types, a, b);
  } else dec_helper(types, a, b);
}

int main() {
  return 0;
}

$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400520 <dec>:
  400520:       40 80 ff 01             cmp    dil,0x1
  400524:       75 0d                   jne    400533 <dec+0x13>
  400526:       8b 45 04                mov    eax,DWORD PTR [rbp+0x4]
  400529:       48 83 ee 01             sub    rsi,0x1
  40052d:       48 83 c5 04             add    rbp,0x4
  400531:       ff e0                   jmp    rax
  400533:       e9 c8 ff ff ff          jmp    400500 <dec_helper>
  400538:       eb 06                   jmp    400540 <main>
  40053a:       90                      nop
  40053b:       90                      nop
  40053c:       90                      nop
  40053d:       90                      nop
  40053e:       90                      nop
  40053f:       90                      nop

The above code generation is fine.

Here is what GCC snapshot {gcc (Debian 20100828-1) 4.6.0 20100828 (experimental) [trunk revision 163616]} generates:

$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400500 <dec>:
  400500:       48 83 ec 08             sub    rsp,0x8
  400504:       40 80 ff 01             cmp    dil,0x1
  400508:       75 14                   jne    40051e <dec+0x1e>
  40050a:       48 89 e8                mov    rax,rbp
  40050d:       48 83 ee 01             sub    rsi,0x1
  400511:       48 8d 6d 04             lea    rbp,[rbp+0x4]
  400515:       8b 40 04                mov    eax,DWORD PTR [rax+0x4]
  400518:       48 83 c4 08             add    rsp,0x8
  40051c:       ff e0                   jmp    rax
  40051e:       e8 bd ff ff ff          call   4004e0 <dec_helper>
  400523:       eb 0b                   jmp    400530 <main>
  400525:       90                      nop
  400526:       90                      nop
  400527:       90                      nop
  400528:       90                      nop
  400529:       90                      nop
  40052a:       90                      nop
  40052b:       90                      nop
  40052c:       90                      nop
  40052d:       90                      nop
  40052e:       90                      nop
  40052f:       90                      nop

* Function size has jumped from rounded up to 32 bytes to rounded up to 48 bytes.

* Tail call has been missed, leading to insertion of stack alignment instructions.

* Global register variable RBP is copied into RAX for no reason whatsoever, subverting loading the next instruction before recomputing the instruction pointer.

--

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--- Comment #3 from hjl dot tools at gmail dot com 2010-09-11 13:49 ---
(In reply to comment #2)
> GCC snapshot has regressed compared to gcc-4.5:
>
> #include <assert.h>
> #include <stdint.h>
>
> #define LIKELY(x)   __builtin_expect(!!(x), 1)
> #define UNLIKELY(x) __builtin_expect(!!(x), 0)
>
> register uint32_t *Iptr __asm__("rbp");
>
> typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);
>
> __attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t b) {
>   assert("FIXME"=="");
> }
>
> void dec(uint64_t types, uint64_t a, uint64_t b) {
>   if (LIKELY((types & 0xFF) == 1)) {
>     uint32_t next = Iptr[1];
>     --a;
>     ++Iptr;
>     ((inst_t) (uint64_t) next)(types, a, b);
>   } else dec_helper(types, a, b);
> }

This is caused by revision 160124:

http://gcc.gnu.org/ml/gcc-cvs/2010-06/msg00036.html

--

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|regression                  |rtl-optimization
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2010-07-20 22:52:46
               date|                            |
            Summary|Global Register variable    |[4.3/4.4/4.5/4.6 Regression]
                   |pessimisation and regression|Global Register variable
                   |                            |pessimisation

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
[Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
--

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |4.3.6

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281