[Bug lto/54231] LTO generates code for the wrong CPU if different options used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Richard Biener changed:

           What    |Removed                     |Added
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |7.2

--- Comment #16 from Richard Biener ---
(In reply to Andrew Pinski from comment #15)
> I suspect this has been fixed since maybe GCC 8 (maybe GCC 7).

The use-case should now indeed work fine by means of recording all
optimization and target options per function and restricting inlining.
I think it was fixed in GCC 7 or even earlier.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #15 from Andrew Pinski ---
I suspect this has been fixed since maybe GCC 8 (maybe GCC 7).
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #14 from Thiago Macieira 2012-09-12 13:02:23 UTC ---
From GCC's own manual (node "Function attributes"):

     On the 386/x86_64 and PowerPC backends, the inliner will not inline a
     function that has different target options than the caller, unless the
     callee has a subset of the target options of the caller.  For example
     a function declared with `target("sse3")' can inline a function with
     `target("sse2")', since `-msse3' implies `-msse2'.
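The subset rule quoted from the manual can be illustrated with a minimal sketch. The function names `sum4_sse2` and `sum8_sse3` are hypothetical and do not come from the bug report; the sketch assumes an x86 GCC recent enough to accept intrinsics inside `target`-attributed functions. Whether the call is actually inlined depends on the optimization level, so the sketch only shows a caller/callee pair for which the rule permits it.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callee: requires SSE2 only. */
__attribute__((target("sse2")))
static inline int32_t sum4_sse2(const int32_t *p)
{
    int32_t lanes[4];
    __m128i v = _mm_loadu_si128((const __m128i *)p);   /* unaligned load */
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical caller: -msse3 implies -msse2, so the callee's target
 * options are a subset of the caller's and the inliner is allowed to
 * inline sum4_sse2 here. With the attributes swapped (an "sse3" callee
 * and an "sse2" caller) the quoted rule says it would not be. */
__attribute__((target("sse3")))
int32_t sum8_sse3(const int32_t *p)
{
    assert(p != NULL);
    return sum4_sse2(p) + sum4_sse2(p + 4);
}
```

Calling `sum8_sse3` on an array of eight `int32_t` values returns their sum.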
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #13 from Thiago Macieira 2012-08-13 12:13:40 UTC ---
(In reply to comment #12)
> Yes, there are similar option-related bugs for this.  Note somebody needs
> to sit down and document the desired semantics of combining translation
> units T1 and T2, compiled with different options OP1 and OP2, at link-time
> with options OP3.  Desired semantics including which cross-file
> optimizations (inlining?) are possible.

From my (admittedly restricted) point of view, inlining should be possible,
provided the following conditions hold:

- when inlining a function with a "lower" optimisation / target setting,
  apply the outer scope's setting to the inlined code

- when inlining a function with a higher target requirement, inlining should
  be done only in the sense of partial function splitting, prologue,
  epilogues, constant propagation, etc.

In the case that I pasted, for example, I'd like GCC to realise that it has
already tested if the counter variable is 0, then forego that test in the
inlined, inner function.

Worst case scenario, simply forego inlining completely. Then the code would
simply be no worse than the non-LTO case.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #12 from Richard Guenther 2012-08-13 11:58:33 UTC ---
(In reply to comment #9)
> (In reply to comment #8)
> > If you do something like
> >
> > gcc -c t1.c -mavx -flto
> > gcc -c t2.c -msse2 -flto
> > gcc t1.o t2.o -flto
> >
> > then the link step will use -mavx -msse2, that is, target options are
> > concatenated.
>
> Indeed.
>
> What I'm asking for is that each source file be compiled with its own target
> options. I realise this is a request for enhancement, though.

Yes, there are similar option-related bugs for this.  Note somebody needs
to sit down and document the desired semantics of combining translation
units T1 and T2, compiled with different options OP1 and OP2, at link-time
with options OP3.  Desired semantics including which cross-file optimizations
(inlining?) are possible.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #11 from Thiago Macieira 2012-08-13 10:12:48 UTC ---
Attaching __attribute__((target("xxx"))) to the function does help. It
generates the following with the my_bzero function from comment 2:

02e0 :
 2e0:   test   %rsi,%rsi
 2e3:   vpxor  %xmm0,%xmm0,%xmm0
 2e7:   je     2fe
 2e9:   nopl   0x0(%rax)
 2f0:   vmovntdq %xmm0,(%rdi)
 2f4:   add    $0x10,%rdi
 2f8:   sub    $0x1,%rsi
 2fc:   jne    2f0
 2fe:   repz retq

0300 :
 300:   mov    0x200171(%rip),%rax        # 200478
 307:   mov    (%rax),%eax
 309:   test   %eax,%eax
 30b:   jne    330
 30d:   test   %rsi,%rsi
 310:   pxor   %xmm0,%xmm0
 314:   je     332
 316:   nopw   %cs:0x0(%rax,%rax,1)
 320:   movntdq %xmm0,(%rdi)
 324:   add    $0x10,%rdi
 328:   sub    $0x1,%rsi
 32c:   jne    320
 32e:   repz retq
 330:   jmp    2e0
 332:   repz retq

This workaround might be useful for me in a few places where the code inlining
provided by LTO was desired (even though, in this example, the AVX variant is
exactly what it would be if no LTO had been used).

But it won't work without major changes to the code if I have 400+ functions
in a file, plus possibly inlines from headers, to be compiled.
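The per-function workaround described in comment #11 might look roughly like the sketch below. Comment 2 is not reproduced in this archive, so the body of `my_bzero` and the run-time check are assumptions; the loop itself follows the `BZERO` test case from comment #10. `__builtin_cpu_supports` is a GCC builtin, and the trailing `_mm_sfence()` and the alignment assertion are additions for illustration, not part of the original report.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* AVX variant: the attribute, not a command-line flag, selects the ISA,
 * so GCC may use VEX-encoded instructions (vpxor, vmovntdq) here only. */
__attribute__((target("avx")))
void bzero_avx(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);   /* 16-byte non-temporal store */
        ptr += 16;
    }
    _mm_sfence();   /* added for illustration: order the streaming stores */
}

/* SSE2 variant: same loop, restricted to SSE2 (pxor, movntdq). */
__attribute__((target("sse2")))
void bzero_sse2(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);
        ptr += 16;
    }
    _mm_sfence();
}

/* Assumed dispatcher: zeroes `count` 16-byte blocks starting at `ptr`,
 * choosing the variant at run time. */
void my_bzero(char *ptr, size_t count)
{
    assert(((uintptr_t)ptr & 15) == 0);   /* streaming stores need 16-byte alignment */
    if (__builtin_cpu_supports("avx"))
        bzero_avx(ptr, count);
    else
        bzero_sse2(ptr, count);
}
```

Because each variant carries its own target options, the sketch does not depend on which `-m` flags reach the LTO link step. As the comment notes, annotating functions one by one does not scale to a file with hundreds of them.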
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #10 from Thiago Macieira 2012-08-13 09:53:32 UTC ---
Another test:

$ cat main_avx.c
#define BZERO bzero_avx
#pragma GCC target ("avx")
#include "main.c"

$ cat main_sse2.c
#define BZERO bzero_sse2
#pragma GCC target ("sse2")
#include "main.c"

$ cat main.c
#include 
void BZERO(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i*)ptr, zero);
        ptr += 16;
    }
}

$ gcc -flto -O2 -shared -o libtest.so main_avx.c main_sse2.c
$ objdump -Cdr --no-show-raw-insn libtest.so
[...]
0650 :
 650:   test   %rsi,%rsi
 653:   pxor   %xmm0,%xmm0
 657:   je     66e
 659:   nopl   0x0(%rax)
 660:   movntdq %xmm0,(%rdi)
 664:   add    $0x10,%rdi
 668:   sub    $0x1,%rsi
 66c:   jne    660
 66e:   repz retq

0670 :
 670:   test   %rsi,%rsi
 673:   pxor   %xmm0,%xmm0
 677:   je     68e
 679:   nopl   0x0(%rax)
 680:   movntdq %xmm0,(%rdi)
 684:   add    $0x10,%rdi
 688:   sub    $0x1,%rsi
 68c:   jne    680
 68e:   repz retq
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #9 from Thiago Macieira 2012-08-13 09:44:51 UTC ---
(In reply to comment #8)
> If you do something like
>
> gcc -c t1.c -mavx -flto
> gcc -c t2.c -msse2 -flto
> gcc t1.o t2.o -flto
>
> then the link step will use -mavx -msse2, that is, target options are
> concatenated.

Indeed.

What I'm asking for is that each source file be compiled with its own target
options. I realise this is a request for enhancement, though.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #8 from Richard Guenther 2012-08-13 08:59:18 UTC ---
If you do something like

gcc -c t1.c -mavx -flto
gcc -c t2.c -msse2 -flto
gcc t1.o t2.o -flto

then the link step will use -mavx -msse2, that is, target options are
concatenated.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Steven Bosscher changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-08-12
                 CC|                            |uros at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #7 from Steven Bosscher 2012-08-12 00:27:46 UTC ---
Actually, using the builtins also doesn't work. The instruction patterns are
the same and GCC recog's the "best" available one. E.g.:

#(insn:TI 14 12 27 3 (set (reg:V2DI 21 xmm0 [66])
#        (const_vector:V2DI [
#                (const_int 0 [0])
#                (const_int 0 [0])
#            ])) /home/stevenb/devel/build-test/gcc/include/emmintrin.h:1424 {*avx_movv2di_internal}
#     (expr_list:REG_EQUIV (const_vector:V2DI [
#                (const_int 0 [0])
#                (const_int 0 [0])
#            ])
#        (nil)))
        vpxor   %xmm0, %xmm0, %xmm0     # 14    *avx_movv2di_internal/1 [length = 4]

vs.

#(insn:TI 14 12 27 3 (set (reg:V2DI 21 xmm0 [66])
#        (const_vector:V2DI [
#                (const_int 0 [0])
#                (const_int 0 [0])
#            ])) /home/stevenb/devel/build-test/gcc/include/emmintrin.h:1424 1124 {*movv2di_internal}
#     (expr_list:REG_EQUIV (const_vector:V2DI [
#                (const_int 0 [0])
#                (const_int 0 [0])
#            ])
#        (nil)))
        pxor    %xmm0, %xmm0    # 14    *movv2di_internal/1     [length = 4]

These insns just look the same to GCC, so even if the sse2 builtin expander is
used, the AVX instruction is selected.

Thus a bug, confirmed. Adding i386 guy to CC.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #6 from Thiago Macieira 2012-08-11 23:23:39 UTC ---
(In reply to comment #5)
> "Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
> are just wrappers around builtin functions. It may work correctly if you put
> the bzero functions in two separate files or call the builtins directly (a
> variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
> defined, I don't think it will ever work.

They *are* in separate files already.

Calling the builtin directly instead of the intrinsic wrapper might work, but I
did not test it because it's not acceptable, as the code would be GCC-specific.

> Have you considered using ifunc?

IFUNC is also irrelevant: in order to use it, I need to have two separate
source files which are compiled with different compiler settings, so we end up
where we started: the bzero_sse2() function will have AVX code.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #5 from Steven Bosscher 2012-08-11 22:46:31 UTC ---
"Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
are just wrappers around builtin functions. It may work correctly if you put
the bzero functions in two separate files or call the builtins directly (a
variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
defined, I don't think it will ever work.

Have you considered using ifunc?
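For readers unfamiliar with the ifunc suggestion, a minimal sketch follows. It assumes GCC on an x86 ELF target whose C library supports indirect functions (for example GNU/Linux with glibc). All names (`bzero16`, `resolve_bzero16` and the two variants) are hypothetical. The variants get their ISA from `target` attributes rather than from per-file command-line options, which is an assumption of this sketch and not the per-file setup discussed in the report.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

typedef void bzero_fn(char *ptr, size_t count);

/* Hypothetical AVX variant: zeroes `count` 16-byte blocks. */
__attribute__((target("avx")))
static void bzero16_avx(char *ptr, size_t count)
{
    assert(((uintptr_t)ptr & 15) == 0);   /* streaming stores need alignment */
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);
        ptr += 16;
    }
    _mm_sfence();
}

/* Hypothetical SSE2 variant with the same contract. */
__attribute__((target("sse2")))
static void bzero16_sse2(char *ptr, size_t count)
{
    assert(((uintptr_t)ptr & 15) == 0);
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);
        ptr += 16;
    }
    _mm_sfence();
}

/* Resolver: run by the dynamic loader when the symbol is bound.
 * __builtin_cpu_init() must be called first because the resolver can
 * run before the usual constructors have initialised the CPU data. */
static bzero_fn *resolve_bzero16(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? bzero16_avx : bzero16_sse2;
}

/* Public entry point, bound to one of the variants by the resolver. */
void bzero16(char *ptr, size_t count)
    __attribute__((ifunc("resolve_bzero16")));
```

Callers simply call `bzero16(buf, n)`; the choice of variant is made once at symbol resolution instead of on every call.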
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Andrew Pinski changed:

           What    |Removed                     |Added
          Component|c                           |lto
           Severity|normal                      |enhancement

--- Comment #4 from Andrew Pinski 2012-08-11 22:39:48 UTC ---
Basically the target attribute should come into play but that is currently not
really supported even without LTO.