[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2021-09-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |7.2

--- Comment #16 from Richard Biener  ---
(In reply to Andrew Pinski from comment #15)
> I suspect this has been fixed since maybe GCC 8 (maybe GCC 7).

The use-case should now indeed work fine by means of recording all optimization
and target options per function and restricting inlining.  I think it was fixed
in GCC 7 or even earlier.

[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2021-09-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #15 from Andrew Pinski  ---
I suspect this has been fixed since maybe GCC 8 (maybe GCC 7).

[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-09-12 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #14 from Thiago Macieira 2012-09-12 13:02:23 UTC ---
From GCC's own manual:

(Node "Function attributes"):

 On the 386/x86_64 and PowerPC backends, the inliner will not
 inline a function that has different target options than the
 caller, unless the callee has a subset of the target options of
 the caller.  For example a function declared with `target("sse3")'
 can inline a function with `target("sse2")', since `-msse3'
 implies `-msse2'.
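
A minimal sketch of the rule quoted above, assuming an x86 target and a
GCC-compatible compiler. The function names zero16_sse2 and zero_blocks_sse3
are made up for illustration and do not come from the bug report:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Callee: requires only SSE2 (hypothetical helper). */
__attribute__((target("sse2")))
static inline void zero16_sse2(char *p)
{
    /* Unaligned 16-byte store of zeros. */
    _mm_storeu_si128((__m128i *)p, _mm_setzero_si128());
}

/* Caller: SSE3 implies SSE2, so the callee's target options are a
   subset of the caller's and the inliner is allowed to inline it. */
__attribute__((target("sse3")))
void zero_blocks_sse3(char *p, size_t blocks)
{
    while (blocks--) {
        zero16_sse2(p);
        p += 16;
    }
}
```

Swapping the two attributes gives the opposite case: a target("sse2") caller
would not be allowed to inline a target("sse3") callee under the quoted rule.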


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #13 from Thiago Macieira 2012-08-13 12:13:40 UTC ---
(In reply to comment #12)
> Yes, there are similar option-related bugs for this.  Note somebody needs
> to sit down and document the desired semantics of combining translation
> units T1 and T2, compiled with different options OP1 and OP2, at link-time with
> options OP3.  Desired semantics including which cross-file optimizations
> (inlining?) are possible.

From my (admittedly restricted) point of view, inlining should be possible,
provided the following conditions hold:
 - when inlining a function with a "lower" optimisation / target setting, apply
the outer scope's setting to the inlined code
 - when inlining a function with a higher target requirement, inlining should
be done only in the sense of partial function splitting, prologue, epilogues,
constant propagation, etc.

In the case that I pasted, for example, I'd like GCC to realise that it has
already tested if the counter variable is 0, then forego that test in the
inlined, inner function.

Worst case scenario, simply forego inlining completely. Then the code would
simply be no worse than the non-LTO case.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #12 from Richard Guenther 2012-08-13 11:58:33 UTC ---
(In reply to comment #9)
> (In reply to comment #8)
> > If you do something like
> > 
> >  gcc -c t1.c -mavx -flto
> >  gcc -c t2.c -msse2 -flto
> >  gcc t1.o t2.o -flto
> > 
> > then the link step will use -mavx -msse2, that is, target options are
> > concatenated.
> 
> Indeed.
> 
> What I'm asking for is that each source file be compiled with its own target
> options. I realise this is a request for enhancement, though.

Yes, there are similar option-related bugs for this.  Note somebody needs
to sit down and document the desired semantics of combining translation
units T1 and T2, compiled with different options OP1 and OP2, at link-time with
options OP3.  Desired semantics including which cross-file optimizations
(inlining?) are possible.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #11 from Thiago Macieira 2012-08-13 10:12:48 UTC ---
Attaching __attribute__((target("xxx"))) to the function does help.

It generates the following with the my_bzero function from comment 2:

02e0 :
 2e0:   test   %rsi,%rsi
 2e3:   vpxor  %xmm0,%xmm0,%xmm0
 2e7:   je     2fe
 2e9:   nopl   0x0(%rax)
 2f0:   vmovntdq %xmm0,(%rdi)
 2f4:   add    $0x10,%rdi
 2f8:   sub    $0x1,%rsi
 2fc:   jne    2f0
 2fe:   repz retq

0300 :
 300:   mov    0x200171(%rip),%rax        # 200478
 307:   mov    (%rax),%eax
 309:   test   %eax,%eax
 30b:   jne    330
 30d:   test   %rsi,%rsi
 310:   pxor   %xmm0,%xmm0
 314:   je     332
 316:   nopw   %cs:0x0(%rax,%rax,1)
 320:   movntdq %xmm0,(%rdi)
 324:   add    $0x10,%rdi
 328:   sub    $0x1,%rsi
 32c:   jne    320
 32e:   repz retq
 330:   jmp    2e0
 332:   repz retq


This workaround might be useful for me in a few places where the code inlining
provided by LTO was desired (even though, in this example, the AVX variant is
exactly what it would be if no LTO had been used). But it won't work without
major changes to the code if I have 400+ functions in a file, plus possibly
inlines from headers, to be compiled.
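
For illustration, a sketch of the kind of source that yields a listing shaped
like the one above: one AVX variant, one SSE2 variant, and a dispatcher that
branches to the AVX code or runs the SSE2 loop itself. The my_bzero source
from comment 2 is not reproduced in this thread excerpt, so the function names
and the use of __builtin_cpu_supports as the runtime check are assumptions,
not the reporter's code. It assumes x86 and a GCC-compatible compiler:

```c
#include <emmintrin.h>
#include <stddef.h>

/* AVX variant: the target attribute applies to this function only,
   so the rest of the file keeps the command-line target options. */
__attribute__((target("avx")))
void bzero_avx(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {                 /* count = number of 16-byte blocks */
        _mm_stream_si128((__m128i *)ptr, zero);   /* ptr must be 16-byte aligned */
        ptr += 16;
    }
}

/* SSE2 variant: same body, restricted to SSE2 instructions. */
__attribute__((target("sse2")))
void bzero_sse2(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);
        ptr += 16;
    }
}

/* Dispatcher (hypothetical): the original tested a global flag; here the
   compiler builtin stands in for that check. */
void my_bzero(char *ptr, size_t count)
{
    if (__builtin_cpu_supports("avx"))
        bzero_avx(ptr, count);
    else
        bzero_sse2(ptr, count);
}
```

The cost noted above is visible here: every function that needs a
non-default target has to carry its own attribute.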


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #10 from Thiago Macieira 2012-08-13 09:53:32 UTC ---
Another test:

$ cat main_avx.c
#define BZERO bzero_avx
#pragma GCC target ("avx")
#include "main.c"

$ cat main_sse2.c
#define BZERO bzero_sse2
#pragma GCC target ("sse2")
#include "main.c"

$ cat main.c
#include 

void BZERO(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i*)ptr, zero);
        ptr += 16;
    }
}

$ gcc -flto -O2 -shared -o libtest.so main_avx.c main_sse2.c
$ objdump -Cdr --no-show-raw-insn libtest.so
[...]

0650 :
 650:   test   %rsi,%rsi
 653:   pxor   %xmm0,%xmm0
 657:   je     66e
 659:   nopl   0x0(%rax)
 660:   movntdq %xmm0,(%rdi)
 664:   add    $0x10,%rdi
 668:   sub    $0x1,%rsi
 66c:   jne    660
 66e:   repz retq

0670 :
 670:   test   %rsi,%rsi
 673:   pxor   %xmm0,%xmm0
 677:   je     68e
 679:   nopl   0x0(%rax)
 680:   movntdq %xmm0,(%rdi)
 684:   add    $0x10,%rdi
 688:   sub    $0x1,%rsi
 68c:   jne    680
 68e:   repz retq


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #9 from Thiago Macieira 2012-08-13 09:44:51 UTC ---
(In reply to comment #8)
> If you do something like
> 
>  gcc -c t1.c -mavx -flto
>  gcc -c t2.c -msse2 -flto
>  gcc t1.o t2.o -flto
> 
> then the link step will use -mavx -msse2, that is, target options are
> concatenated.

Indeed.

What I'm asking for is that each source file be compiled with its own target
options. I realise this is a request for enhancement, though.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #8 from Richard Guenther 2012-08-13 08:59:18 UTC ---
If you do something like

 gcc -c t1.c -mavx -flto
 gcc -c t2.c -msse2 -flto
 gcc t1.o t2.o -flto

then the link step will use -mavx -msse2, that is, target options are
concatenated.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Steven Bosscher  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-08-12
                 CC|                            |uros at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #7 from Steven Bosscher 2012-08-12 00:27:46 UTC ---
Actually, using the builtins also doesn't work. The instruction patterns are
the same and GCC recog's the "best" available one. E.g.:

#(insn:TI 14 12 27 3 (set (reg:V2DI 21 xmm0 [66])
#(const_vector:V2DI [
#(const_int 0 [0])
#(const_int 0 [0])
#])) /home/stevenb/devel/build-test/gcc/include/emmintrin.h:1424
 {*avx_movv2di_internal}
# (expr_list:REG_EQUIV (const_vector:V2DI [
#(const_int 0 [0])
#(const_int 0 [0])
#])
#(nil)))
vpxor   %xmm0, %xmm0, %xmm0     # 14    *avx_movv2di_internal/1 [length = 4]

vs.

#(insn:TI 14 12 27 3 (set (reg:V2DI 21 xmm0 [66])
#(const_vector:V2DI [
#(const_int 0 [0])
#(const_int 0 [0])
#])) /home/stevenb/devel/build-test/gcc/include/emmintrin.h:1424
1124 {*movv2di_internal}
# (expr_list:REG_EQUIV (const_vector:V2DI [
#(const_int 0 [0])
#(const_int 0 [0])
#])
#(nil)))
pxor    %xmm0, %xmm0    # 14    *movv2di_internal/1     [length = 4]

These insns just look the same to GCC, so even if the sse2 builtin expander is
used, the AVX instruction is selected.

Thus a bug, confirmed. Adding i386 guy to CC.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #6 from Thiago Macieira 2012-08-11 23:23:39 UTC ---
(In reply to comment #5)
> "Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
> are just wrappers around builtin functions. It may work correctly if you put
> the bzero functions in two separate files or call the builtins directly (a
> variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
> defined, I don't think it will ever work.

They *are* in separate files already. Calling the builtin directly instead of
the intrinsic wrapper might work, but I did not test it because it's not
acceptable, as the code would be GCC-specific.

> Have you considered using ifunc?

IFUNC is also irrelevant: in order to use it, I need to have two separate
source files which are compiled with different compiler settings, so we end up
where we started: the bzero_sse2() function will have AVX code.
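
For reference, a sketch of what an IFUNC-based dispatch looks like when both
variants live in one file and carry per-function target attributes instead of
per-file options. It assumes GCC on a GNU/Linux ELF target (IFUNC needs
support from the toolchain and the C library); all names are hypothetical.
It does not address the per-file case described above:

```c
#include <emmintrin.h>
#include <stddef.h>

/* AVX variant, selected per function rather than per file. */
__attribute__((target("avx")))
static void bzero_avx(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {                 /* count = number of 16-byte blocks */
        _mm_stream_si128((__m128i *)ptr, zero);   /* ptr must be 16-byte aligned */
        ptr += 16;
    }
}

/* SSE2 fallback. */
__attribute__((target("sse2")))
static void bzero_sse2(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i *)ptr, zero);
        ptr += 16;
    }
}

/* Resolver: called by the dynamic loader when my_bzero is bound.
   It returns the address of the implementation to use. */
static void (*resolve_my_bzero(void))(char *, size_t)
{
    __builtin_cpu_init();   /* needed because the resolver can run before constructors */
    return __builtin_cpu_supports("avx") ? bzero_avx : bzero_sse2;
}

/* Public symbol: an indirect function bound through the resolver. */
void my_bzero(char *ptr, size_t count)
    __attribute__((ifunc("resolve_my_bzero")));
```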


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #5 from Steven Bosscher 2012-08-11 22:46:31 UTC ---
"Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
are just wrappers around builtin functions. It may work correctly if you put
the bzero functions in two separate files or call the builtins directly (a
variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
defined, I don't think it will ever work.

Have you considered using ifunc?


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread pinskia at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |lto
           Severity|normal                      |enhancement

--- Comment #4 from Andrew Pinski 2012-08-11 22:39:48 UTC ---
Basically the target attribute should come into play but that is currently not
really supported even without LTO.