[Bug target/116582] gather is a win in some cases on zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #6 from Jan Hubicka ---
Here is a variant of the benchmark that needs masking:

#include <stdlib.h>
#define M 1024*1024
T a[M], b[M];
int indices[M];
char c[M];
__attribute__ ((noipa))
void test ()
{
  for (int i = 0; i < 1024 * 16; i++)
    if (c[i])
      a[i] += b[indices[i]];
}
int main()
{
  for (int i = 0; i < M; i++)
    {
      indices[i] = rand () % M;
      c[i] = rand () % 2;
    }
  for (int i = 0; i < 1; i++)
    test ();
  return 0;
}

jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float -march=native cnd.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

            281.03 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.62% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    2.345 K/sec                   ( +-  0.06% )
     1,156,011,975      cycles:u                  #    4.113 GHz                     ( +-  0.65% )
       757,216,769      stalled-cycles-frontend:u #   65.50% frontend cycles idle    ( +-  1.59% )
     1,292,982,312      instructions:u            #    1.12  insn per cycle
                                                  #    0.59  stalled cycles per insn ( +-  0.00% )
       360,669,069      branches:u                #    1.283 G/sec                   ( +-  0.00% )
           118,731      branch-misses:u           #    0.03% of all branches         ( +-  8.51% )

           0.28126 +- 0.00173 seconds time elapsed  ( +-  0.62% )

jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float -march=native cnd.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat -r 10 ./a.out
  401241:	62 f2 7d 4d 92 1c 8d	vgatherdps 0x904080(,%zmm1,4),%zmm3{%k5}
  40125b:	62 f2 7d 4e 92 14 8d	vgatherdps 0x904080(,%zmm1,4),%zmm2{%k6}
  40126a:	62 f2 7d 4f 92 0c a5	vgatherdps 0x904080(,%zmm4,4),%zmm1{%k7}
  401280:	62 f2 7d 4d 92 2c a5	vgatherdps 0x904080(,%zmm4,4),%zmm5{%k5}

 Performance counter stats for './a.out' (10 runs):

            266.73 msec task-clock:u              #    0.999 CPUs utilized            ( +-  4.31% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    2.471 K/sec                   ( +-  0.05% )
     1,097,343,324      cycles:u                  #    4.114 GHz                     ( +-  4.33% )
         4,009,606      stalled-cycles-frontend:u #    0.37% frontend cycles idle    ( +-  6.91% )
       241,592,306      instructions:u            #    0.22  insn per cycle
                                                  #    0.02  stalled cycles per insn ( +-  0.00% )
        35,549,063      branches:u                #  133.279 M/sec                   ( +-  0.00% )
            92,191      branch-misses:u           #    0.26% of all branches         ( +-  0.06% )

            0.2670 +- 0.0115 seconds time elapsed  ( +-  4.30% )

So the difference in the number of cycles is quite small, while the frontend works much harder without gather. If the c array is constant 1:

jh@shroud:~> ~/trunk-install-znver5/bin/g++ -DT=float -march=native cnd.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

            520.92 msec task-clock:u              #    1.000 CPUs utilized            ( +-  5.29% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               659      page-faults:u             #    1.265 K/sec                   ( +-  0.04% )
     2,142,512,947      cycles:u                  #    4.113 GHz                     ( +-  5.31% )
       137,707,449      stalled-cycles-frontend:u #    6.43% frontend cycles idle    ( +- 94.67% )
     1,553,801,640      instructions:u            #    0.73  insn per cycle
                                                  #    0.09  stalled cycles per insn ( +-  0.00% )
       344,940,506      branches:u                #  662.177 M/sec                   ( +
[Bug other/85716] No easy way for end-user to tell what GCC is doing when compilation is slow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85716

Jan Hubicka changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |hubicka at gcc dot gnu.org

--- Comment #16 from Jan Hubicka ---
For LTO linking we do have some idea about progress during ltrans, since we compute estimated sizes of functions and we know the size of the whole unit we build. The WPA stage can at least be divided into a few steps (i.e. streaming in, where we know the size of the input files, inlining, and streaming out).
[Bug target/116582] gather is a win in some cases on zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #3 from Jan Hubicka ---
Just for completeness, the codegen for the parest sparse matrix multiply is:

  0.31 │320:   kmovb       %k1,%k4
  0.25 │       kmovb       %k1,%k5
  0.28 │       vmovdqu32   (%rcx,%rax,1),%zmm0
  0.32 │       vpmovzxdq   %ymm0,%zmm4
  0.31 │       vextracti32x8 $0x1,%zmm0,%ymm0
  0.48 │       vpmovzxdq   %ymm0,%zmm0
 10.32 │       vgatherqpd  (%r14,%zmm4,8),%zmm2{%k4}
  1.90 │       vfmadd231pd (%rdx,%rax,2),%zmm2,%zmm1
 14.86 │       vgatherqpd  (%r14,%zmm0,8),%zmm5{%k5}
  0.27 │       vfmadd231pd 0x40(%rdx,%rax,2),%zmm5,%zmm1
  0.26 │       add         $0x40,%rax
  0.23 │       cmp         %rax,%rdi
       │     ↑ jne         320

which looks OK to me.
[Bug target/116582] gather is a win in some cases on zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #2 from Jan Hubicka ---
It is mysterious. I was looking into why in some cases the gather is a win in the micro-benchmark and a loss in the real benchmark. Indeed, the distribution of indices makes a difference. If I make the indices random, then the performance effect is neutral:

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

            454.77 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               663      page-faults:u             #    1.458 K/sec
     1,854,500,227      cycles:u                  #    4.078 GHz
         4,788,337      stalled-cycles-frontend:u #    0.26% frontend cycles idle
       651,597,070      instructions:u            #    0.35  insn per cycle
                                                  #    0.01  stalled cycles per insn
        58,222,408      branches:u                #  128.027 M/sec
            60,269      branch-misses:u           #    0.10% of all branches

       0.455155383 seconds time elapsed
       0.455154000 seconds user
       0.0         seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401212:	62 f2 7d 4a 92 04 8d	vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out':

            448.84 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               663      page-faults:u             #    1.477 K/sec
     1,834,437,666      cycles:u                  #    4.087 GHz
         4,522,424      stalled-cycles-frontend:u #    0.25% frontend cycles idle
       160,137,040      instructions:u            #    0.09  insn per cycle
                                                  #    0.03  stalled cycles per insn
        27,502,394      branches:u                #   61.274 M/sec
            60,328      branch-misses:u           #    0.22% of all branches

       0.449240415 seconds time elapsed
       0.449224000 seconds user
       0.0         seconds sys

If I make the stride 8, then it is a win:

#include
#define M 1024*1024
int indices[M];
T a[M], b[M];
__attribute__ ((noipa))
void test ()
{
  for (int i = 0; i < 1024 * 16; i++)
    a[i] += b[indices[i]];
}
int main()
{
  for (int i = 0; i < M; i++)
    indices[i] = (i * 8) % M;
  for (int i = 0; i < 1; i++)
    test ();
  return 0;
}

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          5,827.78 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               222      page-faults:u             #   38.093 /sec
    23,975,482,386      cycles:u                  #    4.114 GHz
       784,362,546      stalled-cycles-frontend:u #    3.27% frontend cycles idle
       576,680,806      instructions:u            #    0.02  insn per cycle
                                                  #    1.36  stalled cycles per insn
        41,523,290      branches:u                #    7.125 M/sec
            53,461      branch-misses:u           #    0.13% of all branches

       5.828522527 seconds time elapsed
       5.828224000 seconds user
       0.0         seconds sys

jh@shroud:/tmp> ~/trunk-install-znver5/bin/g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401252:	62 f2 7d 4a 92 04 8d	vgatherdps 0x404080(,%zmm1,4),%zmm0{%k2}

 Performance counter stats for './a.out'
[Bug middle-end/116582] New: gather is a win in some cases on zen CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

            Bug ID: 116582
           Summary: gather is a win in some cases on zen CPUs
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

While the sparse matrix multiply in parest and tsvc does not seem to work well with gather, the following benchmark likes it:

T a[M], b[M];
__attribute__ ((noipa))
void test ()
{
  for (int i = 0; i < 1024 * 16; i++)
    a[i] += b[indices[i]];
}
int main()
{
  for (int i = 0; i < M; i++)
    indices[i] = (i * 8) % M;
  for (int i = 0; i < 1; i++)
    test ();
  return 0;
}

jan@localhost:/tmp> g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=^use_gather_4parts -mtune-ctrl=^use_gather_8parts -mtune-ctrl=^use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out

 Performance counter stats for './a.out':

          3,499.60 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               221      page-faults:u             #   63.150 /sec
    14,526,193,995      cycles:u                  #    4.151 GHz
       467,072,127      stalled-cycles-frontend:u #    3.22% frontend cycles idle
       577,324,069      instructions:u            #    0.04  insn per cycle
                                                  #    0.81  stalled cycles per insn
        41,578,204      branches:u                #   11.881 M/sec
            50,517      branch-misses:u           #    0.12% of all branches

       3.500660600 seconds time elapsed
       3.49715     seconds user
       0.00000     seconds sys

jan@localhost:/tmp> g++ -DT=float -march=native gather.c -Ofast -mtune-ctrl=use_gather_4parts -mtune-ctrl=use_gather_8parts -mtune-ctrl=use_gather -mtune=native -fdump-tree-all-details ; objdump -d a.out | grep gather ; perf stat ./a.out
  401250:	c4 e2 65 92 04 8d 40	vgatherdps %ymm3,0x404040(,%ymm1,4),%ymm0

 Performance counter stats for './a.out':

          1,263.87 msec task-clock:u              #    0.922 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               222      page-faults:u             #  175.651 /sec
     5,172,067,789      cycles:u                  #    4.092 GHz
        93,135,962      stalled-cycles-frontend:u #    1.80% frontend cycles idle
       167,783,419      instructions:u            #    0.03  insn per cycle
                                                  #    0.56  stalled cycles per insn
        21,097,560      branches:u                #   16.693 M/sec
            24,253      branch-misses:u           #    0.11% of all branches

       1.370533592 seconds time elapsed
       1.265143000 seconds user
       0.0         seconds sys

The non-gather loop is:

.L2:
	movslq	indices(%rax), %rcx
	movslq	indices+8(%rax), %rdi
	addq	$16, %rax
	movslq	indices-12(%rax), %rdx
	movslq	indices-4(%rax), %rsi
	vmovss	b(,%rdi,4), %xmm1
	vmovss	b(,%rcx,4), %xmm0
	vinsertps	$0x10, b(,%rsi,4), %xmm1, %xmm1
	vinsertps	$0x10, b(,%rdx,4), %xmm0, %xmm0
	vmovlhps	%xmm1, %xmm0, %xmm0
	vaddps	a-16(%rax), %xmm0, %xmm0
	vmovaps	%xmm0, a-16(%rax)
	cmpq	$65536, %rax

while the gather loop is:

.L2:
	vmovdqa	indices(%rax), %ymm1
	vmovaps	%ymm2, %ymm3
	addq	$32, %rax
	vgatherdps	%ymm3, b(,%ymm1,4), %ymm0
	vaddps	a-32(%rax), %ymm0, %ymm0
	vmovaps	%ymm0, a-32(%rax)
	cmpq	$65536, %rax
	jne	.L2
[Bug ipa/116296] [13/14/15 Regression] internal compiler error: in merge, at ipa-modref-tree.cc:176 at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116296

--- Comment #2 from Jan Hubicka ---
It is most likely some problem with computing bit offsets for the alias oracle. I guess multiplying that number by sizeof (long) * 11 * 11 * 8 triggers an overflow. Probably harmless for -fdisable-checking generated code, since that access should be undefined behaviour then. I will take a look.
[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

Jan Hubicka changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
   Last reconfirmed|            |2024-08-01
             Status|UNCONFIRMED |NEW
     Ever confirmed|0           |1

--- Comment #2 from Jan Hubicka ---
Looking at the change, I do not see how it could disable inlining. It should only reduce the function size estimates used by the heuristics. I think it is more likely loop optimization doing something crazy, but we need to figure out what really changed in the codegen.
[Bug ipa/109914] --suggest-attribute=pure misdiagnoses static functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109914

--- Comment #7 from Jan Hubicka ---
The idea is to help developers annotate, e.g., a binary tree search function which they know always terminates, but whose finiteness the compiler cannot prove. Infinite loops with no side effects written in convoluted ways are almost never intentional, so the developer can almost always add the pure attribute based on his/her understanding of what the code really does.
[Bug ipa/116055] [14/15 Regression] ICE from gcc.c-torture/unsorted/dump-noaddr.c after "Fix modref's iteraction with store merging"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116055

Jan Hubicka changed:

           What    |Removed  |Added
----------------------------------------------------------------------------
         Resolution|---      |FIXED
             Status|ASSIGNED |RESOLVED

--- Comment #9 from Jan Hubicka ---
Fixed.
[Bug ipa/116055] [14/15 Regression] ICE from gcc.c-torture/unsorted/dump-noaddr.c after "Fix modref's iteraction with store merging"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116055

--- Comment #4 from Jan Hubicka ---
This does not reproduce for me (with trunk nor a gcc14 build with --target=powerpc64le-linux-gnu). However, the problem is almost surely the sanity check in the dumping code that flags do not get worse (which they now can, thanks to store merging).

gcc/ChangeLog:

	* ipa-modref.cc (analyze_function): Do not ICE when flags regress.

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index f6a758b5f42..59cfe91f987 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -3297,7 +3297,8 @@ analyze_function (bool ipa)
 		fprintf (dump_file, " Flags for param %i improved:",
 			 (int)i);
 	      else
-		gcc_unreachable ();
+		fprintf (dump_file, " Flags for param %i changed:",
+			 (int)i);
 	      dump_eaf_flags (dump_file, old_flags, false);
 	      fprintf (dump_file, " -> ");
 	      dump_eaf_flags (dump_file, new_flags, true);
@@ -3313,7 +3314,7 @@ analyze_function (bool ipa)
 	      || (summary->retslot_flags & EAF_UNUSED))
 	    fprintf (dump_file, " Flags for retslot improved:");
 	  else
-	    gcc_unreachable ();
+	    fprintf (dump_file, " Flags for retslot changed:");
 	  dump_eaf_flags (dump_file, past_retslot_flags, false);
 	  fprintf (dump_file, " -> ");
 	  dump_eaf_flags (dump_file, summary->retslot_flags, true);
@@ -3328,7 +3329,7 @@ analyze_function (bool ipa)
 	      || (summary->static_chain_flags & EAF_UNUSED))
 	    fprintf (dump_file, " Flags for static chain improved:");
 	  else
-	    gcc_unreachable ();
+	    fprintf (dump_file, " Flags for static chain changed:");
 	  dump_eaf_flags (dump_file, past_static_chain_flags, false);
 	  fprintf (dump_file, " -> ");
 	  dump_eaf_flags (dump_file, summary->static_chain_flags, true);

Does it help?
[Bug ipa/106783] [12/13/14/15 Regression] ICE in ipa-modref.cc:analyze_function since r12-5247-ga34edf9a3e907de2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

--- Comment #6 from Jan Hubicka ---
The problem is that n/=0 is undefined behavior (so we can optimize out a call to a function doing a divide by zero), while __builtin_trap is observable and we do not optimize out code paths that may lead to it. So isolate-paths is de facto pessimizing code from this POV. If it used __builtin_unreachable, things would work. I think some parts of the compiler use __builtin_unreachable (such as loop unrolling) and others __builtin_trap. It would be nice to have a consistent solution to this.
[Bug tree-optimization/109985] [12/13/14 Regression] __builtin_prefetch ignored by GCC 12/13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109985

Jan Hubicka changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression]   |[12/13/14 Regression]
                   |__builtin_prefetch ignored |__builtin_prefetch ignored
                   |by GCC 12/13               |by GCC 12/13

--- Comment #10 from Jan Hubicka ---
Fixed on trunk.
[Bug ipa/113907] [12/13 regression] ICU miscompiled on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

Jan Hubicka changed:

           What    |Removed                  |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 regression] |[12/13 regression] ICU
                   |ICU miscompiled on x86   |miscompiled on x86 since
                   |since                    |r14-5109-ga291237b628f41
                   |r14-5109-ga291237b628f41 |

--- Comment #82 from Jan Hubicka ---
All wrong-code issues I know of are now fixed on 14/15.
[Bug ipa/111613] [12/13 Regression] Bit field stores can be incorrectly optimized away when -fstore-merging is in effect since r12-5383-g22c242342e38eb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111613

Jan Hubicka changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression]   |[12/13 Regression] Bit
                   |Bit field stores can be    |field stores can be
                   |incorrectly optimized away |incorrectly optimized away
                   |when -fstore-merging is in |when -fstore-merging is in
                   |effect since               |effect since
                   |r12-5383-g22c242342e38eb   |r12-5383-g22c242342e38eb

--- Comment #9 from Jan Hubicka ---
Fixed on 14/15.
[Bug ipa/114207] [12/13 Regression] modref gets confused by vectorized code `-O3 -fno-tree-forwprop` since r12-5439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

Jan Hubicka changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression]  |[12/13 Regression] modref
                   |modref gets confused by   |gets confused by vectorized
                   |vectorized code `-O3      |code `-O3
                   |-fno-tree-forwprop` since |-fno-tree-forwprop` since
                   |r12-5439                  |r12-5439

--- Comment #8 from Jan Hubicka ---
Fixed on 14/15 so far.
[Bug ipa/115033] [12/13 Regression] Incorrect optimization of by-reference closure fields by fre1 pass since r12-5113-gd70ef65692fced
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115033

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression]    |[12/13 Regression]
                   |Incorrect optimization of   |Incorrect optimization of
                   |by-reference closure fields |by-reference closure fields
                   |by fre1 pass since          |by fre1 pass since
                   |r12-5113-gd70ef65692fced    |r12-5113-gd70ef65692fced

--- Comment #22 from Jan Hubicka ---
Fixed on 14/15 so far.
[Bug ipa/113291] [14/15 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291

Jan Hubicka changed:

           What    |Removed  |Added
----------------------------------------------------------------------------
             Status|ASSIGNED |RESOLVED
         Resolution|---      |FIXED

--- Comment #12 from Jan Hubicka ---
Fixed.
[Bug middle-end/115277] [13 regression] ICF needs to match loop bound estimates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

Jan Hubicka changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
            Summary|[13/14/15 regression] ICF |[13 regression] ICF needs
                   |needs to match loop bound |to match loop bound
                   |estimates                 |estimates

--- Comment #7 from Jan Hubicka ---
Fixed on 14/15 so far.
[Bug ipa/111613] [12/13/14/15 Regression] Bit field stores can be incorrectly optimized away when -fstore-merging is in effect since r12-5383-g22c242342e38eb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111613

--- Comment #7 from Jan Hubicka ---
I suppose there is not much to do about past noread flags. I do not see how optimization can invalidate other properties, so I am testing the following:

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index f994388a96a..53a2e35133d 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -3004,6 +3004,9 @@ analyze_parms (modref_summary *summary, modref_summary_lto *summary_lto,
 	(past, ecf_flags,
 	 VOID_TYPE_P (TREE_TYPE (TREE_TYPE (current_function_decl))));
+      /* Store merging can produce reads when combining together multiple
+	 bitfields.  See PR111613.  */
+      past &= ~(EAF_NO_DIRECT_READ | EAF_NO_INDIRECT_READ);
       if (dump_file && (flags | past) != flags && !(flags & EAF_UNUSED))
 	{
 	  fprintf (dump_file,
[Bug tree-optimization/114207] [12/13/14/15 Regression] modref gets confused by vectorized code `-O3 -fno-tree-forwprop` since r12-5439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207

--- Comment #5 from Jan Hubicka ---
The offset gets lost in ipa-prop.cc:

diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
index 7d7cb3835d2..99ebd6229ec 100644
--- a/gcc/ipa-prop.cc
+++ b/gcc/ipa-prop.cc
@@ -1370,9 +1370,9 @@ unadjusted_ptr_and_unit_offset (tree op, tree *ret, poly_int64 *offset_ret)
 {
   if (TREE_CODE (op) == ADDR_EXPR)
     {
-      poly_int64 extra_offset = 0;
+      poly_int64 extra_offset;
       tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
-						 &offset);
+						 &extra_offset);
       if (!base)
 	{
 	  base = get_base_address (TREE_OPERAND (op, 0));

Here offset is the offset being tracked, and get_addr_base_and_unit_offset is intended to initialize extra_offset, which is later added to offset. In the testcase the pointer is first offset by +4 and later by -4, which combines to 0.
[Bug ipa/115033] [12/13/14/15 Regression] Incorrect optimization of by-reference closure fields by fre1 pass since r12-5113-gd70ef65692fced
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115033

--- Comment #18 from Jan Hubicka ---
modref_eaf_analysis::analyze_ssa_name misinterprets the EAF flags: if a dereferenced parameter is passed (to map_iterator in the testcase), it can be returned indirectly, which in turn makes it escape into the next function call. I am testing:

diff --git a/gcc/ipa-modref.cc b/gcc/ipa-modref.cc
index a5adce8ea39..a4e3cc34b4d 100644
--- a/gcc/ipa-modref.cc
+++ b/gcc/ipa-modref.cc
@@ -2571,8 +2571,7 @@ modref_eaf_analysis::analyze_ssa_name (tree name, bool deferred)
 	      int call_flags = deref_flags
 		 (gimple_call_arg_flags (call, i), ignore_stores);
 	      if (!ignore_retval && !(call_flags & EAF_UNUSED)
-		  && !(call_flags & EAF_NOT_RETURNED_DIRECTLY)
-		  && !(call_flags & EAF_NOT_RETURNED_INDIRECTLY))
+		  && !(call_flags & (EAF_NOT_RETURNED_DIRECTLY || EAF_NOT_RETURNED_INDIRECTLY)))
 		merge_call_lhs_flags (call, i, name, false, true);
 	      if (ecf_flags & (ECF_CONST | ECF_NOVOPS))
 		m_lattice[index].merge_direct_load ();
[Bug lto/114501] [12/13/14/15 Regression] ICE during lto streaming
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114501

Jan Hubicka changed:

           What    |Removed                  |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression] |[12/13/14/15 Regression]
                   |ICE during modref with   |ICE during lto streaming
                   |LTO                      |
                 CC|                         |hubicka at gcc dot gnu.org
          Component|ipa                      |lto

--- Comment #11 from Jan Hubicka ---
Note that this is not modref related - it is just the last pass run before streaming. We miss some free-lang-data handling, I guess. Will take a look.
[Bug ipa/67051] symtab_node::equal_address_to too conservative?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67051

--- Comment #2 from Jan Hubicka ---
I believe that there was some discussion on this in the past. I would be quite happy to change the predicate to be more aggressive. The current code basically duplicates what the original fold-const.c did. One problem is that we have no way to declare in a header that one symbol is an alias of another while being defined in a different translation unit.

jan@localhost:/tmp> cat t.c
extern int a;
extern int b __attribute ((alias("a")));
jan@localhost:/tmp> gcc t.c
t.c:2:12: error: ‘b’ aliased to undefined symbol ‘a’
    2 | extern int b __attribute ((alias("a")));
      |            ^
jan@localhost:/tmp> clang t.c
t.c:2:28: error: alias must point to a defined variable or function
    2 | extern int b __attribute ((alias("a")));
      |                           ^
t.c:2:28: note: the function or variable specified in an alias must refer to its mangled name
1 error generated.

So if one wants to use aliases intentionally (to do something smart about superposing), then basically the only valid testcases would be ones where translation units never use both names together. Also, folding is done early, when the alias may not be declared yet, but that can be solved by a check for the symtab state.
[Bug middle-end/115277] [13/14/15 regression] ICF needs to match loop bound estimates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

Jan Hubicka changed:

           What    |Removed                 |Added
----------------------------------------------------------------------------
            Summary|ICF needs to match loop |[13/14/15 regression] ICF
                   |bound estimates         |needs to match loop bound
                   |                        |estimates

--- Comment #1 from Jan Hubicka ---
Reproduces on 14 and trunk. GCC 12 is not able to determine the loop bound during early optimizations.
[Bug middle-end/115277] New: ICF needs to match loop bound estimates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277

            Bug ID: 115277
           Summary: ICF needs to match loop bound estimates
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

jan@localhost:/tmp> cat tt.c
int array[1000];
void
test (int a)
{
  if (__builtin_expect (a > 3, 1))
    return;
  for (int i = 0; i < a; i++)
    array[i] = i;
}
void
test2 (int a)
{
  if (__builtin_expect (a > 10, 1))
    return;
  for (int i = 0; i < a; i++)
    array[i] = i;
}
int
main ()
{
  test (1);
  test (2);
  test (3);
  test2 (10);
  if (array[9] != 9)
    __builtin_abort ();
  return 0;
}
jan@localhost:/tmp> gcc -O2 tt.c ; ./a.out
jan@localhost:/tmp> gcc -O3 tt.c ; ./a.out
Aborted (core dumped)

The problem here is that we do not match value ranges and thus we can end up with different estimates of the number of iterations.
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

Jan Hubicka changed:

           What    |Removed                  |Added
----------------------------------------------------------------------------
            Summary|[12/13/14/15 Regression] |[12/13/14 Regression] Wrong
                   |Wrong code at -O with    |code at -O with ipa-modref
                   |ipa-modref on aarch64    |on aarch64

--- Comment #22 from Jan Hubicka ---
Fixed on trunk so far.
[Bug libstdc++/109442] Dead local copy of std::vector not removed from function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109442

--- Comment #19 from Jan Hubicka ---
Note that the testcase from PR115037 also shows that we are not able to optimize out dead stores to the vector, which is another quite noticeable problem:

void test()
{
  std::vector<int> test;
  test.push_back (1);
}

We allocate the block, store 1 and immediately delete it:

void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

  [local count: 1073741824]:
  _61 = operator new (4);

  [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

  [count: 0]:
  test ={v} {CLOBBER};
  resx 2
}

So my understanding is that we decided not to optimize away the dead stores, since the particular operator delete does not pass this test:

  /* If the call is to a replaceable operator delete and results
     from a delete expression as opposed to a direct call to
     such operator, then we can treat it as free.  */
  if (fndecl
      && DECL_IS_OPERATOR_DELETE_P (fndecl)
      && DECL_IS_REPLACEABLE_OPERATOR (fndecl)
      && gimple_call_from_new_or_delete (stmt))
    return ". o ";

This is because we believe that operator delete may be implemented in an insane way that inspects the values stored in the block being freed. I can sort of see that one can write standard-conforming code that allocates some POD data and inspects it in the destructor. However, for std::vector this argument is not really applicable. The standard does specify that new/delete are used to allocate/deallocate the memory, but it does not say how the memory is organized or what happens before deallocation (i.e. it is probably valid for std::vector to memset the block just before deallocating it). A similar argument can IMO be used for eliding unused memory allocations: it is kind of up to the std::vector implementation how many allocations/deallocations it does, right? So we need a way to annotate the new/delete calls in the standard library as safe for such optimizations (i.e. implement clang's __builtin_operator_new/delete?). How does clang manage to optimize this out without additional hinting?
[Bug middle-end/115037] Unused std::vector is not optimized away.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

Jan Hubicka changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jason at redhat dot com,
                   |        |jwakely at redhat dot com

--- Comment #2 from Jan Hubicka ---
I tried to look for duplicates but did not find one. However, I think the first problem is that we do not optimize away the store of 1 to the vector, while clang does. I think this is because we do not believe we can trust that the delete operator is safe? We get:

void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

  [local count: 1073741824]:
  _61 = operator new (4);

  [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

  [count: 0]:
  test ={v} {CLOBBER};
  resx 2
}

If we cannot trust that operator delete is well behaved, perhaps we can arrange an explicit clobber before calling it? I think it is up to std::vector to decide what it will do with the stored array, so in this case even an insane operator delete has no right to expect that the data in the vector will be sane :)
[Bug middle-end/115037] New: Unused std::vector is not optimized away.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037

            Bug ID: 115037
           Summary: Unused std::vector is not optimized away.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Compiling

#include <vector>
void test()
{
  std::vector<int> test;
  test.push_back (1);
}

leads to

_Z4testv:
.LFB1253:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$4, %edi
	call	_Znwm
	movl	$4, %esi
	movl	$1, (%rax)
	movq	%rax, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_ZdlPvm

while clang optimizes to:

_Z4testv:                               # @_Z4testv
	.cfi_startproc
# %bb.0:
	retq
[Bug middle-end/115036] New: division is not shortened based on value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115036

            Bug ID: 115036
           Summary: division is not shortened based on value range
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

For

long test(long a, long b)
{
  if (a > 65535 || a < 0)
    __builtin_unreachable ();
  if (b > 65535 || b < 0)
    __builtin_unreachable ();
  return a/b;
}

we produce:

test:
.LFB0:
	.cfi_startproc
	movq	%rdi, %rax
	cqto
	idivq	%rsi
	ret

while clang does:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movq	%rdi, %rax
                                        # kill: def $ax killed $ax killed $rax
	xorl	%edx, %edx
	divw	%si
	movzwl	%ax, %eax
	retq

clang also by default adds a 32-bit divide path even when the value range is not known:

long test(long a, long b)
{
  return a/b;
}

compiles as:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movq	%rdi, %rax
	movq	%rdi, %rcx
	orq	%rsi, %rcx
	shrq	$32, %rcx
	je	.LBB0_1
# %bb.2:
	cqto
	idivq	%rsi
	retq
[Bug ipa/114985] [15 regression] internal compiler error: in discriminator_fail during stage2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114985

--- Comment #14 from Jan Hubicka ---
So this is a problem in ipa_value_range_from_jfunc? It is Martin's code; I hope he will know why the types are wrong here. One can get a type compatibility problem with mismatched declarations and LTO, but it seems that this testcase is single-file. So indeed this looks like a bug either in jump function construction or even earlier...
[Bug middle-end/114852] New: jpegxl 10.0.1 is faster with clang18 than with gcc14
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114852

            Bug ID: 114852
           Summary: jpegxl 10.0.1 is faster with clang18 than with gcc14
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3 reports about an 8% difference. I can measure 13% on zen3. The code has changed and is no longer bound by push_back but runs the AVX2 version of the inner loops. The hottest loops look comparable.

GCC:

  0.00 │266:┌─→vmovaps (%r14,%rax,4),%ymm0
  0.11 │    │  vmulps (%rcx,%rax,4),%ymm7,%ymm2
  1.18 │    │  vfnmadd213ps (%rsi,%rax,4),%ymm11,%ymm0
  0.25 │    │  vmulps %ymm2,%ymm0,%ymm0
  5.94 │    │  vroundps $0x8,%ymm0,%ymm2
  0.35 │    │  vsubps %ymm2,%ymm0,%ymm0
  1.05 │    │  vmulps (%rdx,%rax,4),%ymm0,%ymm0
  3.19 │    │  vmovaps %ymm0,0x0(%r13,%rax,4)
  0.15 │    │  vandps %ymm10,%ymm2,%ymm0
  0.03 │    │  add $0x8,%rax
  0.03 │    │  vcmpeqps %ymm8,%ymm0,%ymm2
  0.09 │    │  vsqrtps %ymm0,%ymm0
 27.25 │    │  vaddps %ymm0,%ymm6,%ymm6
  0.35 │    │  vandnps %ymm9,%ymm2,%ymm0
  0.12 │    │  vaddps %ymm0,%ymm5,%ymm5
  0.05 │    ├──cmp %r12,%rax
  0.02 │    └──jb 266

and clang:

  0.00 │ c90:┌─→vmulps (%r9,%rdx,4),%ymm0,%ymm2
  0.97 │     │  vmovaps (%r15,%rdx,4),%ymm1
  0.36 │     │  vsubps %ymm2,%ymm1,%ymm1
  4.24 │     │  vmulps (%rcx,%rdx,4),%ymm4,%ymm2
  1.92 │     │  vmulps %ymm2,%ymm1,%ymm1
  0.65 │     │  vroundps $0x8,%ymm1,%ymm2
  0.06 │     │  vsubps %ymm2,%ymm1,%ymm1
  1.11 │     │  vmulps (%rax,%rdx,4),%ymm1,%ymm1
  3.53 │     │  vmovaps %ymm1,(%rsi,%rdx,4)
  0.68 │     │  vandps %ymm6,%ymm2,%ymm1
  0.23 │     │  vcmpneqps %ymm5,%ymm2,%ymm2
  3.64 │     │  add $0x8,%rdx
  0.24 │     │  vsqrtps %ymm1,%ymm1
 22.16 │     │  vaddps %ymm1,%ymm8,%ymm8
  0.25 │     │  vbroadcastss 0x31eba5(%rip),%ymm1        # 34f840
  0.05 │     │  vandps %ymm1,%ymm2,%ymm1
  0.04 │     │  vaddps %ymm1,%ymm7,%ymm7
  0.11 │     ├──cmp %rdi,%rdx
  0.07 │     └──jb c90

GCC profile:

 10.78%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::EstimateEntropy(jxl::AcStrategy const&, float, unsigned long, unsigned long, jxl::ACSConfig const&, float con
  7.02%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::FindBestMultiplier(float const*, float const*, unsigned long, float, float, bool) [clone .part.0]
  4.50%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::Symmetric5Row(jxl::Plane const&, jxl::RectT const&, long, jxl:
  4.47%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels(jxl::AcStrategy::Type, float const*, unsigned long, float*, float*
  4.31%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::TransformToPixels(jxl::AcStrategy::Type, float*, float*, unsigned long, float*)
  4.00%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState const&, int const* restrict*, jxl::AcStra
  3.56%  cjxl  libm.so.6         [.] __ieee754_pow_fma
  3.49%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::IDCT1DImpl<8ul, 8ul>::operator()(float const*, unsigned long, float*, unsigned long, f
  3.43%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::AdaptiveQuantizationImpl::ComputeTile(float, float, jxl::Image3 const&, jxl::Re
  3.27%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<32ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2:
  3.16%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()(float*, float*) [clone .isra.0]
  2.87%  cjxl  libjxl.so.0.10.1  [.] void jxl::N_AVX2::(anonymous namespace)::ComputeScaledIDCT<4ul, 8ul>::operator()::operator()::operator() const&, jxl::RectT const&, jxl::DequantMatrices const&, jxl::AcStrategyImage const*, jxl::Plane const*, jxl::Quantizer const*, jxl::Rect
  5.03%  cjxl  libjxl.so.0.10.1  [.] jxl::ThreadPool::RunCallState const&, jxl::RectT const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Pla
  4.66%  cjxl  libjxl.so.0.10.1  [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()(float*, float*)
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 --- Comment #9 from Jan Hubicka --- Phoronix still claims the difference https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2
[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236 --- Comment #3 from Jan Hubicka --- It seems this performance difference is still there on zen4: https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3
[Bug tree-optimization/114787] [13 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787 --- Comment #18 from Jan Hubicka --- predict.cc queries the number of iterations using number_of_iterations_exit and loop_niter_by_eval and finally using estimated_stmt_executions. The first two queries do not update the upper bounds datastructure, which is why we get away without computing them in some cases. I guess we can just drop the dumping here. We now dump the recorded estimates elsewhere, so this is somewhat redundant.
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #13 from Jan Hubicka --- Thanks a lot, looks great! Do we still auto-detect memmove when the copy constructor turns out to be memcpy equivalent after optimization?
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #9 from Jan Hubicka --- Your patch gives me error compiling testcase jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65, from /home/jh/t.C:1: /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator, _InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const pair*; _ForwardIterator = pair*; _Allocator = allocator >; _Traits = allocator_traits > >]’: /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31: required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with _Tp = pair; _Up = pair]’ 1127 | return std::__relocate_a(__cfirst, __clast, __result, __alloc); | ~^~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26: required from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp, _Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp = std::pair; _Alloc = std::allocator >; pointer = std::pair*; _Tp_alloc_type = std::vector >::_Tp_alloc_type]’ 509 | return std::__relocate_a(__first, __last, __result, __alloc); |~^~~~ /home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32: required from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) 
[with _Args = {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator >]’ 647 | __new_finish = _S_relocate(__old_start, __old_finish, |~~~^~~ 648 |__new_start, _M_get_Tp_allocator()); | ~~~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21: required from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = std::pair; _Alloc = std::allocator >; value_type = std::pair]’ 1294 | _M_realloc_append(__x); | ~^ /home/jh/t.C:8:25: required from here 8 | stack.push_back (pair); | ^~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with _Tp = pair]’ 1084 | std::addressof(std::move(*__first | ~~^ In file included from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61, from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64, from /home/jh/trunk-install/include/c++/14.0.1/vector:62: /home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared here 168 | const _Tp* addressof(const _Tp&&) = delete; |^ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: note: use ‘-fdiagnostics-all-candidates’ to display considered candidates 1084 | std::addressof(std::move(*__first | ~~^ It is easy to check if conversion happens - just compile it and see if there is memcpy or memmove in the optimized dump file (or final assembly)
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #8 from Jan Hubicka --- I had wrong noexcept specifier. This version works, but I still need to inline relocate_object_a into the loop diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..f02d4fb878f 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION "relocation is only possible for values of the same type"); _ForwardIterator __cur = __result; for (; __first != __last; ++__first, (void)++__cur) - std::__relocate_object_a(std::__addressof(*__cur), -std::__addressof(*__first), __alloc); + { + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, std::__addressof(*__cur), std::move(*std::__addressof(*__first))); + __traits::destroy(__alloc, std::__addressof(*std::__addressof(*__first))); + } return __cur; } @@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template _GLIBCXX20_CONSTEXPR inline __enable_if_t::value, _Tp*> -__relocate_a_1(_Tp* __first, _Tp* __last, - _Tp* __result, +__relocate_a_1(_Tp* __restrict __first, _Tp* __last, + _Tp* __restrict __result, [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept { ptrdiff_t __count = __last - __first; @@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::__niter_base(__result), __alloc); } + template +_GLIBCXX20_CONSTEXPR +inline _Tp* +__relocate_a(_Tp* __restrict __first, _Tp* __last, +_Tp* __restrict __result, +allocator<_Up>& __alloc) +noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc))) +{ + return std::__relocate_a_1(__first, __last, __result, __alloc); +} + /// @endcond #endif // C++11
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #6 from Jan Hubicka --- Thanks. I though the relocate_a only cares about the fact if the pointed-to type can be bitwise copied. It would be nice to early produce memcpy from libstdc++ for std::pair, so the second patch makes sense to me (I did not test if it works) I think it would be still nice to tell GCC that the copy loop never gets overlapping memory locations so the cases which are not early optimized to memcpy can still be optimized later (or vectorized if it does really something non-trivial). So i tried your second patch fixed so it compiles: diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..0d2e588ae5e 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template _GLIBCXX20_CONSTEXPR inline __enable_if_t::value, _Tp*> -__relocate_a_1(_Tp* __first, _Tp* __last, - _Tp* __result, +__relocate_a_1(_Tp* __restrict __first, _Tp* __last, + _Tp* __restrict __result, [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept { ptrdiff_t __count = __last - __first; @@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::__niter_base(__result), __alloc); } + template +_GLIBCXX20_CONSTEXPR +inline _Tp* +__relocate_a(_Tp* __restrict __first, _Tp* __last, +_Tp* __restrict __result, +allocator<_Up>& __alloc) +noexcept(std::__is_bitwise_relocatable<_Tp>::value) +{ + return std::__relocate_a_1(__first, __last, __result, __alloc); +} + /// @endcond #endif // C++11 it does not make ldist to hit, so the restrict info is still lost. I think the problem is that if you call relocate_object the restrict reduces scope, so we only know that the elements are pairwise disjoint, not that the vectors are. This is because restrict is interpreted early pre-inlining, but it is really Richard's area. 
It seems that the patch makes us go through __uninitialized_copy_a instead of __uninit_copy. I am not even sure how these differ, so I need to stare at the code a bit more to make sense of it :)
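To illustrate the scope issue in isolation: restrict on the whole-range copy promises the two arrays are disjoint over the entire loop, while restrict confined to a per-element helper only promises each element pair is disjoint once inlined. A minimal sketch (not the libstdc++ code; the function names are made up):

```cpp
#include <cassert>
#include <cstddef>

// Whole-range restrict: dst and src are promised disjoint over the
// entire loop, so the compiler may treat the loop as a memcpy (or
// vectorize it without runtime overlap checks).
void copy_whole (float *__restrict dst, const float *__restrict src, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    dst[i] = src[i];
}

// Per-element restrict: once copy_one is inlined into the loop, the
// promise only covers a single element pair, so the whole arrays may
// still overlap as far as the compiler knows.
static inline void copy_one (float *__restrict d, const float *__restrict s)
{
  *d = *s;
}

void copy_elementwise (float *dst, const float *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    copy_one (dst + i, src + i);
}
```

Both functions compute the same result; the difference is only in what aliasing information survives inlining.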
[Bug middle-end/114822] New: ldist should produce memcpy/memset/memmove histograms based on loop information converted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114822 Bug ID: 114822 Summary: ldist should produce memcpy/memset/memmove histograms based on loop information converted Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- When a loop is converted to a string builtin we lose information about its size. This means that we won't expand it inline even when the block size is expected to be small. This causes a performance problem e.g. on std::vector and the testcase from PR114821, which at least with profile feedback runs significantly slower than the variant where memcpy is produced early:

#include
typedef unsigned int uint32_t;
int pair;
void test()
{
  std::vector stack;
  stack.push_back (pair);
  while (!stack.empty())
    {
      int cur = stack.back();
      stack.pop_back();
      if (true)
        {
          cur++;
          stack.push_back (cur);
          stack.push_back (cur);
        }
      if (cur > 1)
        break;
    }
}
int main()
{
  for (int i = 0; i < 1; i++)
    test();
}
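For reference, the shape of loop that loop distribution converts to a string builtin can be as simple as the following sketch (copy_ints is illustrative, not from the testcase); once it becomes a memcpy call, the profile-derived trip count that could drive inline expansion is gone:

```cpp
#include <cassert>

// A copy loop of the shape loop distribution recognizes and replaces
// with a memcpy call at higher optimization levels.  After the
// replacement, the loop's iteration-count information is lost.
void copy_ints (int *dst, const int *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}
```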
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #2 from Jan Hubicka --- What I am shooting for is to optimize it later in loop distribution. We can recognize memcpy loop if we can figure out that source and destination memory are different. We can help here with restrict, but I was bit lost in how to get them done. This seems to do the trick, but for some reason I get memmove diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..1a6223ea892 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } return __result + __count; } + + template +_GLIBCXX20_CONSTEXPR +inline __enable_if_t::value, _Tp*> +__relocate_a(_Tp * __restrict __first, _Tp *__last, +_Tp * __restrict __result, _Allocator& __alloc) noexcept +{ + ptrdiff_t __count = __last - __first; + if (__count > 0) + { +#ifdef __cpp_lib_is_constant_evaluated + if (std::is_constant_evaluated()) + { + for (; __first != __last; ++__first, (void)++__result) + { + // manually inline relocate_object_a to not lose restrict qualifiers + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, __result, std::move(*__first)); + __traits::destroy(__alloc, std::__addressof(*__first)); + } + return __result; + } #endif + __builtin_memcpy(__result, __first, __count * sizeof(_Tp)); + } + return __result + __count; +} +#endif + + template +_GLIBCXX20_CONSTEXPR +#if _GLIBCXX_HOSTED +inline __enable_if_t::value, _Tp*> +#else +inline _Tp * +#endif +__relocate_a(_Tp * __restrict __first, _Tp *__last, +_Tp * __restrict __result, _Allocator& __alloc) +noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc, +__result, std::move(*__first))) +&& noexcept(std::allocator_traits<_Allocator>::destroy( + __alloc, std::__addressof(*__first +{ + for (; __first != __last; ++__first, (void)++__result) + { + // 
manually inline relocate_object_a to not lose restrict qualifiers + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, __result, std::move(*__first)); + __traits::destroy(__alloc, std::__addressof(*__first)); + } + return __result; +} template
[Bug libstdc++/114821] New: _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 Bug ID: 114821 Summary: _M_realloc_append should use memcpy instead of loop to copy data when possible Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- In the testcase

#include
typedef unsigned int uint32_t;
std::pair pair;
void test()
{
  std::vector> stack;
  stack.push_back (pair);
  while (!stack.empty())
    {
      std::pair cur = stack.back();
      stack.pop_back();
      if (!cur.first)
        {
          cur.second++;
          stack.push_back (cur);
          stack.push_back (cur);
        }
      if (cur.second > 1)
        break;
    }
}
int main()
{
  for (int i = 0; i < 1; i++)
    test();
}

we produce _M_realloc_append which uses a loop to copy the data instead of memcpy. This is bigger and slower. The reason why __relocate_a does not use memcpy seems to be the fact that pair has a copy constructor. The loop still can be pattern matched by ldist, but it fails with: (compute_affine_dependence ref_a: *__first_1, stmt_a: *__cur_37 = *__first_1; ref_b: *__cur_37, stmt_b: *__cur_37 = *__first_1; ) -> dependence analysis failed So we cannot disambiguate the old and new vector memory and prove that the loop is indeed a memcpy loop. I think this is valid, since operator new is not required to return new memory, but I think adding __restrict should solve this. The problem is that I got lost on where to add them, since relocate_a uses iterators instead of pointers
[Bug tree-optimization/114787] [13/14 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787 --- Comment #13 from Jan Hubicka --- -fdump-tree-all-all changing generated code is also bad. We probably should avoid dumping loop bounds when they are not recorded. I added dumping of loop bounds and this may be an unexpected side effect. Will take a look.
[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #8 from Jan Hubicka --- Note that the cold attribute is also quite strong, since it turns on optimize_size codegen, which is often a lot slower. Reading the discussion again, I don't think we have a way to make the inline keyword ignored by the inliner. We can add a not_really_inline attribute (a better name would be welcome).
[Bug tree-optimization/114779] __builtin_constant_p does not work in inline functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114779 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- Note that the test about side effects also makes it impossible to test for constantness of values passed to a function by reference, which could also be useful. A workaround is to load the value into a temporary so the side effect is not seen. So that early folding to 0 never made too much sense to me. I agree that it is a can of worms and it is not clear whether changing the behaviour would break things...
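The workaround mentioned above can be sketched like this (assumes a GCC-compatible compiler for __builtin_constant_p; known_constant is a made-up name):

```cpp
#include <cassert>

// Loading the referenced value into a temporary makes the dereference
// happen outside the __builtin_constant_p operand, so the builtin only
// sees the side-effect-free temporary and can still fold to 1 after
// inlining when the value is known at compile time.
static inline bool known_constant (const int &x)
{
  int tmp = x;                        // the dereference happens here
  return __builtin_constant_p (tmp);
}
```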
[Bug middle-end/114774] Missed DSE in simple code due to interleaving stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774 Jan Hubicka changed: What|Removed |Added Summary|Missed DSE in simple code |Missed DSE in simple code |due to other stores being |due to interleaving stores |conditional | --- Comment #1 from Jan Hubicka --- The other store being conditional is not the core issue. Here we miss DSE too:

#include
int a;
short p, q;
void test (int b)
{
  a = 1;
  if (b)
    p++;
  else
    q++;
  a = 2;
}

The problem in DSE seems to be that instead of recursively walking the memory-SSA graph it insists that the graph forms a chain. Now SRA leaves stores to scalarized variables and even removes the corresponding clobbers, so this is a relatively common scenario in non-trivial C++ code.
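For reference, a sketch of what the function above should look like after a DSE that walks both arms of the conditional (hand-written expected output, not compiler output; the names carry a 2 suffix to keep them distinct):

```cpp
#include <cassert>

int a2;
short p2, q2;

// The store a2 = 1 from the original test() is dead: it is overwritten
// by a2 = 2 on both paths, and the stores to p2/q2 in between cannot
// alias a2.  A DSE walking both arms should reduce test() to this.
void test_optimized (int b)
{
  if (b)
    p2++;
  else
    q2++;
  a2 = 2;
}
```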
[Bug middle-end/114774] New: Missed DSE in simple code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774 Bug ID: 114774 Summary: Missed DSE in simple code Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- In the following

#include
int a;
short *p;
void test (int b)
{
  a=1;
  if (b)
    {
      (*p)++;
      a=2;
      printf ("1\n");
    }
  else
    {
      (*p)++;
      a=3;
      printf ("2\n");
    }
}

we are not able to optimize out "a=1". This is a simplified real-world scenario where SRA does not remove the definition of SRAed variables. Note that clang does a conditional move here:

test:                                   # @test
        .cfi_startproc
# %bb.0:
        movq    p(%rip), %rax
        incw    (%rax)
        xorl    %eax, %eax
        testl   %edi, %edi
        leaq    .Lstr(%rip), %rcx
        leaq    .Lstr.2(%rip), %rdi
        cmoveq  %rcx, %rdi
        sete    %al
        orl     $2, %eax
        movl    %eax, a(%rip)
        jmp     puts@PLT                # TAILCALL
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #19 from Jan Hubicka --- I looked into the remaining exit/nonexit rename discussed here earlier before the PR was closed. The following patch would restore the code to do the same calls as before my patch:

PR tree-optimization/109596
* tree-ssa-loop-ch.c (ch_base::copy_headers): Fix use of exit/nonexit edges.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index b7ef485c4cc..cd5f6bc3c2a 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -952,13 +952,13 @@ ch_base::copy_headers (function *fun)
       if (!single_pred_p (nonexit->dest))
         {
           header = split_edge (nonexit);
-          exit = single_pred_edge (header);
+          nonexit = single_pred_edge (header);
         }
       edge entry = loop_preheader_edge (loop);
       propagate_threaded_block_debug_into (nonexit->dest, entry->dest);
-      if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
+      if (!gimple_duplicate_seme_region (entry, nonexit, bbs, n_bbs, copied_bbs,
                                          true))
         {
           delete candidate.static_exits;

I however convinced myself this is a noop: both exit and nonexit have the same source basic block. propagate_threaded_block_debug_into walks the predecessors of its first parameter and moves debug statements to the second parameter, so it does the same job, since the split BB is empty. gimple_duplicate_seme_region uses the parameter to update the loop header, but it does not do that correctly for loop header copying and we re-do it in tree-ssa-loop-ch. Still, the code as it is now in trunk is very confusing, so perhaps we should update it?
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #28 from Jan Hubicka --- So the main problem is that in t2 we have _ZN6vectorI12QualityValueEC1ERKS1_/7 (vector<_Tp>::vector(const vector<_Tp>&) [with _Tp = QualityValue]) Type: function definition analyzed alias cpp_implicit_alias Visibility: semantic_interposition public weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only Same comdat group as: _ZN6vectorI12QualityValueEC2ERKS1_/6 References: _ZN6vectorI12QualityValueEC2ERKS1_/6 (alias) Referring: Function flags: Called by: _Z41__static_initialization_and_destruction_0v/8 (can throw external) Calls: and in t1 we have _ZN6vectorI12QualityValueEC1ERKS1_/2 (constexpr vector<_Tp>::vector(const vector<_Tp>&) [with _Tp = QualityValue]) Type: function definition Visibility: semantic_interposition external public weak comdat comdat_group:_ZN6vectorI12QualityValueEC1ERKS1_ one_only References: Referring: Function flags: Called by: Calls: This is the same symbol name but in two different comdat groups (C1 compared to C5). With -O0 both seems to get the C5 group I can silence the ICE by making aliases undefined during symbol merging (which is kind of hack but should make sanity checks happy), but I am still lost how this is supposed to work in valid code.
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #27 from Jan Hubicka --- OK, but the problem is the same. Having comdats with the same key defining different sets of public symbols is IMO not a good situation for both non-LTO and LTO builds. Unless the additional alias is never used by valid code (which would make it useless, and we probably should not generate it), it should be possible to produce a scenario where the linker picks the wrong version of the comdat and we get an undefined symbol in non-LTO builds...
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #25 from Jan Hubicka --- So we have comdat groups that diverges in t1.o and t2.o. In one object it has alias in it while in other object it does not Merging nodes for _ZN6vectorI12QualityValueEC2ERKS1_. Candidates: _ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base ) Type: function definition analyzed Visibility: externally_visible semantic_interposition prevailing_def_ironly public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only next sharing asm name: 19 References: Referring: Read from file: t1.o Unit id: 1 Function flags: count:1073741824 (estimated locally) Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw external) Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZN6vectorI12QualityValueEC2ERKS1_/19 (__ct_base ) Type: function definition analyzed Visibility: externally_visible semantic_interposition preempted_ir public weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only Same comdat group as: _ZN6vectorI12QualityValueEC1ERKS1_/20 previous sharing asm name: 1 References: Referring: _ZN6vectorI12QualityValueEC1ERKS1_/20 (alias) Read from file: t2.o Unit id: 2 Function flags: count:1073741824 (estimated locally) Called by: Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/23 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/24 (1073741824 (estimated locally),1.00 per call) (can throw external) After resolution: _ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base ) Type: function definition analyzed Visibility: externally_visible semantic_interposition prevailing_def_ironly public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only next sharing asm name: 19 References: Referring: Read from file: t1.o 
Unit id: 1 Function flags: count:1073741824 (estimated locally) Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw external) Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00 per call) (can throw external) We opt for the version without the alias and later ICE in the sanity check verifying that aliases have the same comdat group as their targets. I wonder how this is ice-on-valid code, since with normal linking the aliased symbol may or may not appear in the winning comdat group, so using the alias has to break. If constexpr changes how the constructor is generated, isn't this a violation of the ODR? We probably can go and reset every node in the losing comdat group to silence the ICE, getting an undefined symbol instead.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #8 from Jan Hubicka --- I am not sure this ought to be P1:
- the compilation technically is finite, but does not finish in reasonable time
- it is possible to adjust the testcase (do the early inlining manually) and get the same infinite build on release branches
- if you ask for an inline bomb, you get it.
But after some more testing, I do not see a reasonably easy way to get better diagnostics. So I will retest the patch from comment #6 and go ahead with it.
[Bug ipa/113359] [13/14 Regression] LTO miscompilation of ceph on aarch64 and x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359 --- Comment #23 from Jan Hubicka --- The patch looks reasonable. We probably could hash the padding vectors at summary generation time to reduce the WPA overhead, but that can be done incrementally next stage1. I however wonder if we really guarantee to copy the padding everywhere else than in the total scalarization part (i.e. on all paths through RTL expansion)?
[Bug ipa/109817] internal error in ICF pass on Ada interfaces
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109817 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #5 from Jan Hubicka --- That check was added to verify that we do not lose the thunk annotations. Now that the datastructure is stable, I think we can simply drop it, if that makes Ada work.
[Bug gcov-profile/113765] [14 Regression] ICE: autofdo: val-profiler-threads-1.c compilation, error: probability of edge from entry block not initialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113765 --- Comment #6 from Jan Hubicka --- Running auto-fdo without guessing branch probabilities is a somewhat odd idea in general. I suppose we can indeed just avoid setting the full_profile flag. Though the optimization passes are not that well tested with non-full profiles, so there is some risk that the resulting code will be worse than without auto-FDO.
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- Found it, probably. I renamed exit to nonexit (since name was misleading) and then forgot to update propagate_threaded_block_debug_into (exit->dest, entry->dest); I will check this after teaching (which I have in 10 mins)
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #6 from Jan Hubicka --- On this testcase trunk does get same dump as gcc13 for pass just before ch2 with ch2 we get: @@ -192,9 +236,8 @@ # DEBUG BEGIN_STMT goto ; [100.00%] - [local count: 954449105]: + [local count: 954449104]: # j_15 = PHI - # DEBUG j => j_15 # DEBUG BEGIN_STMT a[b_14][j_15] = 0; # DEBUG BEGIN_STMT @@ -203,29 +246,30 @@ # DEBUG j => j_9 # DEBUG BEGIN_STMT if (j_9 <= 7) -goto ; [88.89%] +goto ; [87.50%] else -goto ; [11.11%] +goto ; [12.50%] [local count: 119292720]: + # DEBUG j => 0 # DEBUG BEGIN_STMT b_7 = b_14 + 1; # DEBUG b => b_7 # DEBUG b => b_7 # DEBUG BEGIN_STMT if (b_7 <= 6) -goto ; [87.50%] +goto ; [85.71%] else -goto ; [12.50%] +goto ; [14.29%] [local count: 119292720]: # b_14 = PHI - # DEBUG b => b_14 # DEBUG j => 0 # DEBUG BEGIN_STMT goto ; [100.00%] [local count: 17041817]: + # DEBUG b => 0 # DEBUG BEGIN_STMT optimize_me_not (); # DEBUG BEGIN_STMT So in addition to updating BB profile, we indeed end up moving debug statements around. The change of dump is: + Analyzing: if (b_1 <= 6) +Will eliminate peeled conditional in bb 6. +May duplicate bb 6 + Not duplicating bb 8: it is single succ. + Analyzing: if (j_2 <= 7) +Will eliminate peeled conditional in bb 4. +May duplicate bb 4 + Not duplicating bb 3: it is single succ. Loop 2 is not do-while loop: latch is not empty. +Duplicating header BB to obtain do-while loop Copying headers of loop 1 Will duplicate bb 6 - Not duplicating bb 8: it is single succ. -Duplicating header of the loop 1 up to edge 6->8, 2 insns. +Duplicating header of the loop 1 up to edge 6->7 Loop 1 is do-while loop Loop 1 is now do-while loop. +Exit count: 17041817 (estimated locally) +Entry count: 17041817 (estimated locally) +Peeled all exits: decreased number of iterations of loop 1 by 1. Copying headers of loop 2 Will duplicate bb 4 - Not duplicating bb 3: it is single succ. -Duplicating header of the loop 2 up to edge 4->3, 2 insns. 
+Duplicating header of the loop 2 up to edge 4->5 Loop 2 is do-while loop Loop 2 is now do-while loop. +Exit count: 119292720 (estimated locally) +Entry count: 119292720 (estimated locally) +Peeled all exits: decreased number of iterations of loop 2 by 1. Dumps moved around, but we do same duplicaitons as before (BB6 and BB4 to eliminate the conditionals). [local count: 1073741824]: # j_2 = PHI <0(8), j_9(3)> # DEBUG j => j_2 # DEBUG BEGIN_STMT if (j_2 <= 7) goto ; [88.89%] else goto ; [11.11%] [local count: 136334537]: # b_1 = PHI <0(2), b_7(5)> # DEBUG b => b_1 # DEBUG BEGIN_STMT if (b_1 <= 6) goto ; [87.50%] else goto ; [12.50%]
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #4 from Jan Hubicka --- The change makes loop iteration estimates more realistic, but does not introduce any new code that actually changes the IL, so it seems this makes an existing problem more visible. I will try to debug what happens.
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #59 from Jan Hubicka --- Just to explain what happens in the testcase: there are test and testb. They are almost the same:

int testb(void)
{
  struct bar *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}

The difference is in the alias set of fp. In one case it aliases with the (*ptr)++ while in the other it does not. This makes one function have a jump function specifying an aggregate value of 0 for *fp, while the other does not. Now with LTO both struct bar and struct foo become compatible for TBAA, so the functions get merged, and the winning variant has the jump function specifying aggregate 0, which is wrong in the context the code is invoked in.
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #58 from Jan Hubicka --- Created attachment 57702 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57702&action=edit Compare value ranges in jump functions This patch implements the jump function compare, however it is not good enough. Here is another wrong code: jh@ryzen3:~/gcc/build/stage1-gcc> cat a.c #include #include __attribute__((used)) int val,val2 = 1; struct foo {int a;}; struct foo **ptr; __attribute__ ((noipa)) int test2 (void *a) { ptr = (struct foo **)a; } int test3 (void *a); int test(void) { struct foo *fp; test2 ((void *)&fp); fp = NULL; (*ptr)++; test3 ((void *)&fp); } int testb (void); int main() { for (int i = 0; i < val2; i++) if (val) testb (); else test(); } jh@ryzen3:~/gcc/build/stage1-gcc> cat b.c #include struct bar {int a;}; struct foo {int a;}; struct barp {struct bar *f; struct bar *g;}; extern struct foo **ptr; int test2 (void *); int test3 (void *); int testb(void) { struct bar *fp; test2 ((void *)&fp); fp = NULL; (*ptr)++; test3 ((void *)&fp); } jh@ryzen3:~/gcc/build/stage1-gcc> cat c.c #include __attribute__ ((noinline)) int test3 (void *a) { if (!*(void **)a) abort (); return 0; } jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B ./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc -B ./ b.o a.o c.o ; ./a.out Aborted (core dumped) jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B ./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc -B ./ b.o a.o c.o --disable-ipa-icf ; ./a.out lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295] lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #55 from Jan Hubicka --- > Anyway, can we in the spot my patch changed just walk all > source->node->callees > cgraph_edges, for each of them find the corresponding > cgraph_edge in the alias > and for each walk all the jump_functions recorded > and union their m_vr? > Or is that something that can't be done in LTO for some reason? That was my first idea too, but the problem is that icf has (very limited) support for matching functions which differ by the order of the basic blocks: it computes a hash of every basic block and orders them by their hash prior to comparing. This seems half-finished since e.g. the order of edges in PHIs has to match exactly. Callee lists are officially randomly ordered, but practically they follow the order of basic blocks (as they are built this way). However since BB orders can differ, just walking both callee sequences and comparing pairwise does not work. This also makes merging the information harder, since we no longer have the BB map at the time we decide to merge. It is however not hard to match the jump functions while walking the gimple bodies and comparing statements, which is backportable and localized. I am still waiting for my statistics to converge and will send it soon.
[Bug ipa/106716] Identical Code Folding (-fipa-icf) confuses between functions with different [[likely]] attributes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106716 --- Comment #6 from Jan Hubicka --- The reason why GIMPLE_PREDICT is ignored is that it is never used after ipa-icf and gets removed at the very beginning of late optimizations. GIMPLE_PREDICT is consumed by the profile_generate pass which is run before ipa-icf. The reason why GIMPLE_PREDICT statements are not stripped during ICF is early inlining. If we early inline, we throw away its profile and estimate it again (in the context of the function it was inlined into) and for that it is a good idea to keep predicts. There is no convenient place to remove them after early inlining was done and before IPA passes and that is the only reason why they are around. We may revisit that since streaming them to LTO bytecode is probably more harmful than adding an extra pass after early opts to strip them. ICF doesn't have code to compare edge profiles and stmt histograms. It knows how to merge them (so the resulting BB profile is consistent after merging) but I suppose we may want to have some threshold on when we do not want to merge functions with very different branch probabilities in the hot part of their bodies...
[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241 Jan Hubicka changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #3 from Jan Hubicka --- Mine. Will debug why the tables diverge.
[Bug debug/92387] [11/12/13 Regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387 --- Comment #5 from Jan Hubicka --- The revision is changing inlining decisions, so it would probably be possible to reproduce the problem without that change with the right always_inline and noinline attributes.
[Bug tree-optimization/114207] [12/13/14 Regression] modref gets confused by vectorized code ` -O3 -fno-tree-forwprop` since r12-5439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #3 from Jan Hubicka --- mine. The summary is: loads: Base 0: alias set 1 Ref 0: alias set 1 access: Parm 0 param offset:4 offset:0 size:64 max_size:64 stores: Base 0: alias set 1 Ref 0: alias set 1 access: Parm 0 param offset:0 offset:0 size:64 max_size:64 while with fwprop we get: loads: Base 0: alias set 1 Ref 0: alias set 1 access: Parm 0 param offset:0 offset:0 size:64 max_size:64 stores: Base 0: alias set 1 Ref 0: alias set 1 access: Parm 0 param offset:0 offset:0 size:64 max_size:64 So it seems that offset is misaccounted.
[Bug lto/85432] Wodr can be more verbose for C code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85432 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WORKSFORME --- Comment #1 from Jan Hubicka --- This has been solved for a long time. We recognize ODR types by mangled names produced only by the C++ frontend. I checked that GCC 12, 13 and trunk do not produce the warning.
[Bug tree-optimization/114052] [11/12/13/14 Regression] Wrong code at -O2 for well-defined infinite loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114052 --- Comment #5 from Jan Hubicka --- So if I understand it right, you want to determine the property that if the loop header is executed then the BB containing undefined behavior at that iteration will be executed, too. modref tracks if a function will always return and if it cannot determine it, it sets the side_effect flag. So you can check for that in the modref summary. It uses finite_function_p which was originally done for pure/const detection and is implemented by looking at the loop nest to see if all loops are known to be finite and also by checking for irreducible loops. In your setup you probably also want to check for volatile asms that are also possibly infinite. In mod-ref we get around that by considering them to be side-effects anyway. There is also determine_unlikely_bbs which is trying to set profile_count to zero for as many basic blocks as possible by propagating from basic blocks containing undefined behaviour or a cold noreturn call backward & forward. The backward walk can be used to determine the property that executing the header implies UB. It stops on all loops though. In this case it would be nice to walk through loops known to be finite...
[Bug ipa/108802] [11/12/13/14 Regression] missed inlining of call via pointer to member function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108802 --- Comment #5 from Jan Hubicka --- I don't think we can reasonably expect every caller of a lambda function to be early inlined, so we need to extend ipa-prop to understand the obfuscated code. I discussed that with Martin some time ago - I think this is a quite common problem with modern C++, so we will need to pattern match this, which is quite unfortunate.
[Bug ipa/111960] [14 Regression] ICE: during GIMPLE pass: rebuild_frequencies: SIGSEGV (Invalid read of size 4) with -fdump-tree-rebuild_frequencies-all
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111960 --- Comment #5 from Jan Hubicka --- hmm. cfg.cc:815 for me is: fputs (", maybe hot", outf); which seems quite safe. The problem does not seem to reproduce for me: jh@ryzen3:~/gcc/build/gcc> ./xgcc -B ./ tt.c -O --param=max-inline-recursive-depth=100 -fdump-tree-rebuild_frequencies-all -wrapper valgrind ==25618== Memcheck, a memory error detector ==25618== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==25618== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25618== Command: ./cc1 -quiet -iprefix /home/jh/gcc/build/gcc/../lib64/gcc/x86_64-pc-linux-gnu/14.0.1/ -isystem ./include -isystem ./include-fixed tt.c -quiet -dumpdir a- -dumpbase tt.c -dumpbase-ext .c -mtune=generic -march=x86-64 -O -fdump-tree-rebuild_frequencies-all --param=max-inline-recursive-depth=100 -o /tmp/ccpkfjdK.s ==25618== ==25618== ==25618== HEAP SUMMARY: ==25618== in use at exit: 1,818,714 bytes in 1,175 blocks ==25618== total heap usage: 39,645 allocs, 38,470 frees, 12,699,874 bytes allocated ==25618== ==25618== LEAK SUMMARY: ==25618==definitely lost: 0 bytes in 0 blocks ==25618==indirectly lost: 0 bytes in 0 blocks ==25618== possibly lost: 8,032 bytes in 1 blocks ==25618==still reachable: 1,810,682 bytes in 1,174 blocks ==25618== suppressed: 0 bytes in 0 blocks ==25618== Rerun with --leak-check=full to see details of leaked memory ==25618== ==25618== For lists of detected and suppressed errors, rerun with: -s ==25618== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ==25627== Memcheck, a memory error detector ==25627== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==25627== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25627== Command: ./as --64 -o /tmp/ccp5TNme.o /tmp/ccpkfjdK.s ==25627== ==25637== Memcheck, a memory error detector ==25637== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 
==25637== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25637== Command: ./collect2 -plugin ./liblto_plugin.so -plugin-opt=./lto-wrapper -plugin-opt=-fresolution=/tmp/cclWZD7F.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /lib/../lib64/crt1.o /lib/../lib64/crti.o ./crtbegin.o -L. -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccp5TNme.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state ./crtend.o /lib/../lib64/crtn.o ==25637== /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: /lib/../lib64/crt1.o: in function `_start': /home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:103:(.text+0x2b): undefined reference to `main' collect2: error: ld returned 1 exit status ==25637== ==25637== HEAP SUMMARY: ==25637== in use at exit: 89,760 bytes in 39 blocks ==25637== total heap usage: 175 allocs, 136 frees, 106,565 bytes allocated ==25637== ==25637== LEAK SUMMARY: ==25637==definitely lost: 0 bytes in 0 blocks ==25637==indirectly lost: 0 bytes in 0 blocks ==25637== possibly lost: 0 bytes in 0 blocks ==25637==still reachable: 89,760 bytes in 39 blocks ==25637== of which reachable via heuristic: ==25637== newarray : 1,544 bytes in 1 blocks ==25637== suppressed: 0 bytes in 0 blocks ==25637== Rerun with --leak-check=full to see details of leaked memory ==25637== ==25637== For lists of detected and suppressed errors, rerun with: -s ==25637== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 Jan Hubicka changed: What|Removed |Added Summary|[14 regression] ICU |[12/13/14 regression] ICU |miscompiled since on x86|miscompiled since on x86 |since |since |r14-5109-ga291237b628f41|r14-5109-ga291237b628f41 --- Comment #41 from Jan Hubicka --- OK, the reason why this does not work is that ranger ignores earlier value ranges on everything but default defs and phis. // This is where the ranger picks up global info to seed initial // requests. It is a slightly restricted version of // get_range_global() above. // // The reason for the difference is that we can always pick the // default definition of an SSA with no adverse effects, but for other // SSAs, if we pick things up to early, we may prematurely eliminate // builtin_unreachables. // // Without this restriction, the test in g++.dg/tree-ssa/pr61034.C has // all of its unreachable calls removed too early. // // See discussion here: // https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571709.html void gimple_range_global (vrange &r, tree name, struct function *fun) { tree type = TREE_TYPE (name); gcc_checking_assert (TREE_CODE (name) == SSA_NAME); if (SSA_NAME_IS_DEFAULT_DEF (name) || (fun && fun->after_inlining) || is_a (SSA_NAME_DEF_STMT (name))) { get_range_global (r, name, fun); return; } r.set_varying (type); } This makes ipa-prop to ignore earlier known value range and mask the bug. However adding PHI makes the problem to reproduce: #include #include int data[100]; int c; static __attribute__((noinline)) int bar (int d, unsigned int d2) { if (d2 > 30) c++; return d + d2; } static int test2 (unsigned int i) { if (i > 100) __builtin_unreachable (); if (__builtin_expect (data[i] != 0, 1)) return data[i]; for (int j = 0; j < 100; j++) data[i] += bar (data[j], i&1 ? 
i+17 : i + 16); return data[i]; } static int test (unsigned int i) { if (i > 10) __builtin_unreachable (); if (__builtin_expect (data[i] != 0, 1)) return data[i]; for (int j = 0; j < 100; j++) data[i] += bar (data[j], i&1 ? i+17 : i + 16); return data[i]; } int main () { int ret = test (1) + test (2) + test (3) + test2 (4) + test2 (30); if (!c) abort (); return ret; } This fails with trunk, gcc12 and gcc13 and also with Jakub's patch.
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #39 from Jan Hubicka --- This testcase #include int data[100]; __attribute__((noinline)) int bar (int d, unsigned int d2) { if (d2 > 10) printf ("Bingo\n"); return d + d2; } int test2 (unsigned int i) { if (i > 10) __builtin_unreachable (); if (__builtin_expect (data[i] != 0, 1)) return data[i]; printf ("%i\n",i); for (int j = 0; j < 100; j++) data[i] += bar (data[j], i+17); return data[i]; } int test (unsigned int i) { if (i > 100) __builtin_unreachable (); if (__builtin_expect (data[i] != 0, 1)) return data[i]; printf ("%i\n",i); for (int j = 0; j < 100; j++) data[i] += bar (data[j], i+17); return data[i]; } int main () { test (1); test (2); test (3); test2 (4); test2 (100); return 0; } gets me most of what I want to reproduce ipa-prop problem. Functions test and test2 are split with different value ranges visible in the fnsplit dump. However curiously enough ipa-prop analysis seems to ignore the value ranges and does not attach them to the jump function, which is odd...
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #31 from Jan Hubicka --- Having a testcase is great. I was just playing with crafting one. I am still concerned about value ranges in ipa-prop's jump functions. Let me see if I can modify the testcase to also trigger the problem with value ranges in ipa-prop jump functions. Not streaming value ranges is an omission on my side (I mistakenly assumed we do stream them). We ought to stream them, since otherwise we will lose propagated return value ranges in partitioned programs, which is a pity.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #6 from Jan Hubicka --- Created attachment 57427 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57427&action=edit patch The patch makes compilation finish in reasonable time. I ended up needing to drop DISREGARD_INLINE_LIMITS in late inlining for functions with self-recursive always_inlines, since these grow large quickly and even non-recursive inlining is too slow. We also end up with quite ugly diagnostics of the form: tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param max-inline-insns-auto limit reached 13 | f1 (void) | ^~ tt.c:17:3: note: called from here 17 | f1 (); | ^ tt.c:6:1: error: inlining failed in call to ‘always_inline’ ‘f0’: --param max-inline-insns-auto limit reached 6 | f0 (void) | ^~ tt.c:16:3: note: called from here 16 | f0 (); | ^ tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param max-inline-insns-auto limit reached 13 | f1 (void) | ^~ tt.c:15:3: note: called from here 15 | f1 (); | ^ In function ‘f1’, inlined from ‘f0’ at tt.c:8:3, which is quite large so I cannot add it to a testsuite. I will see if I can reduce this even more.
[Bug middle-end/111054] [14 Regression] ICE: in to_sreal, at profile-count.cc:472 with -O3 -fno-guess-branch-probability since r14-2967
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111054 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from Jan Hubicka --- Fixed.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #5 from Jan Hubicka --- There is a cap in want_inline_self_recursive_call_p which gives up on inlining after reaching the max recursive inlining depth of 8. The problem is that the tree here is too wide. After early inlining f0 contains 4 calls to f1 and 3 calls to f0. Similarly for f1, so we have something like (9+3*9)^8 as a cap on the number of inlines, which takes a while to converge. One may want to limit the number of copies of function A within function B rather than the depth, but that number can be large even for sane code. I am making a patch to make the inliner ignore always_inline on all self-recursive inline decisions.
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #29 from Jan Hubicka --- The safest fix is to make equals_p reject merging functions with different value ranges assigned to corresponding SSA names. I would hope that, since early opts are still mostly local, that does not lead to a very large degradation. This is lame of course. If we go for smarter merging, we need to also handle ipa-prop jump functions. In that case I think equals_p needs to check if value ranges in SSA_NAMEs and jump functions differ and if so, keep that noted so the merging code can do the corresponding update. I will check how hard it is to implement this. (Equality handling is Martin Liska's code, but if I recall right, each equivalence class has a leader, and we can keep track of whether there are some differences WRT that leader, but I do not recall how subdivision of equivalence classes is handled).
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787 --- Comment #13 from Jan Hubicka --- So my understanding is that ivopts does something like offset = &base2 - &base1 and then translates val = base2[i] to val = *((base1+i)+offset), where (base1+i) is then an iv variable. I wonder if we consider a memory reference with its base changed via an offset a valid transformation. Is there a way to tell when this happens? A quick fix would be to run IPA modref before ivopts, but I do not see how such a transformation can work with the rest of alias analysis (PTA etc.)
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787 --- Comment #8 from Jan Hubicka --- I will take a look. Mod-ref only reuses the code detecting erroneous paths in ssa-split-paths, so that code will get confused, too. It makes sense for ivopts to compute the difference of two memory allocations, but I wonder if that won't also confuse PTA and other stuff, so perhaps we need a way to explicitly tag memory locations where such optimizations happen? (to make it clear that the original base is lost, or to keep track of it)
[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359 --- Comment #11 from Jan Hubicka --- If there are two ODR types with the same ODR name, one with an integer and the other with a pointer type as the third field, then indeed we should get an ODR warning and give up on handling them as ODR types for type merging. So dumping their assembler names would be a useful starting point. Of course if you have two ODR types with different names but you mix them up in a COMDAT function of the same name, then the warning will not trigger, so this might be some missing type compatibility check in the ipa-sra or ipa-prop summary, too.
[Bug ipa/97119] Top level option to disable creation of IPA symbols such as .localalias is desired
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97119 --- Comment #7 from Jan Hubicka --- Local aliases are created by the ipa-visibility pass. The most common case is that a function is declared inline but ELF interposition rules say that the symbol can be overwritten by a different library. Since GCC knows that all implementations must be equivalent, it can force calls within the DSO to be direct. I am not quite sure how this confuses stack unwinding on Solaris? For live patching, if you want to patch an inline function, one definitely needs to look for places it has been inlined to. However in the situation where the function got offlined, I think live patching should just work, since it will place the jump at the beginning of the function body. The logic for creating local aliases is in ipa-visibility.cc. Adding a command line option to control it is not hard. There are other transformations we do there - like breaking up comdat groups and other things. part aliases are controlled by -fno-partial-inlining, isra by -fno-ipa-sra. There is also ipa-cp controlled by -fno-ipa-prop. We also do aliases as part of OpenMP offloading and LTO partitioning that are kind of mandatory (there is no way to produce correct code without them).
[Bug ipa/113422] Missed optimizations in the presence of pointer chains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113422 --- Comment #2 from Jan Hubicka --- Cycling read-only var discovery would be quite expensive, since you need to interleave it with early opts each round. I wonder how llvm handles this? I think there is more hope with IPA-PTA getting a scalable version at -O2 and possibly being able to solve this.
[Bug ipa/113520] ICE with mismatched types with LTO (tree check: expected array_type, have integer_type in array_ref_low_bound)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113520 --- Comment #8 from Jan Hubicka --- I think the ipa-cp summaries should be used only when types match. At least Martin added type streaming for all the jump functions. So we are missing some check?
[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #16 from Jan Hubicka --- Fixed.
[Bug c++/109753] [13/14 Regression] pragma GCC target causes std::vector not to compile (always_inline on constructor)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109753 --- Comment #12 from Jan Hubicka --- I think this is a problem with two meanings of always_inline. One is "it must be inlined or otherwise we will not be able to generate code", the other is "disregard inline limits". I guess a practical solution here would be to ignore always_inline for functions called from static construction wrappers (since they only optimize around an array of function pointers). The question is how to communicate this down from the FE to ipa-inline...
[Bug middle-end/79704] [meta-bug] Phoronix Test Suite compiler performance issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704 Bug 79704 depends on bug 109811, which changed state. Bug 109811 Summary: libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #19 from Jan Hubicka --- I think we can declare this one fixed.
[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236 Jan Hubicka changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2024-01-05 CC||hubicka at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #2 from Jan Hubicka --- On zen3 I get 0.75MP/s for GCC and 0.80MP/s for clang, so only 6.6%, but seems reproducible. Profile looks comparable: gcc 30.96% cwebplibwebp.so.7.1.5 [.] GetCombinedEntropyUnre 26.19% cwebplibwebp.so.7.1.5 [.] VP8LHashChainFill 3.34% cwebplibwebp.so.7.1.5 [.] CalculateBestCacheSize 3.30% cwebplibwebp.so.7.1.5 [.] CombinedShannonEntropy 3.21% cwebplibwebp.so.7.1.5 [.] CollectColorBlueTransf clang: 34.06% cwebplibwebp.so.7.1.5[.] GetCombinedEntropy 28.95% cwebplibwebp.so.7.1.5[.] VP8LHashChainFill 5.37% cwebplibwebp.so.7.1.5[.] VP8LGetBackwardReferences 4.39% cwebplibwebp.so.7.1.5[.] CombinedShannonEntropy_SS 4.28% cwebplibwebp.so.7.1.5[.] CollectColorBlueTransform In the first loop clang seems to ifconvert while GCC doesn't: 0.59 │ lea kSLog2Table,%rdi 3.69 │ vmovss (%rdi,%rax,4),%xmm0 0.98 │ 6f: vcvtsi2ss%edx,%xmm2,%xmm1 0.63 │ vfnmadd213ss 0x0(%r13),%xmm0,%xmm1 38.16 │ vmovss %xmm1,0x0(%r13) 5.48 │ cmp %r12d,0xc(%r13) 0.06 │ ↓ jae 89 │ mov %r12d,0xc(%r13) 0.99 │ 89: mov 0x4(%r13),%edi 0.96 │ 8d: xor %eax,%eax 0.40 │ test %r12d,%r12d 0.60 │ setne%al │ vcvtsd2ss%xmm0,%xmm0,%xmm1 0.02 │362: mov %r15d,%eax 0.57 │ imul %r12d,%eax 0.00 │ cmp %r12d,%r9d 0.03 │ cmovbe %r12d,%r9d 0.02 │ vmovd%eax,%xmm0 0.08 │ vpinsrd $0x1,%r15d,%xmm0,%xmm0 1.50 │ vpaddd %xmm0,%xmm4,%xmm4 1.08 │ vcvtsi2ss%r15d,%xmm5,%xmm0 0.87 │ vfnmadd231ss %xmm0,%xmm1,%xmm3 5.40 │ vmovaps %xmm3,%xmm0 0.02 │38c: xor %eax,%eax 0.16 │ cmp $0x4,%r15d
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 --- Comment #6 from Jan Hubicka --- The internal loops are: static const unsigned keccakf_rotc[24] = { 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44 }; static const unsigned keccakf_piln[24] = { 10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1 }; static void keccakf(ulong64 s[25]) { int i, j, round; ulong64 t, bc[5]; for(round = 0; round < SHA3_KECCAK_ROUNDS; round++) { /* Theta */ for(i = 0; i < 5; i++) bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20]; for(i = 0; i < 5; i++) { t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1); for(j = 0; j < 25; j += 5) s[j + i] ^= t; } /* Rho Pi */ t = s[1]; for(i = 0; i < 24; i++) { j = keccakf_piln[i]; bc[0] = s[j]; s[j] = ROL64(t, keccakf_rotc[i]); t = bc[0]; } /* Chi */ for(j = 0; j < 25; j += 5) { for(i = 0; i < 5; i++) bc[i] = s[j + i]; for(i = 0; i < 5; i++) s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5]; } s[0] ^= keccakf_rndc[round]; } } I suppose with complete unrolling this will propagate, partly stay in registers and fold. I think increasing the default limits, especially -O3 may make sense. Value of 16 is there for very long time (I think since the initial implementation).
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 Jan Hubicka changed: What|Removed |Added Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark |is almost 40% slower vs.|is almost 40% slower vs. |Clang |Clang (not enough complete ||loop peeling) --- Comment #5 from Jan Hubicka --- On my zen3 machine the default build gets me 180MB/s, -O3 -flto -funroll-all-loops gets me 193MB/s, and -O3 -flto --param max-completely-peel-times=30 gets me 382MB/s; the speedup is gone with --param max-completely-peel-times=20, the default is 16.
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #4 from Jan Hubicka --- I keep mentioning to Larabel that he should use -fno-semantic-interposition, but he doesn't. Profile is very simple: 96.75% SMHasher[.] keccakf.lto_priv.0 ◆ All goes to simple loop. On Zen3 gcc 13 -march=native -Ofast -flto I get: 3.85 │330: mov%r8,%rdi 7.68 │ movslq (%rsi,%r9,1),%rcx 3.85 │ lea(%rax,%rcx,8),%r10 3.86 │ mov(%rdx,%r9,1),%ecx 3.83 │ add$0x4,%r9 3.86 │ mov(%r10),%r8 7.37 │ rol%cl,%rdi 7.37 │ mov%rdi,(%r10) 4.76 │ cmp$0x60,%r9 0.00 │ ↑ jne330 Clang seems to unroll it: 0.25 │ d0: mov -0x48(%rsp),%rdx ▒ 0.25 │ xor %r12,%rcx ▒ 0.25 │ mov %r13,%r12 ▒ 0.25 │ mov %r13,0x10(%rsp) ▒ 0.25 │ mov %rax,%r13 ◆ 0.26 │ xor %r15,%r13 ▒ 0.23 │ mov %r11,-0x70(%rsp) ▒ 0.25 │ mov %r8,0x8(%rsp) ▒ 0.25 │ mov %r15,-0x40(%rsp) ▒ 0.25 │ mov %r10,%r15 ▒ 0.26 │ mov %r10,(%rsp) ▒ 0.26 │ mov %r14,%r10 ▒ 0.25 │ xor %r12,%r10 ▒ 0.26 │ xor %rsi,%r15 ▒ 0.24 │ mov %rbp,-0x80(%rsp) ▒ 0.25 │ xor %rcx,%r15 ▒ 0.26 │ mov -0x60(%rsp),%rcx ▒ 0.25 │ xor -0x68(%rsp),%r15 ▒ 0.26 │ xor %rbp,%rdx ▒ 0.25 │ mov -0x30(%rsp),%rbp ▒ 0.25 │ xor %rdx,%r13 ▒ 0.24 │ mov -0x10(%rsp),%rdx ▒ 0.25 │ mov %rcx,%r12 ▒ 0.24 │ xor %rcx,%r13 ▒ 0.25 │ mov $0x1,%ecx ▒ 0.25 │ xor %r11,%rdx ▒ 0.24 │ mov %r8,%r11 ▒ 0.25 │ mov -0x28(%rsp),%r8 ▒ 0.26 │ xor -0x58(%rsp),%r8 ▒ 0.24 │ xor %rdx,%r8 ▒ 0.26 │ mov -0x8(%rsp),%rdx ▒ 0.25 │ xor %rbp,%r8 ▒ 0.26 │ xor %r11,%rdx ▒ 0.25 │ mov -0x20(%rsp),%r11 ▒ 0.25 │ xor %rdx,%r10 ▒
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 --- Comment #23 from Jan Hubicka --- Created attachment 56970 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56970&action=edit Patch I am testing Hi, this adds the -falign-all-functions parameter. It still looks like the more reasonable (and backward compatible) thing to do. I also poked at Richi's suggestion of extending the syntax of -falign-functions but I think it is less readable.
[Bug ipa/92606] [11/12/13 Regression][avr] invalid merge of symbols in progmem and data sections
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92606 --- Comment #31 from Jan Hubicka --- This is Martin's code, but I agree that equals_wpa should reject pairs with "dangerous" attributes on them (ideally we should hash them). I think we could add a test for same attributes to equals_wpa and eventually whitelist attributes we consider mergeable? There are attributes that serve no purpose once we enter the backend, so it may also be a good option to strip them, so they do not confuse passes like ICF.
[Bug ipa/81323] IPA-VRP doesn't handle return values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81323

Jan Hubicka changed:
           What    |Removed |Added
                 CC|        |hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka ---
Note that r14-5628-g53ba8d669550d3 does just the easy part: propagating within a single translation unit. We will need to add the actual IPA bits into WPA next stage1.
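A sketch of what return-value range propagation buys within one translation unit (function names are made up for illustration): once the compiler knows clamp() returns a value in [0, 255], the range test in the caller folds away.

```c
/* clamp() always returns a value in [0, 255].  With return-value VRP the
   compiler propagates that range to callers in the same translation unit. */
static int clamp(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

int always_small(int v)
{
    int r = clamp(v);
    /* With the propagated range [0, 255] this branch is provably dead
       and can be removed entirely. */
    if (r > 255)
        return -1;
    return r;
}
```

The comment's point is that this currently works only within a translation unit; carrying the same ranges across units requires the IPA/WPA bits still to be written.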
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

Jan Hubicka changed:
           What    |Removed |Added
             Status|NEW     |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org

--- Comment #18 from Jan Hubicka ---
Reading all the discussion again, I am leaning towards -falign-all-functions plus a documentation update explaining that -falign-functions/-falign-loops are optimizations and are ignored for -Os. I do use -falign-functions/-falign-loops when tuning for new generations of CPUs, and I definitely want to have a way to specify alignment that is ignored for cold functions (as a performance optimization); we have had this behavior since profile code was introduced in 2002. As an optimization, we also want to have hot functions aligned to more than the 8-byte boundary needed for patching. I will prepare a patch for this and send it for discussion. Perhaps we want -flive-patching to also imply a FUNCTION_BOUNDARY increase on x86-64? Or is live patching useful even if function entries are not aligned?
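A minimal sketch of the hot/cold distinction being discussed (the attributes are real GCC attributes; the functions are made up): because -falign-functions is treated as an optimization, functions known to be cold are left unaligned to save size, which is why a separate knob is wanted when alignment is a hard requirement, as for patching.

```c
/* Hot functions benefit from wide entry alignment; cold ones should not
   pay the size cost.  Under -falign-functions=N, GCC aligns hot_fn but may
   skip cold_fn -- the behavior the proposed -falign-all-functions would
   override for live-patching-style requirements. */
__attribute__((hot))  int hot_fn(int x)  { return x * 3; }
__attribute__((cold)) int cold_fn(int x) { return x + 1; }
```

Compiling this with -falign-functions=64 and inspecting symbol addresses (e.g. with nm) would be one way to observe the difference.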
[Bug tree-optimization/110062] missed vectorization in graphicsmagick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #11 from Jan Hubicka ---
trunk -O3 -flto -march=native -fopenmp
Operation: Sharpen: 257 256 256 Average: 256 Iterations Per Minute

GCC 13 -O3 -flto -march=native -fopenmp
257 256 256 Average: 256 Iterations Per Minute

clang 17 -O3 -flto -march=native -fopenmp
Operation: Sharpen: 257 256 256 Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference. The internal loop is:

  0.00 │460:┌─→movzbl    0x2(%rdx,%rax,4),%esi
  0.02 │    │  vmovss    (%r8,%rax,4),%xmm2
  0.95 │    │  vcvtsi2ss %esi,%xmm0,%xmm1
 20.22 │    │  movzbl    0x1(%rdx,%rax,4),%esi
  0.01 │    │  vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │  vcvtsi2ss %esi,%xmm0,%xmm1
 18.76 │    │  movzbl    (%rdx,%rax,4),%esi
  0.00 │    │  inc       %rax
  0.72 │    │  vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │  vcvtsi2ss %esi,%xmm0,%xmm1
 14.95 │    │  vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├──cmp       %rax,%r13
  0.35 │    └──jne       460

so it still does not get
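The annotated loop above is a scalar accumulation over interleaved byte channels: each iteration loads three bytes of a 4-byte pixel, converts each to float (vcvtsi2ss), and accumulates it against one float weight (vfmadd231ss). A hypothetical C sketch of that loop shape (the names and the 4-byte pixel layout are assumptions, not GraphicsMagick's actual code):

```c
/* Three per-channel float accumulators fed from byte data -- the pattern
   that produces the movzbl/vcvtsi2ss/vfmadd231ss triple per iteration
   in the annotated loop, and that the vectorizer fails to pick up. */
void sharpen_accum(const unsigned char *pix, const float *w, long n,
                   float *r, float *g, float *b)
{
    float sr = 0.f, sg = 0.f, sb = 0.f;
    for (long i = 0; i < n; i++) {
        float k = w[i];
        sb += k * (float)pix[4 * i + 0];
        sg += k * (float)pix[4 * i + 1];
        sr += k * (float)pix[4 * i + 2];
    }
    *r = sr; *g = sg; *b = sb;
}
```

Vectorizing this requires widening the bytes and either gathering the strided channels or processing whole pixels per lane, which is presumably where the missed optimization lies.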
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #18 from Jan Hubicka ---
I made a typo:
"Mainline with -O2 -flto -march=native run manually since build machinery patch is needed 23.03 22.85 23.04"
should be
"Mainline with -O3 -flto -march=native run manually since build machinery patch is needed 23.03 22.85 23.04"
So with -O2 we still get a slightly lower score than clang; with -O3 we are slightly better. push_back inlining does not seem to be a problem (as tested by increasing limits), so perhaps it is the more aggressive unrolling/vectorization settings clang has at -O2. I think upstream jpegxl should use -O3 or -Ofast instead of -O2. It is quite a typical kind of task that benefits from higher optimization levels. I filed https://github.com/libjxl/libjxl/issues/2970
[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka ---
On Zen4 hardware I now get:

GCC 13 with -O3 -flto -march=native -fopenmp
2163 2161 2153 Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
2004 1988 1991 Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
Operation: Resizing: 2126 2135 2123 Average: 2128 Iterations Per Minute

So no big changes here...
[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

--- Comment #8 from Jan Hubicka ---
On ARM32 and other targets, methods return the this pointer. Together with making the return value escape, this probably completely disables any chance of IPA tracking of C++ data types...
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka ---
Runtimes on Zen4 hardware:

trunk -O3 -flto -march=native
42171 42964 42106

clang -O3 -flto -march=native
37393 37423 37508

gcc 13 -O3 -flto -march=native
42380 42314 43285

So it seems the performance did not change.