[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #12 from Richard Biener --- Ah, sorry I failed to see the entry from BB2 isn't fallthru. I agree the difference is spurious in this particular case but in general it's quite hard to do better without excessively rotating most multi-exit loops. Note in the end you also have to satisfy should_duplicate_loop_header_p for each block to duplicate. So what is suprious here is that we do not consider do { } while (i++ < n); to be a do-while loop but we do for do { } while (++i < n); due to the IV update in the latch. IMHO technically that's correct. The loop in question does not look like a do-while loop to us because the loop header doesn't contain "most" of the loop body. But CFG wise it perfectly satisfies the predicate. I think the new FAILs are somewhat spurious and one would need to investigate on a case-by-case basis whether they really are a regression. Often they are testcases for specific input IL (which you changed) rather than a specific C testcase.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #11 from Hongyu Wang --- (In reply to rguent...@suse.de from comment #10) > > It has two exits which makes it difficult > Or impossible to make it truly do-while. > But it's close enough and further rotating the loop doesn't make it better. Thanks for the explanation. But the original loop without eliminating the parser code is: [local count: 114863532]: goto ; [100.00%] [local count: 1014686025]: i.0_1 = (unsigned int) i_10; _2 = i.0_1 * 4; _3 = a_13(D) + _2; _4 = *_3; _5 = i.0_1 + 1; _6 = _5 * 4; _7 = a_13(D) + _6; _8 = *_7; if (_4 > _8) goto ; [5.50%] else goto ; [94.50%] [local count: 958878293]: i_15 = i_10 + 1; [local count: 1073741824]: # i_10 = PHI <0(2), i_15(4)> _9 = n_12(D) + -1; if (_9 > i_10) goto ; [94.50%] else goto ; [5.50%] [local count: 114863532]: # _11 = PHI <0(3), 1(5)> return _11; This is considered as not a do-while loop since the latch (bb 4) is not empty, and it is successfully rotated. I don't think there is much difference except i_15 = i_10 + 1 which can be optimized by fre after the parser change. So I wonder if the original one is rotated, why the one with empty latch can't.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #10 from rguenther at suse dot de --- On January 16, 2020 3:55:18 AM GMT+01:00, wwwhhhyyy333 at gmail dot com wrote: >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 > >--- Comment #9 from Hongyu Wang --- >(In reply to Hongtao.liu from comment #6) >> New fail by removal > >> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is >now >> do-while loop" >> unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 >" if " >> 3 >> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is >now >> do-while loop" >> unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 >"Will >> duplicate bb" 3 >> unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom >"Store-stores >> chain" 2 >> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is >now >> do-while loop" >> unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 >" if " >> 3 >> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is >now >> do-while loop" >> unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 >"Will >> duplicate bb" 3 > >I looked into this and found it is because of current judgement for >do-while >loop: > >-- > [local count: 114863532]: > goto ; [100.00%] > > [local count: 1014686025]: > i.0_1 = (unsigned int) i_11; > _2 = i.0_1 * 4; > _3 = a_14(D) + _2; > _4 = *_3; > _5 = i_11 + 1; > _6 = (unsigned int) _5; > _7 = _6 * 4; > _8 = a_14(D) + _7; > _9 = *_8; > if (_4 > _9) > goto ; [5.50%] > else > goto ; [94.50%] > > [local count: 958878294]: > > [local count: 1073741824]: > # i_11 = PHI <0(2), _5(6)> > _10 = n_13(D) + -1; > if (_10 > i_11) > goto ; [94.50%] > else > goto ; [5.50%] > > [local count: 114863532]: > # _12 = PHI <0(3), 1(4)> > return _12; > >This is regarded as a do-while loop because the it satisfies all the >condition >for do-while loop judgement: > >1) Loop latch is empty. >2) The latch has single predecessor. >3) The latch predecessor will exit loop. > >But I don't think the loop above is a do-while loop, so does the >conditions >need to be modified? It has two exits which makes it difficult Or impossible to make it truly do-while. But it's close enough and further rotating the loop doesn't make it better.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #9 from Hongyu Wang --- (In reply to Hongtao.liu from comment #6) > New fail by removal > unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now > do-while loop" > unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 " if " > 3 > unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now > do-while loop" > unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will > duplicate bb" 3 > unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores > chain" 2 > unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now > do-while loop" > unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 " if " > 3 > unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now > do-while loop" > unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will > duplicate bb" 3 I looked into this and found it is because of current judgement for do-while loop: -- [local count: 114863532]: goto ; [100.00%] [local count: 1014686025]: i.0_1 = (unsigned int) i_11; _2 = i.0_1 * 4; _3 = a_14(D) + _2; _4 = *_3; _5 = i_11 + 1; _6 = (unsigned int) _5; _7 = _6 * 4; _8 = a_14(D) + _7; _9 = *_8; if (_4 > _9) goto ; [5.50%] else goto ; [94.50%] [local count: 958878294]: [local count: 1073741824]: # i_11 = PHI <0(2), _5(6)> _10 = n_13(D) + -1; if (_10 > i_11) goto ; [94.50%] else goto ; [5.50%] [local count: 114863532]: # _12 = PHI <0(3), 1(4)> return _12; This is regarded as a do-while loop because the it satisfies all the condition for do-while loop judgement: 1) Loop latch is empty. 2) The latch has single predecessor. 3) The latch predecessor will exit loop. But I don't think the loop above is a do-while loop, so does the conditions need to be modified?
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #8 from Hongtao.liu --- (In reply to Andrew Pinski from comment #4) > But that is not true any more. So I think this optimization can be removed > as it is too early. Just double check the above testcase and the C++ > testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post > that removal. This optimization is most likely causing other missed > optimizations already too. So I would compile SPEC to see if there is any > differences; my bet you might find some. No big impact for SPEC2017, more or less like noise. 500.perlbench_r 0.21% 502.gcc_r 0.14% 505.mcf_r -0.40% 520.omnetpp_r -0.47% 523.xalancbmk_r -1.20% 525.x264_r -1.26% 531.deepsjeng_r -0.05% 541.leela_r -0.39% 548.exchange2_r -0.09% 557.xz_r-0.16% geomean for intrate -0.37% 503.bwaves_r-0.19% 507.cactuBSSN_r 0.23% 508.namd_r -0.12% 510.parest_r0.18% 511.povray_r-0.30% 519.lbm_r BuildSame #VALUE! 521.wrf_r -0.01% 526.blender_r -0.44% 527.cam4_r -0.17% 538.imagick_r 0.47% 544.nab_r -1.00% 549.fotonik3d_r 0.09% 554.roms_r 0.28% geomean for fprate -0.08% geomean -0.21%
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #7 from Andrew Pinski --- (In reply to Hongtao.liu from comment #6) > New fail by removal > > unix/-m32: c-c++-common/restrict-2.c -Wc++-compat scan-tree-dump-times > lim2 "Moving statement" 11 > unix/-m32: c-c++-common/restrict-2.c -std=gnu++14 scan-tree-dump-times > lim2 "Moving statement" 11 > unix/-m32: c-c++-common/restrict-2.c -std=gnu++17 scan-tree-dump-times > lim2 "Moving statement" 11 > unix/-m32: c-c++-common/restrict-2.c -std=gnu++2a scan-tree-dump-times > lim2 "Moving statement" 11 > unix/-m32: c-c++-common/restrict-2.c -std=gnu++98 scan-tree-dump-times > lim2 "Moving statement" 11 I think this is just there are a different number of statement moved out of the loop. The rest I don't know what is the issue but you should look into each one to figure out what the difference is.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #6 from Hongtao.liu --- New fail by removal unix/-m32: c-c++-common/restrict-2.c -Wc++-compat scan-tree-dump-times lim2 "Moving statement" 11 unix/-m32: c-c++-common/restrict-2.c -std=gnu++14 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m32: c-c++-common/restrict-2.c -std=gnu++17 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m32: c-c++-common/restrict-2.c -std=gnu++2a scan-tree-dump-times lim2 "Moving statement" 11 unix/-m32: c-c++-common/restrict-2.c -std=gnu++98 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while loop" unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 " if " 3 unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while loop" unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will duplicate bb" 3 unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores chain" 2 unix/-m32: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects scan-tree-dump vect "vectorized 1 loops" unix/-m32: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops" unix/-m64: c-c++-common/builtins.c -Wc++-compat (test for excess errors) unix/-m64: c-c++-common/restrict-2.c -Wc++-compat scan-tree-dump-times lim2 "Moving statement" 11 unix/-m64: c-c++-common/restrict-2.c -std=gnu++14 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m64: c-c++-common/restrict-2.c -std=gnu++17 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m64: c-c++-common/restrict-2.c -std=gnu++2a scan-tree-dump-times lim2 "Moving statement" 11 unix/-m64: c-c++-common/restrict-2.c -std=gnu++98 scan-tree-dump-times lim2 "Moving statement" 11 unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while loop" unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 " if " 3 unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while loop" unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will duplicate bb" 3 unix/-m64: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores chain" 2 unix/-m64: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects scan-tree-dump vect "vectorized 1 loops" unix/-m64: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #5 from Hongtao.liu --- (In reply to Andrew Pinski from comment #4) > But that is not true any more. So I think this optimization can be removed > as it is too early. Just double check the above testcase and the C++ > testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post > that removal. Removal works fines with those 2 testcases.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #4 from Andrew Pinski --- (In reply to Hongtao.liu from comment #3) > > ptrop ---> src1 + 18446744073709551612; > intop ---> j > > It seems on purpose??? Kinda. What needs to happen is a sign extend rather than a zero extend which is what this code is doing really in a weird way. Take (This is a modified testcase from PR4401 which was added in 2002 because of a C++ bug report and when the code was moved from being C specific to being C family specific): char *a; char b[] = ""; void foo (void) { unsigned int i, j; i = 2; j = 3; a[i + 1 - j] += i; } int main (void) { a = b; foo (); if (b[0] != 'A' + 2) __builtin_abort (); __builtin_exit (0); } --- CUT --- NOTE this code dates before the GCC code was moved into source control (1992). I suspect it has not changed because it is "working". Also note from that time period to now, there has been much more optimizations going on so this might be something which should change too. The comment says this: /* If what we are about to multiply by the size of the elements contains a constant term, apply distributive law and multiply that constant term separately. This helps produce common subexpressions. */ But that is not true any more. So I think this optimization can be removed as it is too early. Just double check the above testcase and the C++ testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post that removal. This optimization is most likely causing other missed optimizations already too. So I would compile SPEC to see if there is any differences; my bet you might find some.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #3 from Hongtao.liu --- (In reply to Andrew Pinski from comment #2) > I think the problem is we are folding the right side of the array (with the > conversion to size_t) too early. > That is: > src1[j-1] > > Is being folded too early to have (j-1)*4 > > Fixing this up in match.pd is wrong. > > This gets us the best code without any patch to match.pd: > int foo(unsigned int *__restrict src1, int i, int k, int n) > { > int j = k + n; > int sum = src1[j]; > int jj = j-1; > sum += src1[jj]; > if (i <= k) > { > j+=2; > int ii = j-3; > sum += src1[ii]; > } > return sum + j; > } > > See how j-1 and j-3 are not folded early and that fixes the issue. It's done by parser, cut from c-common.c -- 3182 if (TREE_CODE (intop) == MINUS_EXPR) 3183 subcode = (subcode == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR); 3184 /* Convert both subexpression types to the type of intop, 3185 because weird cases involving pointer arithmetic 3186 can result in a sum or difference with different type args. */ 3187 ptrop = build_binary_op (EXPR_LOCATION (TREE_OPERAND (intop, 1)), 3188subcode, ptrop, 3189convert (int_type, TREE_OPERAND (intop, 1)), 3190true); 3191 intop = convert (int_type, TREE_OPERAND (intop, 0)); 3192 } -- ptrop ---> src1 + 18446744073709551612; intop ---> j It seems on purpose???
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed||2019-12-18 Ever confirmed|0 |1 Severity|normal |enhancement --- Comment #2 from Andrew Pinski --- I think the problem is we are folding the right side of the array (with the conversion to size_t) too early. That is: src1[j-1] Is being folded too early to have (j-1)*4 Fixing this up in match.pd is wrong. This gets us the best code without any patch to match.pd: int foo(unsigned int *__restrict src1, int i, int k, int n) { int j = k + n; int sum = src1[j]; int jj = j-1; sum += src1[jj]; if (i <= k) { j+=2; int ii = j-3; sum += src1[ii]; } return sum + j; } See how j-1 and j-3 are not folded early and that fixes the issue.
[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980 --- Comment #1 from Hongtao.liu --- test.c.033.fre1 foo (unsigned int * restrict src1, int i, int k, int n) { int sum; int j; long unsigned int _1; long unsigned int _2; unsigned int * _3; unsigned int _4; sizetype _7; unsigned int * _8; unsigned int _9; unsigned int _11; long unsigned int _12; long unsigned int _13; sizetype _14; unsigned int * _15; unsigned int _16; unsigned int _18; int _31; : j_23 = k_21(D) + n_22(D); _1 = (long unsigned int) j_23; _2 = _1 * 4; _3 = src1_24(D) + _2; _4 = *_3; sum_26 = (int) _4; _7 = _2 + 18446744073709551612; _8 = src1_24(D) + _7; _9 = *_8; _11 = _4 + _9; sum_27 = (int) _11; if (k_21(D) >= i_28(D)) goto ; [INV] else goto ; [INV] : j_29 = j_23 + 2; _12 = (long unsigned int) j_29; _13 = _12 * 4; _14 = _13 + 18446744073709551604; --- it shoule be simplified to _7 _15 = src1_24(D) + _14; _16 = *_15; _18 = _11 + _16; sum_30 = (int) _18; : # j_19 = PHI # sum_20 = PHI _31 = j_19 + sum_20; return _31; }