[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #17 from amker at gcc dot gnu.org --- Author: amker Date: Mon Mar 7 16:39:27 2016 New Revision: 234034 URL: https://gcc.gnu.org/viewcvs?rev=234034=gcc=rev Log: PR rtl-optimization/69052 * rtlanal.c (commutative_operand_precedence): Set higher precedence to CONST_WIDE_INT. Modified: trunk/gcc/ChangeLog trunk/gcc/rtlanal.c
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #16 from amker at gcc dot gnu.org --- (In reply to amker from comment #14) > Author: amker > Date: Wed Mar 2 14:10:56 2016 > New Revision: 233907 > > URL: https://gcc.gnu.org/viewcvs?rev=233907=gcc=rev > Log: > > PR tree-optimization/69052 > * loop-invariant.c (canonicalize_address): New function. > (inv_can_prop_to_addr_use): Check validity of address expression > which is canonicalized by above function. > > gcc/testsuite/ChangeLog > PR tree-optimization/69052 > * gcc.target/i386/pr69052.c: New test. > > > Added: > trunk/gcc/testsuite/gcc.target/i386/pr69052.c > Modified: > trunk/gcc/ChangeLog > trunk/gcc/loop-invariant.c > trunk/gcc/testsuite/ChangeLog Just realized that CONST_INT and CONST_WIDE_INT both have the same precedence. The patch I applied could be broken in cases which have both CONST_INT and CONST_WIDE_INT address parts and the CONST_WIDE_INT one happens to be sorted after CONST_INT, as below code: /* Simplify all constant int summary if possible. */ for (i = 0; i < addr_parts.length (); i++) if (CONST_INT_P (addr_parts[i])) break; for (j = i + 1; j < addr_parts.length (); j++) { gcc_assert (CONST_INT_P (addr_parts[j])); <-assertion failure addr_parts[i] = simplify_gen_binary (PLUS, mode, addr_parts[i], addr_parts[j]); } Although I don't think we can really run into such complicated address expression, I will work out a patch to compare_address_parts to force CONST_INT after CONST_WIDE_INT during sorting.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Jeffrey A. Law changed: What|Removed |Added Status|NEW |RESOLVED CC||law at redhat dot com Resolution|--- |FIXED --- Comment #15 from Jeffrey A. Law --- Fixed by Bin's patch on the trunk.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #14 from amker at gcc dot gnu.org --- Author: amker Date: Wed Mar 2 14:10:56 2016 New Revision: 233907 URL: https://gcc.gnu.org/viewcvs?rev=233907=gcc=rev Log: PR tree-optimization/69052 * loop-invariant.c (canonicalize_address): New function. (inv_can_prop_to_addr_use): Check validity of address expression which is canonicalized by above function. gcc/testsuite/ChangeLog PR tree-optimization/69052 * gcc.target/i386/pr69052.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr69052.c Modified: trunk/gcc/ChangeLog trunk/gcc/loop-invariant.c trunk/gcc/testsuite/ChangeLog
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Richard Biener changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2016-03-01 Ever confirmed|0 |1
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Richard Biener changed: What|Removed |Added Priority|P3 |P1
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #12 from amker at gcc dot gnu.org --- Patch sent for review at https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00612.html It works for the reduced test case, could you please help me to check if it works for you original case? Thanks, bin
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #13 from Yuri Rumyantsev --- I checked that performance is back for the whole benchmark. Thanks a lot. Yuri. 2016-02-09 14:17 GMT+03:00 amker at gcc dot gnu.org: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 > > --- Comment #12 from amker at gcc dot gnu.org --- > Patch sent for review at > https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00612.html > It works for the reduced test case, could you please help me to check if it > works for you original case? > Thanks, > bin > > -- > You are receiving this mail because: > You reported the bug.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #9 from amker at gcc dot gnu.org --- (In reply to Ilya Enkovich from comment #8) > (In reply to amker from comment #7) > > According to discussion at https://gcc.gnu.org/ml/gcc/2016-01/msg00190.html, > > hook is probably not wanted in this case. > > Bernd gave another proposal by moving combine before loop transforms is also > > interesting, but it can be for GCC6. > > So a backend fix would be nice. > > Unfortunately my patch causes significant regressions in some SPEC > benchmarks. Looks like address operands order matters in some other places. I know little about x86, is it because of generation of non-canonical rtl expression after this change? Another question for this case is: Is it because operand ordering that inv_can_prop_to_addr_use failed to verify the result insn as a valid one? If so, is it because of non-canonical rtl expressions? I will check how combine merges these two instructions. Thanks.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #8 from Ilya Enkovich --- (In reply to amker from comment #7) > According to discussion at https://gcc.gnu.org/ml/gcc/2016-01/msg00190.html, > hook is probably not wanted in this case. > Bernd gave another proposal by moving combine before loop transforms is also > interesting, but it can be for GCC6. > So a backend fix would be nice. Unfortunately my patch causes significant regressions in some SPEC benchmarks. Looks like address operands order matters in some other places.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #10 from Ilya Enkovich --- (In reply to amker from comment #9) > I know little about x86, is it because of generation of non-canonical rtl > expression after this change? > > Another question for this case is: Is it because operand ordering that > inv_can_prop_to_addr_use failed to verify the result insn as a valid one? > If so, is it because of non-canonical rtl expressions? In i386 address expression only the first operand of PLUS may be another PLUS. Thus (r1 + r2) + CONST is a valid address but r1 + (r2 + CONST) is not. Thus when you propagate a register you may need to reorder operands.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #11 from amker at gcc dot gnu.org --- (In reply to Ilya Enkovich from comment #10) > (In reply to amker from comment #9) > > I know little about x86, is it because of generation of non-canonical rtl > > expression after this change? > > > > Another question for this case is: Is it because operand ordering that > > inv_can_prop_to_addr_use failed to verify the result insn as a valid one? > > If so, is it because of non-canonical rtl expressions? > > In i386 address expression only the first operand of PLUS may be another > PLUS. Thus (r1 + r2) + CONST is a valid address but r1 + (r2 + CONST) is > not. Thus when you propagate a register you may need to reorder operands. I see, I will check what combine does here. Maybe it's just an overlook in inv_can_prop_to_addr_use, which reminds me if other targets like AArch64 are also affected. Thanks.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #7 from amker at gcc dot gnu.org --- (In reply to Ilya Enkovich from comment #6) > (In reply to amker from comment #5) > > Not sure if stage4 is a good time for a new hook either. > > > > Any ideas? > > We can try to improve i386 address recognition to ignore operands order. > With this patch: > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index 34b57a4..b13d3f6 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -14111,8 +14111,16 @@ ix86_decompose_address (rtx addr, struct > ix86_address *out) > { > if (n >= 4) > return 0; > - addends[n++] = XEXP (op, 1); > - op = XEXP (op, 0); > + if (GET_CODE (XEXP (op, 1)) == PLUS) > + { > + addends[n++] = XEXP (op, 0); > + op = XEXP (op, 1); > + } > + else > + { > + addends[n++] = XEXP (op, 1); > + op = XEXP (op, 0); > + } > } >while (GET_CODE (op) == PLUS); >if (n >= 4) > > I get following ASM: > > .L5: > movlind@GOTOFF(%ebx,%esi,4), %eax > movl12(%esp), %edi > movl%ebp, (%edi,%eax,4) According to discussion at https://gcc.gnu.org/ml/gcc/2016-01/msg00190.html, hook is probably not wanted in this case. Bernd gave another proposal by moving combine before loop transforms is also interesting, but it can be for GCC6. So a backend fix would be nice.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Ilya Enkovich changed: What|Removed |Added CC||ienkovich at gcc dot gnu.org --- Comment #6 from Ilya Enkovich --- (In reply to amker from comment #5) > Not sure if stage4 is a good time for a new hook either. > > Any ideas? We can try to improve i386 address recognition to ignore operands order. With this patch: diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 34b57a4..b13d3f6 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -14111,8 +14111,16 @@ ix86_decompose_address (rtx addr, struct ix86_address *out) { if (n >= 4) return 0; - addends[n++] = XEXP (op, 1); - op = XEXP (op, 0); + if (GET_CODE (XEXP (op, 1)) == PLUS) + { + addends[n++] = XEXP (op, 0); + op = XEXP (op, 1); + } + else + { + addends[n++] = XEXP (op, 1); + op = XEXP (op, 0); + } } while (GET_CODE (op) == PLUS); if (n >= 4) I get following ASM: .L5: movlind@GOTOFF(%ebx,%esi,4), %eax movl12(%esp), %edi movl%ebp, (%edi,%eax,4)
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #5 from amker at gcc dot gnu.org --- (In reply to Yuri Rumyantsev from comment #0) > In loop_invariant phase additional function inv_can_prop_to_addr_use which > tried to determine if forward propagation for cheap address is possible > through call of verify_changes which is very poor in comparison with combine > phase. > For example, for attached test-case it tries > (gdb) call debug_rtx(def_insn) > (insn 69 67 70 9 (set (reg/f:SI 149) > (plus:SI (reg:SI 87) > (const:SI (unspec:SI [ > (symbol_ref:SI ("ind") [flags 0x2] 0x77ffbe10 ind>) > ] UNSPEC_GOTOFF t1.c:40 212 {*leasi} > (expr_list:REG_DEAD (reg:SI 87) > (nil))) > (gdb) call debug_rtx(use_insn) > (insn 70 69 71 9 (set (reg:SI 150) > (mem/u:SI (plus:SI (mult:SI (reg/v:SI 90 [ k ]) > (const_int 4 [0x4])) > (reg/f:SI 149)) [1 ind S4 A32])) t1.c:40 86 {*movsi_internal} > (expr_list:REG_DEAD (reg/f:SI 149) > (nil))) > and determines that propagation is not possible: > (gdb) p ok > $1 = false > but combine can do such substitution. > > This leads to undesired code motion and performance lost: > for stmt out[ind[k]] = result > before r229402 > movlind@GOTOFF(%ebx,%esi,4), %eax > movl12(%esp), %edi > movl%ebp, (%edi,%eax,4) > after r229402 > movl28(%esp), %eax > movl24(%esp), %ebx > movl(%eax,%esi,4), %eax > movl%edi, (%ebx,%eax,4) > > redundant fill has been generated by LRA. > > Since emulation combine phase is not so simple I assume that additional hook > should be added to turn off such transformation for x86 in PIE mode. I agree it's hard to mimic combine behavior here. It would be nice if we can abstract an interface out of combine so that we can test if instructions can be combined at any passes, but this sounds even more difficult... Also is it possible to give green light to got related address computation. On other targets, I fount it's better to keep these instructions inside one basic block, otherwise redundant computation or CSE opportunities might be missed among different basic blocks. Not sure if stage4 is a good time for a new hook either. Any ideas?
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #4 from amker at gcc dot gnu.org --- (In reply to Igor Zamyatin from comment #3) > (In reply to amker from comment #2) > > It's my change, I will look into it. > > Any plans on this? Sorry for late response, I will try to get to this one in one or two weeks.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Igor Zamyatin changed: What|Removed |Added CC||izamyatin at gmail dot com --- Comment #3 from Igor Zamyatin --- (In reply to amker from comment #2) > It's my change, I will look into it. Any plans on this?
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #2 from amker at gcc dot gnu.org --- It's my change, I will look into it.
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization Target||x86_64-linux-gnu Target Milestone|--- |6.0
[Bug rtl-optimization/69052] [6 Regression] Performance regression after r229402.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69052 --- Comment #1 from Yuri Rumyantsev --- Created attachment 37133 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37133=edit test-case to reproduce It should be compile with -O2 -m32 options to reproduce.