[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #14 from Richard Biener --- Author: rguenth Date: Fri Dec 4 08:09:46 2015 New Revision: 231245 URL: https://gcc.gnu.org/viewcvs?rev=231245&root=gcc&view=rev Log: 2015-12-04 Richard Biener PR middle-end/67438 * match.pd: Guard ~X cmp ~Y -> Y cmp X and the variant with a constant with single_use. Modified: trunk/gcc/ChangeLog trunk/gcc/match.pd
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 Richard Biener changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #13 from Richard Biener --- Fixed.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 Andrew Pinski changed: What|Removed |Added Target|i686|i?86-*-* Summary|[6 Regression] ~X op ~Y |[6 Regression] ~X op ~Y |pattern relocation causes |pattern relocation causes |loop performance|loop performance |degradation |degradation on 32bit x86 --- Comment #3 from Andrew Pinski --- I bet this code is slightly faster on a machine with some extra registers like even x86_64 or aarch64 or PowerPC or MIPS.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 Mikhail Maltsev changed: What|Removed |Added CC||miyuki at gcc dot gnu.org --- Comment #4 from Mikhail Maltsev --- I looked at gimple dumps. The only difference looks like this. In the "good" revision after forwprop1: : _13 = *in_2; a_14 = ~_13; _17 = MEM[(char *)in_2 + 1B]; b_18 = ~_17; in_20 = &MEM[(void *)in_2 + 3B]; _21 = MEM[(char *)in_2 + 2B]; c_22 = ~_21; if (a_14 < b_18) goto ; else goto ; In the "bad" revision this basic block is simplified: : _13 = *in_2; a_14 = ~_13; _17 = MEM[(char *)in_2 + 1B]; b_18 = ~_17; in_20 = &MEM[(void *)in_2 + 3B]; _21 = MEM[(char *)in_2 + 2B]; c_22 = ~_21; if (_13 > _17) goto ; else goto ; Next BB's are: : d_23 = MIN_EXPR ; : d_24 = MIN_EXPR ; : # d_4 = PHI The condition of "if" is not altered throughout all other passes (it gets if-converted and vectorized). Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14 gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12). For some reason this does not happen in the "bad" revision. As I understand, the problem is that if we do not fold the condition, values _13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if we do fold, they are still live (because they are used in the condition), thus, register pressure increases.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #5 from rguenther at suse dot de --- On Thu, 3 Sep 2015, miyuki at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 > > Mikhail Maltsev changed: > >What|Removed |Added > > CC||miyuki at gcc dot gnu.org > > --- Comment #4 from Mikhail Maltsev --- > I looked at gimple dumps. The only difference looks like this. In the "good" > revision after forwprop1: > > : > _13 = *in_2; > a_14 = ~_13; > _17 = MEM[(char *)in_2 + 1B]; > b_18 = ~_17; > in_20 = &MEM[(void *)in_2 + 3B]; > _21 = MEM[(char *)in_2 + 2B]; > c_22 = ~_21; > if (a_14 < b_18) > goto ; > else > goto ; > > In the "bad" revision this basic block is simplified: > > : > _13 = *in_2; > a_14 = ~_13; > _17 = MEM[(char *)in_2 + 1B]; > b_18 = ~_17; > in_20 = &MEM[(void *)in_2 + 3B]; > _21 = MEM[(char *)in_2 + 2B]; > c_22 = ~_21; > if (_13 > _17) > goto ; > else > goto ; > > Next BB's are: > > : d_23 = MIN_EXPR ; > : d_24 = MIN_EXPR ; > : # d_4 = PHI > > The condition of "if" is not altered throughout all other passes (it gets > if-converted and vectorized). > > Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14 > gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12). For some reason > this does not happen in the "bad" revision. > > As I understand, the problem is that if we do not fold the condition, values > _13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if > we do fold, they are still live (because they are used in the condition), > thus, > register pressure increases. Yes. Note that because of :s implementation details "fixing" /* Fold ~X op ~Y as Y op X. */ (for cmp (simple_comparison) (simplify (cmp (bit_not @0) (bit_not @1)) (cmp @1 @0))) with :s on the bit_not's is not going to help (because we still allow a single-stmt result as we are just replacing one with another). So :s cannot be used to guard against register pressure increase but only to guard against undoing CSE. For the case in this bug the user might have written the testcase in the way we transform it now and thus what is desirable is a pass that can reduce register pressure by expressing values in a different way. For the case above, why is a_14 = ~_13 not sunk to the edge 3->4 and b_18 = ~_17 to the edge 3->5? (yes, this creates additional BBs) This would reduce register pressure. Maybe this kind of scheduling can be considered when register pressure is high (does -fsched-pressure -fschedule-insns help?)
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #6 from Mikhail Maltsev --- (In reply to rguent...@suse.de from comment #5) > For the case above, why is a_14 = ~_13 not sunk to the edge > 3->4 and b_18 = ~_17 to the edge 3->5? (yes, this creates > additional BBs) This would reduce register pressure. I think, because a_14 and b_18 are used in the next bb. Actually I wrote only part of bb6. The full dump looks like this: : # d_4 = PHI out_26 = out_3 + 1; *out_3 = a_14; out_29 = &MEM[(void *)out_3 + 2B]; MEM[(char *)out_3 + 1B] = b_18; out_32 = &MEM[(void *)out_3 + 3B]; MEM[(char *)out_3 + 2B] = c_22; out_35 = &MEM[(void *)out_3 + 4B]; MEM[(char *)out_3 + 3B] = d_4; : # n_1 = PHI # in_2 = PHI # out_3 = PHI n_10 = n_1 + -1; if (n_10 != 0) goto ; else goto ; : return; > Maybe this kind of scheduling can be considered when register pressure > is high (does -fsched-pressure -fschedule-insns help?) Not much. With -fsched-pressure -fschedule-insns we generate 2 insns less: .L7: movzbl 0(%ebp), %edi # MEM[base: in_70, offset: 0B], D.1940 addl$3, %ebp#, in movzbl -2(%ebp), %esi # MEM[base: in_70, offset: 1B], D.1940 movl%edi, %eax # D.1940, a movzbl -1(%ebp), %edx # MEM[base: in_30, offset: 4294967295B], MEM[base: in_30, offset: 4294967295B] notl%eax# a movb%al, (%ebx) # a, MEM[base: out_71, offset: 0B] movl%esi, %ecx # D.1940, b notl%ecx# b movb%cl, 1(%ebx)# b, MEM[base: out_71, offset: 1B] notl%edx# c movb%dl, 2(%ebx)# c, MEM[base: out_71, offset: 2B] cmpb%dl, %al# c, a cmovg %edx, %eax # d,, c, d cmpb%dl, %cl# c, b movb%al, 4(%esp)# tmp277, %sfp cmovle %ecx, %edx # b,, d movl%esi, %eax # D.1940, D.1940 movl%edi, %ecx # D.1940, D.1940 addl$4, %ebx#, out cmpb%al, %cl# D.1940, D.1940 movzbl 4(%esp), %eax # %sfp, d cmovg %eax, %edx # d,, d cmpl8(%esp), %ebp # %sfp, in movb%dl, -1(%ebx) # d, MEM[base: out_11, offset: 4294967295B] jne .L7 #, I wonder, whether a transformation like this could help: bb1: x = min(a, c) goto bb3 bb2: y = min(b, c) goto bb3 bb3: z = phi(x, y) // x and y are single-use ---> bb1: x = a goto bb3 bb2: y = b goto bb3 bb3: z' = phi(x, y) z = min(z', c) Though if we don't simplify phi(x, y), we would increase register pressure even more.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #7 from Alexander Fomin --- Looks like a cost model should be introduced to avoid such kind of transformations for targets with HW min/max implementation.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #8 from graham.stott at btinternet dot com --- Sent from Samsung Mobile on O2 Original message From: "afomin.mailbox at gmail dot com" Date:07/09/2015 13:35 (GMT+00:00) To: gcc-bugs@gcc.gnu.org Subject: [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #7 from Alexander Fomin --- Looks like a cost model should be introduced to avoid such kind of transformations for targets with HW min/max implementation.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 Richard Biener changed: What|Removed |Added Target Milestone|--- |6.0
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #9 from Yuri Rumyantsev --- It looks like such transformation is profitable if only def statements have a single use, i.e. it looks reasonable for if (255 - a) > (255 -b) /* a,b have char type. */ but it does not look reasonable for attached test-case since after it we missed min/max recognition, namely, c = 255 - r; /* c has mulitple uses! */ m = 255 - g; /* likewise. */ y = 255 - b; /* likewise. */ if (c < m) k = MIN (c, y); else k = MIN (m, y); *write++ = c - k;
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #10 from rguenther at suse dot de --- On Tue, 17 Nov 2015, ysrumyan at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 > > --- Comment #9 from Yuri Rumyantsev --- > It looks like such transformation is profitable if only def statements have a > single use, i.e. it looks reasonable for >if (255 - a) > (255 -b) /* a,b have char type. */ > but it does not look reasonable for attached test-case since after it we > missed > min/max recognition, namely, > > c = 255 - r; /* c has mulitple uses! */ > m = 255 - g; /* likewise. */ > y = 255 - b; /* likewise. */ > if (c < m) > k = MIN (c, y); > else > k = MIN (m, y); > *write++ = c - k; Looks like we are missing the corresponding pattern for MIN/MAX instead. MIN (~X, ~Y) -> ~MAX (X, Y) MAX (~X, ~Y) -> ~MIN (X, Y) (for minmax (min max) maxmin (max min) (simplify (minmax (bit_not @0) (bit_not @1)) (bit_not (maxmin @0 @1 maybe that helps. I notice a missed optimization to combine the test with the two MINs on the GIMPLE level anyway.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #11 from Yuri Rumyantsev --- In fact, the problem is quite different although it is caused by non-profitable pattern matching ~X CMP ~Y -> Y CMP X. In general this pattern may be helpful if we can delete not operation, e.g. x1 = ~x; y1 = ~y; if (x1 y1) ... and there no any other uses of x1 and y1, i.e. x1 and y1 have single use. But if this is not truth we will increase register pressure since we can not use the same register for x,x1 and y,y1. Richard proposed to use the same simplification for min/max operations but in original test-case nested min/max operation (min(x,min(y,z)) or multi operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does such transformation) and so this won't help since we have the same register pressure issue: c = ~r; m = ~g; y = ~b; k = min(c, m, y); *out++ = c - k; *out++ = m - k; *out++ = y - k; *out++ = k; and we can see that value of 'c' is used in min computation and resulting store, so if we will use r g comparison we will increase live range for r, g, b variables and additional registers will require for them (till comparison). Note also that there exists another issue with path-splitting (aka tail duplication) which duplicate loop back edge and in fact move tail block to hammock. This transformation does not loop useful (at least at given stage of design) but this is another topic for discussion. I'd like to propose to introduce new predicate for pattern matching which tells us how much uses have left-hand side of ~x.
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 Richard Biener changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-11-25 Ever confirmed|0 |1 --- Comment #12 from Richard Biener --- (In reply to Yuri Rumyantsev from comment #11) > In fact, the problem is quite different although it is caused by > non-profitable pattern matching ~X CMP ~Y -> Y CMP X. In general this > pattern may be helpful if we can delete not operation, e.g. > x1 = ~x; > y1 = ~y; > if (x1 y1) ... and there no any other uses of x1 and y1, i.e. x1 and > y1 have single use. But if this is not truth we will increase register > pressure since we can not use the same register for x,x1 and y,y1. > > Richard proposed to use the same simplification for min/max operations but > in original test-case nested min/max operation (min(x,min(y,z)) or multi > operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does > such transformation) Can you file an enhancement bug for this? Best with a testcase. AFAICS a full solution will have pieces in phi-opt and reassoc at least. > and so this won't help since we have the same register > pressure issue: > c = ~r; > m = ~g; > y = ~b; > k = min(c, m, y); > *out++ = c - k; > *out++ = m - k; > *out++ = y - k; > *out++ = k; > and we can see that value of 'c' is used in min computation and resulting > store, so if we will use r g comparison we will increase live range > for r, g, b variables and additional registers will require for them (till > comparison). > Note also that there exists another issue with path-splitting (aka tail > duplication) which duplicate loop back edge and in fact move tail block to > hammock. This transformation does not loop useful (at least at given stage > of design) but this is another topic for discussion. > > I'd like to propose to introduce new predicate for pattern matching which > tells us how much uses have left-hand side of ~x. There are examples in match.pd that use single_use () in conditions, doing that would fix this issue. Note that generally constraining patterns to "single-use" operands misses the case where applying (the same) pattern(s) at multiple locations may effectively make operands "single-use". Currently match.pd patterns are applied one-by-one (without fully cleaning up dead stmts) in tree-ssa-forwprop.c which will miss opportunities because of this. Thus there is no "global" analysis done to determine whether an operand becomes dead after applying (multiple) pattern(s) (which is the point of single-use checks).
[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438 --- Comment #15 from Andrew Pinski --- (In reply to Yuri Rumyantsev from comment #11) > Richard proposed to use the same simplification for min/max operations but > in original test-case nested min/max operation (min(x,min(y,z)) or multi > operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does > such transformation) and so this won't help since we have the same register > pressure issue: > c = ~r; > m = ~g; > y = ~b; > k = min(c, m, y); > *out++ = c - k; > *out++ = m - k; > *out++ = y - k; > *out++ = k; This is now recognized since GCC 13 (by r13-1950-g9bb19e143cfe88 and improved for GCC 14 by r14-337-gc43819a9b4cdaa). Now there is a missing MIN/MAX detection still: int f(int a, int b, int c) { int at = ~a; int bt = ~b; int ct = ~c; int t = a < b ? at : bt; return t; } Which is not detected until phiopt4. I will file a bug about that. I think once that is fixed I think we might be able to remove the single_use again.