[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-12-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #14 from Richard Biener  ---
Author: rguenth
Date: Fri Dec  4 08:09:46 2015
New Revision: 231245

URL: https://gcc.gnu.org/viewcvs?rev=231245&root=gcc&view=rev
Log:
2015-12-04  Richard Biener  

PR middle-end/67438
* match.pd: Guard ~X cmp ~Y -> Y cmp X and the variant with
a constant with single_use.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/match.pd

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-12-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from Richard Biener  ---
Fixed.

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-02 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Andrew Pinski  changed:

   What|Removed |Added

 Target|i686|i?86-*-*
Summary|[6 Regression] ~X op ~Y |[6 Regression] ~X op ~Y
   |pattern relocation causes   |pattern relocation causes
   |loop performance|loop performance
   |degradation |degradation on 32bit x86

--- Comment #3 from Andrew Pinski  ---
I bet this code is slightly faster on a machine with some extra registers like
even x86_64 or aarch64 or PowerPC or MIPS.


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-02 Thread miyuki at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Mikhail Maltsev  changed:

   What|Removed |Added

 CC||miyuki at gcc dot gnu.org

--- Comment #4 from Mikhail Maltsev  ---
I looked at gimple dumps. The only difference looks like this. In the "good"
revision after forwprop1:

  :
  _13 = *in_2;
  a_14 = ~_13;
  _17 = MEM[(char *)in_2 + 1B];
  b_18 = ~_17;
  in_20 = &MEM[(void *)in_2 + 3B];
  _21 = MEM[(char *)in_2 + 2B];
  c_22 = ~_21;
  if (a_14 < b_18)
goto ;
  else
goto ;

In the "bad" revision this basic block is simplified:

  :
  _13 = *in_2;
  a_14 = ~_13;
  _17 = MEM[(char *)in_2 + 1B];
  b_18 = ~_17;
  in_20 = &MEM[(void *)in_2 + 3B];
  _21 = MEM[(char *)in_2 + 2B];
  c_22 = ~_21;
  if (_13 > _17)
goto ;
  else
goto ;

Next BB's are:

  : d_23 = MIN_EXPR ;
  : d_24 = MIN_EXPR ;
  : # d_4 = PHI 

The condition of "if" is not altered throughout all other passes (it gets
if-converted and vectorized).

Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14
gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12). For some reason
this does not happen in the "bad" revision.

As I understand, the problem is that if we do not fold the condition, values
_13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if
we do fold, they are still live (because they are used in the condition), thus,
register pressure increases.


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-03 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #5 from rguenther at suse dot de  ---
On Thu, 3 Sep 2015, miyuki at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438
> 
> Mikhail Maltsev  changed:
> 
>What|Removed |Added
> 
>  CC||miyuki at gcc dot gnu.org
> 
> --- Comment #4 from Mikhail Maltsev  ---
> I looked at gimple dumps. The only difference looks like this. In the "good"
> revision after forwprop1:
> 
>   :
>   _13 = *in_2;
>   a_14 = ~_13;
>   _17 = MEM[(char *)in_2 + 1B];
>   b_18 = ~_17;
>   in_20 = &MEM[(void *)in_2 + 3B];
>   _21 = MEM[(char *)in_2 + 2B];
>   c_22 = ~_21;
>   if (a_14 < b_18)
> goto ;
>   else
> goto ;
> 
> In the "bad" revision this basic block is simplified:
> 
>   :
>   _13 = *in_2;
>   a_14 = ~_13;
>   _17 = MEM[(char *)in_2 + 1B];
>   b_18 = ~_17;
>   in_20 = &MEM[(void *)in_2 + 3B];
>   _21 = MEM[(char *)in_2 + 2B];
>   c_22 = ~_21;
>   if (_13 > _17)
> goto ;
>   else
> goto ;
> 
> Next BB's are:
> 
>   : d_23 = MIN_EXPR ;
>   : d_24 = MIN_EXPR ;
>   : # d_4 = PHI 
> 
> The condition of "if" is not altered throughout all other passes (it gets
> if-converted and vectorized).
> 
> Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14
> gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12). For some reason
> this does not happen in the "bad" revision.
> 
> As I understand, the problem is that if we do not fold the condition, values
> _13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if
> we do fold, they are still live (because they are used in the condition), 
> thus,
> register pressure increases.

Yes.  Note that because of :s implementation details "fixing"

/* Fold ~X op ~Y as Y op X.  */
(for cmp (simple_comparison)
 (simplify
  (cmp (bit_not @0) (bit_not @1))
  (cmp @1 @0)))

with :s on the bit_not's is not going to help (because we still allow
a single-stmt result as we are just replacing one with another).  So
:s cannot be used to guard against register pressure increase but
only to guard against undoing CSE.

For the case in this bug the user might have written the testcase
in the way we transform it now and thus what is desirable is a pass
that can reduce register pressure by expressing values in a different
way.

For the case above, why is a_14 = ~_13 not sunk to the edge
3->4 and b_18 = ~_17 to the edge 3->5?  (yes, this creates
additional BBs)  This would reduce register pressure.  Maybe
this kind of scheduling can be considered when register pressure
is high (does -fsched-pressure -fschedule-insns help?)


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-03 Thread miyuki at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #6 from Mikhail Maltsev  ---
(In reply to rguent...@suse.de from comment #5)
> For the case above, why is a_14 = ~_13 not sunk to the edge
> 3->4 and b_18 = ~_17 to the edge 3->5?  (yes, this creates
> additional BBs)  This would reduce register pressure.

I think, because a_14 and b_18 are used in the next bb. Actually I wrote only
part of bb6. The full dump looks like this:

  :
  # d_4 = PHI 
  out_26 = out_3 + 1;
  *out_3 = a_14;
  out_29 = &MEM[(void *)out_3 + 2B];
  MEM[(char *)out_3 + 1B] = b_18;
  out_32 = &MEM[(void *)out_3 + 3B];
  MEM[(char *)out_3 + 2B] = c_22;
  out_35 = &MEM[(void *)out_3 + 4B];
  MEM[(char *)out_3 + 3B] = d_4;

  :
  # n_1 = PHI 
  # in_2 = PHI 
  # out_3 = PHI 
  n_10 = n_1 + -1;
  if (n_10 != 0)
goto ;
  else
goto ;

  :
  return;


> Maybe this kind of scheduling can be considered when register pressure
> is high (does -fsched-pressure -fschedule-insns help?)

Not much. With -fsched-pressure -fschedule-insns we generate 2 insns less:

.L7:
movzbl  0(%ebp), %edi   # MEM[base: in_70, offset: 0B], D.1940
addl$3, %ebp#, in
movzbl  -2(%ebp), %esi  # MEM[base: in_70, offset: 1B], D.1940
movl%edi, %eax  # D.1940, a
movzbl  -1(%ebp), %edx  # MEM[base: in_30, offset: 4294967295B],
MEM[base: in_30, offset: 4294967295B]
notl%eax# a
movb%al, (%ebx) # a, MEM[base: out_71, offset: 0B]
movl%esi, %ecx  # D.1940, b
notl%ecx# b
movb%cl, 1(%ebx)# b, MEM[base: out_71, offset: 1B]
notl%edx# c
movb%dl, 2(%ebx)# c, MEM[base: out_71, offset: 2B]
cmpb%dl, %al# c, a
cmovg   %edx, %eax  # d,, c, d
cmpb%dl, %cl# c, b
movb%al, 4(%esp)# tmp277, %sfp
cmovle  %ecx, %edx  # b,, d
movl%esi, %eax  # D.1940, D.1940
movl%edi, %ecx  # D.1940, D.1940
addl$4, %ebx#, out
cmpb%al, %cl# D.1940, D.1940
movzbl  4(%esp), %eax   # %sfp, d
cmovg   %eax, %edx  # d,, d
cmpl8(%esp), %ebp   # %sfp, in
movb%dl, -1(%ebx)   # d, MEM[base: out_11, offset: 4294967295B]
jne .L7 #,

I wonder, whether a transformation like this could help:

bb1:
  x = min(a, c)
  goto bb3
bb2:
  y = min(b, c)
  goto bb3
bb3:
  z = phi(x, y) // x and y are single-use

--->

bb1:
  x = a
  goto bb3
bb2:
  y = b
  goto bb3
bb3:
  z' = phi(x, y)
  z = min(z', c)

Though if we don't simplify phi(x, y), we would increase register pressure even
more.


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-07 Thread afomin.mailbox at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #7 from Alexander Fomin  ---
Looks like a cost model should be introduced to avoid such kind of
transformations for targets with HW min/max implementation.


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-07 Thread graham.stott at btinternet dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #8 from graham.stott at btinternet dot com ---
Sent from Samsung Mobile on O2

 Original message From: "afomin.mailbox at
gmail dot com"  Date:07/09/2015  13:35 
(GMT+00:00) To: gcc-bugs@gcc.gnu.org Subject: [Bug
middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop
performance degradation on 32bit x86 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #7 from Alexander Fomin  ---
Looks like a cost model should be introduced to avoid such kind of
transformations for targets with HW min/max implementation.


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-09-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |6.0


[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-11-17 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #9 from Yuri Rumyantsev  ---
It looks like such transformation is profitable if only def statements have a
single use, i.e. it looks reasonable for 
   if (255 - a) > (255 -b) /* a,b have char type.  */
but it does not look reasonable for attached test-case since after it we missed
min/max recognition, namely,

c = 255 - r; /* c has mulitple uses!  */
m = 255 - g; /* likewise.  */
y = 255 - b; /* likewise.  */
if (c < m) 
  k = MIN (c, y);
else
  k = MIN (m, y);
*write++ = c - k;

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-11-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #10 from rguenther at suse dot de  ---
On Tue, 17 Nov 2015, ysrumyan at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438
> 
> --- Comment #9 from Yuri Rumyantsev  ---
> It looks like such transformation is profitable if only def statements have a
> single use, i.e. it looks reasonable for 
>if (255 - a) > (255 -b) /* a,b have char type.  */
> but it does not look reasonable for attached test-case since after it we 
> missed
> min/max recognition, namely,
> 
> c = 255 - r; /* c has mulitple uses!  */
> m = 255 - g; /* likewise.  */
> y = 255 - b; /* likewise.  */
> if (c < m) 
>   k = MIN (c, y);
> else
>   k = MIN (m, y);
> *write++ = c - k;

Looks like we are missing the corresponding pattern for MIN/MAX
instead.

 MIN (~X, ~Y) -> ~MAX (X, Y)
 MAX (~X, ~Y) -> ~MIN (X, Y)

(for minmax (min max)
 maxmin (max min)
 (simplify
  (minmax (bit_not @0) (bit_not @1))
  (bit_not (maxmin @0 @1

maybe that helps.  I notice a missed optimization to combine
the test with the two MINs on the GIMPLE level anyway.

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-11-24 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #11 from Yuri Rumyantsev  ---
In fact, the problem is quite different although it is caused by non-profitable
pattern matching ~X CMP ~Y -> Y CMP X. In general this pattern may be helpful
if we can delete not operation, e.g.
  x1 = ~x;
  y1 = ~y;
  if (x1  y1) ... and there no any other uses of x1 and y1, i.e. x1 and y1
have single use. But if this is not truth we will increase register pressure
since we can not use the same register for x,x1 and y,y1.

Richard proposed to use the same simplification for min/max operations but
in original test-case nested min/max operation (min(x,min(y,z)) or multi
operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does such
transformation) and so this won't help since we have the same register pressure
issue:
c = ~r; 
m = ~g;
y = ~b;
k = min(c, m, y);
*out++ = c - k;
*out++ = m - k;
*out++ = y - k;
*out++ = k;
and we can see that value of 'c' is used in min computation and resulting
store, so if we will use r  g comparison we will increase live range for
r, g, b variables and additional registers will require for them (till
comparison).
Note also that there exists another issue with path-splitting (aka tail
duplication) which duplicate loop back edge and in fact move tail block to
hammock. This transformation does not loop useful (at least at given stage of
design) but this is another topic for discussion.

I'd like to propose to introduce new predicate for pattern matching which tells
us how much uses have left-hand side of ~x.

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2015-11-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-11-25
 Ever confirmed|0   |1

--- Comment #12 from Richard Biener  ---
(In reply to Yuri Rumyantsev from comment #11)
> In fact, the problem is quite different although it is caused by
> non-profitable pattern matching ~X CMP ~Y -> Y CMP X. In general this
> pattern may be helpful if we can delete not operation, e.g.
>   x1 = ~x;
>   y1 = ~y;
>   if (x1  y1) ... and there no any other uses of x1 and y1, i.e. x1 and
> y1 have single use. But if this is not truth we will increase register
> pressure since we can not use the same register for x,x1 and y,y1.
> 
> Richard proposed to use the same simplification for min/max operations but
> in original test-case nested min/max operation (min(x,min(y,z)) or multi
> operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does
> such transformation)

Can you file an enhancement bug for this?  Best with a testcase.  AFAICS
a full solution will have pieces in phi-opt and reassoc at least.

> and so this won't help since we have the same register
> pressure issue:
> c = ~r; 
> m = ~g;
> y = ~b;
> k = min(c, m, y);
> *out++ = c - k;
> *out++ = m - k;
> *out++ = y - k;
> *out++ = k;
> and we can see that value of 'c' is used in min computation and resulting
> store, so if we will use r  g comparison we will increase live range
> for r, g, b variables and additional registers will require for them (till
> comparison).
> Note also that there exists another issue with path-splitting (aka tail
> duplication) which duplicate loop back edge and in fact move tail block to
> hammock. This transformation does not loop useful (at least at given stage
> of design) but this is another topic for discussion.
> 
> I'd like to propose to introduce new predicate for pattern matching which
> tells us how much uses have left-hand side of ~x.

There are examples in match.pd that use single_use () in conditions, doing
that would fix this issue.

Note that generally constraining patterns to "single-use" operands misses
the case where applying (the same) pattern(s) at multiple locations may
effectively make operands "single-use".  Currently match.pd patterns
are applied one-by-one (without fully cleaning up dead stmts) in
tree-ssa-forwprop.c which will miss opportunities because of this.  Thus
there is no "global" analysis done to determine whether an operand becomes
dead after applying (multiple) pattern(s) (which is the point of single-use
checks).

[Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

2023-10-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #15 from Andrew Pinski  ---
(In reply to Yuri Rumyantsev from comment #11)
> Richard proposed to use the same simplification for min/max operations but
> in original test-case nested min/max operation (min(x,min(y,z)) or multi
> operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does
> such transformation) and so this won't help since we have the same register
> pressure issue:
> c = ~r; 
> m = ~g;
> y = ~b;
> k = min(c, m, y);
> *out++ = c - k;
> *out++ = m - k;
> *out++ = y - k;
> *out++ = k;

This is now recognized since GCC 13 (by r13-1950-g9bb19e143cfe88 and improved
for GCC 14 by r14-337-gc43819a9b4cdaa).

Now there is a missing MIN/MAX detection still:
int f(int a, int b, int c)
{
int at = ~a;
int bt = ~b;
int ct = ~c;
int t = a < b ? at : bt;
return t;
}

Which is not detected until phiopt4. I will file a bug about that.
I think once that is fixed I think we might be able to remove the single_use
again.