[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #12 from Richard Biener  ---
Author: rguenth
Date: Fri Jan 26 10:30:36 2018
New Revision: 257077

URL: https://gcc.gnu.org/viewcvs?rev=257077&root=gcc&view=rev
Log:
2018-01-26  Richard Biener  

PR tree-optimization/81082
* fold-const.c (fold_plusminus_mult_expr): Do not perform the
association if it requires casting to unsigned.
* match.pd ((A * C) +- (B * C) -> (A+-B)): New patterns derived
from fold_plusminus_mult_expr to catch important cases late when
range info is available.

* gcc.dg/vect/pr81082.c: New testcase.
* gcc.dg/tree-ssa/loop-15.c: XFAIL the (int)((unsigned)n + -1U) * n + n
simplification to n * n.

Added:
trunk/gcc/testsuite/gcc.dg/vect/pr81082.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/fold-const.c
trunk/gcc/match.pd
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-15.c

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Richard Biener  ---
Fixed.

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener  changed:

   What|Removed |Added

 CC||kristerw at gcc dot gnu.org

--- Comment #10 from Richard Biener  ---
*** Bug 81554 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-23 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #9 from Marc Glisse  ---
+  (if (! INTEGRAL_TYPE_P (type)

Integer vectors satisfy this condition...
Also, floats need some check (I don't know which one is appropriate).

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-23 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #8 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #6)
> So moving the transform to match.pd will only have an effect in late VRP
> given we need loop header copying to derive a range for the bNs.
> 
> The following is what I've done, remove the problematic foldings from
> fold-const.c and re-instantiate the full transform (but not the
> factoring of power of twos) in match.pd.  It vectorizes this testcase
> again and restores Himeno performance (PR81554)

Thanks for looking at this.  Himeno was (as you probably guessed) the original
benchmark from which the testcase was reduced.  The impact for us was much
higher than the one in PR81554.

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #7 from Richard Biener  ---
FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized " + " 0
(found 2 times)

the loop is gone but we end up with unfolded

  _1 = (unsigned int) n_5;
  _10 = _1 + 4294967295;
  _6 = (int) _10;
  _13 = n_5 * _6;
  j_2 = n_5 + _13;

which, when you decipher it, is just (n * (n - 1)) + n == n * n

FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-not optimized "* 6"
FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-times optimized " * 2" 3
(found 4 times)
FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-times optimized "a_..D. * 5"
3 (found 0 times)

my patterns are incomplete and don't, for example, handle

  _1 = a_2(D) * 6;
  _3 = _1 - a_2(D);

where it isn't important whether a_2 is zero or -1.  Leaving these cases
to fold-const.c might be best at this point(?).

FAIL: gcc.dg/tree-ssa/pr63586-2.c scan-tree-dump-times reassoc1 "* 6" 1
(found 0 times)

Similar.

FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "* 4" 1
(found 4 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 12|,
12>" 1 (found 0 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 4|,
4>" 2 (found 1 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 8|,
8>" 1 (found 0 times)
FAIL: gcc.dg/tree-ssa/slsr-4.c scan-tree-dump-times optimized "* 40" 1
(found 2 times)
FAIL: gcc.dg/tree-ssa/slsr-4.c scan-tree-dump-times optimized "+ 40" 1
(found 0 times)

FAIL: gfortran.dg/reassoc_4.f   -O   scan-tree-dump-times reassoc1 "[0-9] * " 22 (found 23 times)

That leaves us with those FAILs.  The adjusted fold-const.c hunk would thus be:

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 256977)
+++ gcc/fold-const.c	(working copy)
@@ -7097,7 +7097,7 @@ fold_plusminus_mult_expr (location_t loc

   /* Same may be zero and thus the operation 'code' may overflow.  Likewise
  same may be minus one and thus the multiplication may overflow.  Perform
- the operations in an unsigned type.  */
+ the sum operation in an unsigned type.  */
   tree utype = unsigned_type_for (type);
   tree tem = fold_build2_loc (loc, code, utype,
  fold_convert_loc (loc, utype, alt0),
@@ -7110,9 +7110,9 @@ fold_plusminus_mult_expr (location_t loc
 return fold_build2_loc (loc, MULT_EXPR, type,
fold_convert (type, tem), same);

-  return fold_convert_loc (loc, type,
-			   fold_build2_loc (loc, MULT_EXPR, utype, tem,
-					    fold_convert_loc (loc, utype, same)));
+  /* Do not resort to unsigned multiplication because
+ we lose the no-overflow property of the expression.  */
+  return NULL_TREE;
 }

 /* Subroutine of native_encode_expr.  Encode the INTEGER_CST

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #6 from Richard Biener  ---
So moving the transform to match.pd will only have an effect in late VRP
given we need loop header copying to derive a range for the bNs.

The following is what I've done: remove the problematic foldings from
fold-const.c and re-instantiate the full transform (but not the factoring
of powers of two) in match.pd.  It vectorizes this testcase again and
restores Himeno performance (PR81554).  The dump differences are too big
to see what helps, but I can say the match.pd patterns mostly apply via
GENERIC (SCEV/IVOPTs), and there are two late applications in VRP2 and
forwprop4 on GIMPLE.  Note that all but one match the a*b + a cases; the
single a*b + a*c case is from forwprop4.
I'm quite sure the patterns are confused by association, say,
a*(b*c) + a*b, where for signed arithmetic reassoc doesn't "fix" this up.
But the fold-const.c implementation has exactly the same issue.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 256977)
+++ gcc/fold-const.c	(working copy)
@@ -7095,24 +7095,7 @@ fold_plusminus_mult_expr (location_t loc
 fold_convert_loc (loc, type, alt1)),
fold_convert_loc (loc, type, same));

-  /* Same may be zero and thus the operation 'code' may overflow.  Likewise
- same may be minus one and thus the multiplication may overflow.  Perform
- the operations in an unsigned type.  */
-  tree utype = unsigned_type_for (type);
-  tree tem = fold_build2_loc (loc, code, utype,
- fold_convert_loc (loc, utype, alt0),
- fold_convert_loc (loc, utype, alt1));
-  /* If the sum evaluated to a constant that is not -INF the multiplication
- cannot overflow.  */
-  if (TREE_CODE (tem) == INTEGER_CST
-  && (wi::to_wide (tem)
- != wi::min_value (TYPE_PRECISION (utype), SIGNED)))
-return fold_build2_loc (loc, MULT_EXPR, type,
-   fold_convert (type, tem), same);
-
-  return fold_convert_loc (loc, type,
-			   fold_build2_loc (loc, MULT_EXPR, utype, tem,
-					    fold_convert_loc (loc, utype, same)));
+  return NULL_TREE;
 }

 /* Subroutine of native_encode_expr.  Encode the INTEGER_CST
Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 256977)
+++ gcc/match.pd	(working copy)
@@ -4617,3 +4617,29 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
|| wi::geu_p (wi::to_wide (@rpos),
  wi::to_wide (@ipos) + isize))
 (BIT_FIELD_REF @0 @rsize @rpos)
+
+(for plusminus (plus minus)
+ (simplify
+  (plusminus (mult @0 @1) (mult:c @0 @2))
+  (if (! INTEGRAL_TYPE_P (type)
+       || TYPE_OVERFLOW_WRAPS (type)
+       || (tree_expr_nonzero_p (@0)
+           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
+   (mult (plusminus @1 @2) @0)))
+ (simplify
+  (plusminus @0 (mult:c @0 @2))
+  /* We cannot generate constant 1 for fract.  */
+  (if (! ALL_FRACT_MODE_P (TYPE_MODE (type))
+       && (! INTEGRAL_TYPE_P (type)
+           || TYPE_OVERFLOW_WRAPS (type)
+           || (tree_expr_nonzero_p (@0)
+               && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))))))
+   (mult (plusminus { build_one_cst (type); } @2) @0)))
+ (simplify
+  (plusminus (mult:c @0 @2) @0)
+  (if (! ALL_FRACT_MODE_P (TYPE_MODE (type))
+       && (! INTEGRAL_TYPE_P (type)
+           || TYPE_OVERFLOW_WRAPS (type)
+           || (tree_expr_nonzero_p (@0)
+               && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))))))
+   (mult (plusminus @2 { build_one_cst (type); }) @0))))


Now testing the patch looking for testsuite fallout.

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
So I think the solution will be to remove this transform from early folding
and make reassoc do this transform - which it already can do, but only for
unsigned arithmetic.

I'll give teaching reassoc to operate on signed arithmetic a quick shot...

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #4 from Richard Biener  ---
Created attachment 43003
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43003=edit
patch introducing "late" gimple with -fwrapv semantics

So that was for another PR, PR81554.

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2018-01-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #3 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #2)
> So is there anything we can do here?
> Isn't the bigger problem that we no longer notice the multiplications can't
> overflow?

Yes, that's the issue with all arithmetic we drop to unsigned because of
association.  It's always a trade-off ...

>  With the int multiplications gone, we need to assume [1, INT_MAX]
> ranges on b1, b2 and b3 in the loop (otherwise it wouldn't loop at all), and
> [0, INT_MAX - 1] ranges for i1, i2 and i3.  But, from the i1 * b2 * b3 + i2
> * b3 + (i3 - 1) we could have determined that b1 * b2 * b3 is [0, INT_MAX].
> 
> We do vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
> for (int i2 = 0; i2 < b2; ++i2)
>   for (int i3 = 0; i3 < b3; ++i3)
> foo += x[(i1 * b2 + i2) * b3 + (i3 - 1)];
>   return foo;
> }
> without gather loads, while the #c0 only with them (if available).
> 
> Regarding the PR66313 optimization, first of all, what happened to the move
> of it to match.pd?

Given we have the reassoc pass doing this "properly" (well, but not for
signed integer types...) I have not pursued fully moving it.  We might
want to move some of the more "obvious" cases (but IIRC some cases involving
constants are already handled).

So, the ultimate plan is to make reassoc stronger, rewriting associated
signed arithmetic to unsigned where required.

> As we vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
> for (int i2 = 0; i2 < b2; ++i2)
>   for (int i3 = 0; i3 < b3; ++i3)
>   {
> int t1 = i1 * b2 * b3;
> int t2 = i2 * b3;
> int t3 = i3 - 1;
> foo += x[t1 + t2 + t3];
>   }
>   return foo;
> }
> 
> it seems that hasn't been done.
> 
> Second thing, in fold_plusminus_mult_expr I don't understand the:
>   /* We are neither factoring zero nor minus one.  */
>   || TREE_CODE (same) == INTEGER_CST)
> part, shouldn't that be || (TREE_CODE (same) == INTEGER_CST &&
> !integer_zerop (same) && !integer_minus_onep (same)) ?

The comment says that same == 0 or same == -1 don't happen by construction.

> Another thing is, for signed a, b, c we can't optimize a * b + a * c as
> a * (b + c) if a can be -1 or 0, exhaustive proof for signed char:
> int
> main ()
> {
>   int a, b, c;
>   for (a = -128; a < 128; a++)
> for (b = -128; b < 128; b++)
>   for (c = -128; c < 128; c++)
>   {
> int ab = a * b;
> int ac = a * c;
> int ab_ac = ab + ac;
> if (ab < -128 || ab >= 128 || ac < -128 || ac >= 128 || ab_ac < -128 || ab_ac >= 128)
>   continue;
> /* Ok, in this case a * b + a * c is without UB.  */
> int b_c = b + c;
> #ifdef NEW
> b_c = (signed char) b_c;
> #endif
> int r = a * b_c;
> if (b_c < -128 || b_c >= 128 || r != ab_ac)
>   __builtin_printf ("%d * %d + %d * %d = %d wrong opt (%d)\n", a, b, a, c, ab_ac, r);
>   }
>   return 0;
> }
> 
> But, if we know for sure that a is not -1, we could optimize it as
>   a * (signed)((unsigned) b + c)
> (which doesn't work for a == -1, but works for others) instead of
>   (signed)(a * ((unsigned) b + c))
> (which works for all, but is a too big hammer).
> 
> Perhaps if this optimization is moved over to match.pd and we'd defer it if
> a isn't constant until at least vrp1 and tune based on range info?
> The range for b3 at least during vrp pass itself is [1, INT_MAX], so neither
> -1 nor 0 are possible and we could safely reassociate it as a * (b + c)
> rather than the variants with unsigned.
> 
> Or do we want to somehow mark the MULT_EXPR we've created, propagate it
> through the IL and if we in vrp notice this way marked MULT_EXPR (or IFN) we
> fold it as integer multiplication?  That would be ugly.

I don't really see a good solution to this PR either :/

Delaying the folding to GIMPLE would be ok with me but as said I'd like to
see reassoc enhanced rather than stuffing more and more "complex" patterns
to match.pd.

Ultimately we'd want to compute SCEV for all relevant IVs early and
cache them.  That should be easy when only constants are involved but
will be tricky (impossible) when any symbols are.  In the case of this
testcase it's all constants of course.

I once did experiment with "lowering" GIMPLE to -fwrapv (as RTL is) before
late reassoc and that "worked" (well, it of course regressed stuff, and it
was merely to prove reassoc of signed arithmetic is useful).

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2017-12-20 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek  ---
So is there anything we can do here?
Isn't the bigger problem that we no longer notice the multiplications can't
overflow?  With the int multiplications gone, we need to assume [1, INT_MAX]
ranges on b1, b2 and b3 in the loop (otherwise it wouldn't loop at all), and
[0, INT_MAX - 1] ranges for i1, i2 and i3.  But, from the i1 * b2 * b3 + i2 *
b3 + (i3 - 1) we could have determined that b1 * b2 * b3 is [0, INT_MAX].

We do vectorize:
int
f (int *x, int b1, int b2, int b3)
{
  int foo = 0;
  for (int i1 = 0; i1 < b1; ++i1)
for (int i2 = 0; i2 < b2; ++i2)
  for (int i3 = 0; i3 < b3; ++i3)
foo += x[(i1 * b2 + i2) * b3 + (i3 - 1)];
  return foo;
}
without gather loads, while the #c0 only with them (if available).

Regarding the PR66313 optimization, first of all, what happened to the move of
it to match.pd?
As we vectorize:
int
f (int *x, int b1, int b2, int b3)
{
  int foo = 0;
  for (int i1 = 0; i1 < b1; ++i1)
for (int i2 = 0; i2 < b2; ++i2)
  for (int i3 = 0; i3 < b3; ++i3)
{
  int t1 = i1 * b2 * b3;
  int t2 = i2 * b3;
  int t3 = i3 - 1;
  foo += x[t1 + t2 + t3];
}
  return foo;
}

it seems that hasn't been done.

Second thing, in fold_plusminus_mult_expr I don't understand the:
  /* We are neither factoring zero nor minus one.  */
  || TREE_CODE (same) == INTEGER_CST)
part, shouldn't that be || (TREE_CODE (same) == INTEGER_CST && !integer_zerop
(same) && !integer_minus_onep (same)) ?

Another thing is, for signed a, b, c we can't optimize a * b + a * c as
a * (b + c) if a can be -1 or 0, exhaustive proof for signed char:
int
main ()
{
  int a, b, c;
  for (a = -128; a < 128; a++)
for (b = -128; b < 128; b++)
  for (c = -128; c < 128; c++)
{
  int ab = a * b;
  int ac = a * c;
  int ab_ac = ab + ac;
  if (ab < -128 || ab >= 128 || ac < -128 || ac >= 128 || ab_ac < -128
|| ab_ac >= 128)
continue;
  /* Ok, in this case a * b + a * c is without UB.  */
  int b_c = b + c;
#ifdef NEW
  b_c = (signed char) b_c;
#endif
  int r = a * b_c;
  if (b_c < -128 || b_c >= 128 || r != ab_ac)
__builtin_printf ("%d * %d + %d * %d = %d wrong opt (%d)\n", a, b,
a, c, ab_ac, r);
}
  return 0;
}

But, if we know for sure that a is not -1, we could optimize it as
  a * (signed)((unsigned) b + c)
(which doesn't work for a == -1, but works for others) instead of
  (signed)(a * ((unsigned) b + c))
(which works for all, but is a too big hammer).

Perhaps if this optimization is moved over to match.pd and we'd defer it if a
isn't constant until at least vrp1 and tune based on range info?
The range for b3 at least during vrp pass itself is [1, INT_MAX], so neither -1
nor 0 are possible and we could safely reassociate it as a * (b + c) rather
than the variants with unsigned.

Or do we want to somehow mark the MULT_EXPR we've created, propagate it through
the IL and if we in vrp notice this way marked MULT_EXPR (or IFN) we fold it as
integer multiplication?  That would be ugly.

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2017-06-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-06-13
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Confirmed.  SCEV finds (rightfully so) that the evolution of the base is no
longer affine (it may wrap).

[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation

2017-06-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||rguenth at gcc dot gnu.org
Version|unknown |8.0
   Target Milestone|--- |8.0
Summary|[7 Regression] Failure to   |[8 Regression] Failure to
   |vectorise after |vectorise after
   |reassociating index |reassociating index
   |computation |computation