[Bug tree-optimization/81082] [8 Regression] Failure to vectorise after reassociating index computation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #12 from Richard Biener ---
Author: rguenth
Date: Fri Jan 26 10:30:36 2018
New Revision: 257077

URL: https://gcc.gnu.org/viewcvs?rev=257077&root=gcc&view=rev
Log:
2018-01-26  Richard Biener

        PR tree-optimization/81082
        * fold-const.c (fold_plusminus_mult_expr): Do not perform the
        association if it requires casting to unsigned.
        * match.pd ((A * C) +- (B * C) -> (A+-B)): New patterns derived
        from fold_plusminus_mult_expr to catch important cases late when
        range info is available.

        * gcc.dg/vect/pr81082.c: New testcase.
        * gcc.dg/tree-ssa/loop-15.c: XFAIL the (int)((unsigned)n + -1U) * n
        + n simplification to n * n.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr81082.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-15.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED    |RESOLVED
         Resolution|---         |FIXED

--- Comment #11 from Richard Biener ---
Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |kristerw at gcc dot gnu.org

--- Comment #10 from Richard Biener ---
*** Bug 81554 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #9 from Marc Glisse ---
+  (if (! INTEGRAL_TYPE_P (type)

Integer vectors satisfy this condition...  Also, floats need some check
(I don't know which one is appropriate).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #8 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #6)
> So moving the transform to match.pd will only have an effect in late VRP
> given we need loop header copying to derive a range for the bNs.
>
> The following is what I've done: remove the problematic foldings from
> fold-const.c and re-instantiate the full transform (but not the factoring
> of powers of two) in match.pd.  It vectorizes this testcase again and
> restores Himeno performance (PR81554)

Thanks for looking at this.  Himeno was (as you probably guessed) the
original benchmark from which the testcase was reduced.  The impact for us
was much higher than the one in PR81554.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #7 from Richard Biener ---
FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized " + " 0 (found 2 times)

the loop is gone but we end up with unfolded

  _1 = (unsigned int) n_5;
  _10 = _1 + 4294967295;
  _6 = (int) _10;
  _13 = n_5 * _6;
  j_2 = n_5 + _13;

which, when you decipher it, is just (n * (n - 1)) + n == n * n

FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-not optimized "* 6"
FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-times optimized " * 2" 3 (found 4 times)
FAIL: gcc.dg/tree-ssa/pr23294.c scan-tree-dump-times optimized "a_..D. * 5" 3 (found 0 times)

my patterns are incomplete and don't, for example, handle

  _1 = a_2(D) * 6;
  _3 = _1 - a_2(D);

where it isn't important whether a_2 is zero or -1.  Leaving these cases to
fold-const.c might be best at this point(?).

FAIL: gcc.dg/tree-ssa/pr63586-2.c scan-tree-dump-times reassoc1 "* 6" 1 (found 0 times)

Similar.

FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "* 4" 1 (found 4 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 12|, 12>" 1 (found 0 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 4|, 4>" 2 (found 1 times)
FAIL: gcc.dg/tree-ssa/slsr-3.c scan-tree-dump-times optimized "+ 8|, 8>" 1 (found 0 times)
FAIL: gcc.dg/tree-ssa/slsr-4.c scan-tree-dump-times optimized "* 40" 1 (found 2 times)
FAIL: gcc.dg/tree-ssa/slsr-4.c scan-tree-dump-times optimized "+ 40" 1 (found 0 times)
FAIL: gfortran.dg/reassoc_4.f -O scan-tree-dump-times reassoc1 "[0-9] * " 22 (found 23 times)

Leaves us with those FAILs.  Thus the adjusted fold-const.c hunk would be

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c    (revision 256977)
+++ gcc/fold-const.c    (working copy)
@@ -7097,7 +7097,7 @@ fold_plusminus_mult_expr (location_t loc
       /* Same may be zero and thus the operation 'code' may overflow.  Likewise
          same may be minus one and thus the multiplication may overflow.  Perform
-         the operations in an unsigned type.  */
+         the sum operation in an unsigned type.  */
       tree utype = unsigned_type_for (type);
       tree tem = fold_build2_loc (loc, code, utype,
                                   fold_convert_loc (loc, utype, alt0),
@@ -7110,9 +7110,9 @@ fold_plusminus_mult_expr (location_t loc
         return fold_build2_loc (loc, MULT_EXPR, type,
                                 fold_convert (type, tem), same);

-      return fold_convert_loc (loc, type,
-                               fold_build2_loc (loc, MULT_EXPR, utype, tem,
-                                                fold_convert_loc (loc, utype, same)));
+      /* Do not resort to unsigned multiplication because
+         we lose the no-overflow property of the expression.  */
+      return NULL_TREE;
     }

 /* Subroutine of native_encode_expr.  Encode the INTEGER_CST
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #6 from Richard Biener ---
So moving the transform to match.pd will only have an effect in late VRP
given we need loop header copying to derive a range for the bNs.

The following is what I've done: remove the problematic foldings from
fold-const.c and re-instantiate the full transform (but not the factoring
of powers of two) in match.pd.  It vectorizes this testcase again and
restores Himeno performance (PR81554); dump differences are too big to see
what helps, but I can say the match.pd patterns mostly apply via GENERIC
(SCEV/IVOPTs) and there are two late applications in VRP2 and forwprop4 on
GIMPLE.  Note that all but one match the a*b + a cases; the single
a*b + a*c case is from forwprop4.

I'm quite sure the patterns are confused by association, say,
a*(b*c) + a*b, where for signed arithmetic reassoc doesn't "fix" this up.
But the fold-const.c implementation has exactly the same issue.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c    (revision 256977)
+++ gcc/fold-const.c    (working copy)
@@ -7095,24 +7095,7 @@ fold_plusminus_mult_expr (location_t loc
                                          fold_convert_loc (loc, type, alt1)),
                             fold_convert_loc (loc, type, same));

-      /* Same may be zero and thus the operation 'code' may overflow.  Likewise
-         same may be minus one and thus the multiplication may overflow.  Perform
-         the operations in an unsigned type.  */
-      tree utype = unsigned_type_for (type);
-      tree tem = fold_build2_loc (loc, code, utype,
-                                  fold_convert_loc (loc, utype, alt0),
-                                  fold_convert_loc (loc, utype, alt1));
-      /* If the sum evaluated to a constant that is not -INF the multiplication
-         cannot overflow.  */
-      if (TREE_CODE (tem) == INTEGER_CST
-          && (wi::to_wide (tem)
-              != wi::min_value (TYPE_PRECISION (utype), SIGNED)))
-        return fold_build2_loc (loc, MULT_EXPR, type,
-                                fold_convert (type, tem), same);
-
-      return fold_convert_loc (loc, type,
-                               fold_build2_loc (loc, MULT_EXPR, utype, tem,
-                                                fold_convert_loc (loc, utype, same)));
+      return NULL_TREE;
     }

 /* Subroutine of native_encode_expr.  Encode the INTEGER_CST
Index: gcc/match.pd
===================================================================
--- gcc/match.pd        (revision 256977)
+++ gcc/match.pd        (working copy)
@@ -4617,3 +4617,29 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        || wi::geu_p (wi::to_wide (@rpos), wi::to_wide (@ipos) + isize))
     (BIT_FIELD_REF @0 @rsize @rpos)))
+
+(for plusminus (plus minus)
+ (simplify
+  (plusminus (mult @0 @1) (mult:c @0 @2))
+  (if (! INTEGRAL_TYPE_P (type)
+       || TYPE_OVERFLOW_WRAPS (type)
+       || (tree_expr_nonzero_p (@0)
+           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)))))
+   (mult (plusminus @1 @2) @0)))
+ (simplify
+  (plusminus @0 (mult:c @0 @2))
+  /* We cannot generate constant 1 for fract.  */
+  (if (! ALL_FRACT_MODE_P (TYPE_MODE (type))
+       && (! INTEGRAL_TYPE_P (type)
+           || TYPE_OVERFLOW_WRAPS (type)
+           || (tree_expr_nonzero_p (@0)
+               && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))))))
+   (mult (plusminus { build_one_cst (type); } @2) @0)))
+ (simplify
+  (plusminus (mult:c @0 @2) @0)
+  (if (! ALL_FRACT_MODE_P (TYPE_MODE (type))
+       && (! INTEGRAL_TYPE_P (type)
+           || TYPE_OVERFLOW_WRAPS (type)
+           || (tree_expr_nonzero_p (@0)
+               && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))))))
+   (mult (plusminus @2 { build_one_cst (type); }) @0))))

Now testing the patch looking for testsuite fallout.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
So I think the solution will be to remove this transform from early folding
and make reassoc do this transform - which it already can do, but only for
unsigned arithmetic.  I'll give teaching reassoc to operate on signed
arithmetic a quick shot...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #4 from Richard Biener ---
Created attachment 43003
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43003&action=edit
patch introducing "late" gimple with -fwrapv semantics

So that was for another PR, PR81554.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

--- Comment #3 from Richard Biener ---
(In reply to Jakub Jelinek from comment #2)
> So is there anything we can do here?
> Isn't the bigger problem that we no longer notice the multiplications can't
> overflow?

Yes, that's the issue with all arithmetic we drop to unsigned because of
association.  It's always a trade-off ...

> With the int multiplications gone, we need to assume [1, INT_MAX]
> ranges on b1, b2 and b3 in the loop (otherwise it wouldn't loop at all), and
> [0, INT_MAX - 1] ranges for i1, i2 and i3.  But, from the i1 * b2 * b3 + i2
> * b3 + (i3 - 1) we could have determined that b1 * b2 * b3 is [0, INT_MAX].
>
> We do vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
>     for (int i2 = 0; i2 < b2; ++i2)
>       for (int i3 = 0; i3 < b3; ++i3)
>         foo += x[(i1 * b2 + i2) * b3 + (i3 - 1)];
>   return foo;
> }
> without gather loads, while the #c0 one only with them (if available).
>
> Regarding the PR66313 optimization, first of all, what happened to the move
> of it to match.pd?

Given we have the reassoc pass doing this "properly" (well, but not for
signed integer types...) I have not pursued fully moving it.  We might want
to move some of the more "obvious" cases (but IIRC some cases involving
constants are already handled).  So the ultimate plan is to make reassoc
stronger, rewriting associated signed arithmetic to unsigned where required.

> As we vectorize:
> int
> f (int *x, int b1, int b2, int b3)
> {
>   int foo = 0;
>   for (int i1 = 0; i1 < b1; ++i1)
>     for (int i2 = 0; i2 < b2; ++i2)
>       for (int i3 = 0; i3 < b3; ++i3)
>         {
>           int t1 = i1 * b2 * b3;
>           int t2 = i2 * b3;
>           int t3 = i3 - 1;
>           foo += x[t1 + t2 + t3];
>         }
>   return foo;
> }
>
> it seems that hasn't been done.
>
> Second thing, in fold_plusminus_mult_expr I don't understand the:
>       /* We are neither factoring zero nor minus one.  */
>       || TREE_CODE (same) == INTEGER_CST)
> part, shouldn't that be || (TREE_CODE (same) == INTEGER_CST &&
> !integer_zerop (same) && !integer_minus_onep (same)) ?

The comment says that same == 0 or same == -1 don't happen by construction.

> Another thing is, for signed a, b, c we can't optimize a * b + a * c as
> a * (b + c) if a can be -1 or 0, exhaustive proof for signed char:
> int
> main ()
> {
>   int a, b, c;
>   for (a = -128; a < 128; a++)
>     for (b = -128; b < 128; b++)
>       for (c = -128; c < 128; c++)
>         {
>           int ab = a * b;
>           int ac = a * c;
>           int ab_ac = ab + ac;
>           if (ab < -128 || ab >= 128 || ac < -128 || ac >= 128
>               || ab_ac < -128 || ab_ac >= 128)
>             continue;
>           /* Ok, in this case a * b + a * c is without UB.  */
>           int b_c = b + c;
> #ifdef NEW
>           b_c = (signed char) b_c;
> #endif
>           int r = a * b_c;
>           if (b_c < -128 || b_c >= 128 || r != ab_ac)
>             __builtin_printf ("%d * %d + %d * %d = %d wrong opt (%d)\n",
>                               a, b, a, c, ab_ac, r);
>         }
>   return 0;
> }
>
> But, if we know for sure that a is not -1, we could optimize it as
> a * (signed)((unsigned) b + c)
> (which doesn't work for a == -1, but works for others) instead of
> (signed)(a * ((unsigned) b + c))
> (which works for all, but is a too big hammer).
>
> Perhaps if this optimization is moved over to match.pd and we'd defer it if
> a isn't constant until at least vrp1 and tune based on range info?
> The range for b3 at least during the vrp pass itself is [1, INT_MAX], so
> neither -1 nor 0 are possible and we could safely reassociate it as
> a * (b + c) rather than the variants with unsigned.
>
> Or do we want to somehow mark the MULT_EXPR we've created, propagate it
> through the IL and if we in vrp notice this way marked MULT_EXPR (or IFN)
> we fold it as integer multiplication?  That would be ugly.

I don't really see a good solution to this PR either :/  Delaying the
folding to GIMPLE would be ok with me, but as said I'd like to see reassoc
enhanced rather than stuffing more and more "complex" patterns into
match.pd.

Ultimately we'd want to compute SCEV for all relevant IVs early and cache
them.  That should be easy when only constants are involved but will be
tricky (impossible) when any symbols are.  In the case of this testcase
it's all constants of course.

I once did experiment with "lowering" GIMPLE to -fwrapv semantics (as RTL
is) before late reassoc and that "worked" (well, it of course regressed
stuff, and it was merely to prove reassoc of signed arithmetic is useful).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Jakub Jelinek changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek ---
So is there anything we can do here?
Isn't the bigger problem that we no longer notice the multiplications can't
overflow?  With the int multiplications gone, we need to assume [1, INT_MAX]
ranges on b1, b2 and b3 in the loop (otherwise it wouldn't loop at all), and
[0, INT_MAX - 1] ranges for i1, i2 and i3.  But, from the
i1 * b2 * b3 + i2 * b3 + (i3 - 1) we could have determined that b1 * b2 * b3
is [0, INT_MAX].

We do vectorize:
int
f (int *x, int b1, int b2, int b3)
{
  int foo = 0;
  for (int i1 = 0; i1 < b1; ++i1)
    for (int i2 = 0; i2 < b2; ++i2)
      for (int i3 = 0; i3 < b3; ++i3)
        foo += x[(i1 * b2 + i2) * b3 + (i3 - 1)];
  return foo;
}
without gather loads, while the #c0 one only with them (if available).

Regarding the PR66313 optimization, first of all, what happened to the move
of it to match.pd?  As we vectorize:
int
f (int *x, int b1, int b2, int b3)
{
  int foo = 0;
  for (int i1 = 0; i1 < b1; ++i1)
    for (int i2 = 0; i2 < b2; ++i2)
      for (int i3 = 0; i3 < b3; ++i3)
        {
          int t1 = i1 * b2 * b3;
          int t2 = i2 * b3;
          int t3 = i3 - 1;
          foo += x[t1 + t2 + t3];
        }
  return foo;
}
it seems that hasn't been done.

Second thing, in fold_plusminus_mult_expr I don't understand the:
      /* We are neither factoring zero nor minus one.  */
      || TREE_CODE (same) == INTEGER_CST)
part, shouldn't that be || (TREE_CODE (same) == INTEGER_CST
&& !integer_zerop (same) && !integer_minus_onep (same)) ?

Another thing is, for signed a, b, c we can't optimize a * b + a * c as
a * (b + c) if a can be -1 or 0, exhaustive proof for signed char:
int
main ()
{
  int a, b, c;
  for (a = -128; a < 128; a++)
    for (b = -128; b < 128; b++)
      for (c = -128; c < 128; c++)
        {
          int ab = a * b;
          int ac = a * c;
          int ab_ac = ab + ac;
          if (ab < -128 || ab >= 128 || ac < -128 || ac >= 128
              || ab_ac < -128 || ab_ac >= 128)
            continue;
          /* Ok, in this case a * b + a * c is without UB.  */
          int b_c = b + c;
#ifdef NEW
          b_c = (signed char) b_c;
#endif
          int r = a * b_c;
          if (b_c < -128 || b_c >= 128 || r != ab_ac)
            __builtin_printf ("%d * %d + %d * %d = %d wrong opt (%d)\n",
                              a, b, a, c, ab_ac, r);
        }
  return 0;
}

But, if we know for sure that a is not -1, we could optimize it as
a * (signed)((unsigned) b + c)
(which doesn't work for a == -1, but works for others) instead of
(signed)(a * ((unsigned) b + c))
(which works for all, but is a too big hammer).

Perhaps if this optimization is moved over to match.pd and we'd defer it if
a isn't constant until at least vrp1 and tune based on range info?
The range for b3 at least during the vrp pass itself is [1, INT_MAX], so
neither -1 nor 0 are possible and we could safely reassociate it as
a * (b + c) rather than the variants with unsigned.

Or do we want to somehow mark the MULT_EXPR we've created, propagate it
through the IL and if we in vrp notice this way marked MULT_EXPR (or IFN)
we fold it as integer multiplication?  That would be ugly.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-06-13
     Ever confirmed|0           |1

--- Comment #1 from Richard Biener ---
Confirmed.  SCEV finds (rightfully so) that the evolution of the base is no
longer affine (it may wrap).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81082

Richard Biener changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
           Keywords|                          |missed-optimization
                 CC|                          |rguenth at gcc dot gnu.org
            Version|unknown                   |8.0
   Target Milestone|---                       |8.0
            Summary|[7 Regression] Failure to |[8 Regression] Failure to
                   |vectorise after           |vectorise after
                   |reassociating index       |reassociating index
                   |computation               |computation