On Tue, 13 Jun 2017, Bin.Cheng wrote:

> On Tue, Jun 13, 2017 at 12:48 PM, Richard Biener <rguent...@suse.de> wrote:
> > On Tue, 13 Jun 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Tue, 13 Jun 2017, Richard Sandiford wrote:
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > So I've come back to PR66313 and found a solution to the
> >> >> > tailrecursion missed optimization when fixing the factoring
> >> >> > folding to use an unsigned type when we're not sure of overflow.
> >> >> >
> >> >> > The folding part is identical to my last try from 2015; the
> >> >> > tailrecursion part makes us handle intermittent stmts that were
> >> >> > introduced by foldings and that "clobber" our walk of the
> >> >> > single-use chain of stmts between the call and the return (we
> >> >> > currently fail at all stmts that are not part of said chain).
> >> >> > A simple solution is to move the stmts that are not part of the
> >> >> > chain, and that we can move, before the call.  That handles the
> >> >> > leaf conversions that now appear for tree-ssa/tailrecursion-6.c.
> >> >> >
> >> >> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >> >
> >> >> > Richard.
> >> >> >
> >> >> > 2017-05-31  Richard Biener  <rguent...@suse.de>
> >> >> >
> >> >> > 	PR middle-end/66313
> >> >> > 	* fold-const.c (fold_plusminus_mult_expr): If the factored
> >> >> > 	factor may be zero use a wrapping type for the inner operation.
> >> >> > 	* tree-tailcall.c (independent_of_stmt_p): Pass in to_move
> >> >> > 	bitmap and handle moved defs.
> >> >> > 	(process_assignment): Properly guard the unary op case.
> >> >> > 	Return a tri-state indicating that moving the stmt before the
> >> >> > 	call may allow us to continue.  Pass through to_move.
> >> >> > 	(find_tail_calls): Handle moving unrelated defs before
> >> >> > 	the call.
> >> >> >
> >> >> > 	* c-c++-common/ubsan/pr66313.c: New testcase.
> >> >> > 	* gcc.dg/tree-ssa/loop-15.c: Adjust.
> >> >> >
> >> >> > Index: gcc/fold-const.c
> >> >> > ===================================================================
> >> >> > *** gcc/fold-const.c.orig	2015-10-29 12:32:33.302782318 +0100
> >> >> > --- gcc/fold-const.c	2015-10-29 14:08:39.936497739 +0100
> >> >> > *************** fold_plusminus_mult_expr (location_t loc
> >> >> > *** 6916,6925 ****
> >> >> >         }
> >> >> >     same = NULL_TREE;
> >> >> >
> >> >> > !   if (operand_equal_p (arg01, arg11, 0))
> >> >> > !     same = arg01, alt0 = arg00, alt1 = arg10;
> >> >> > !   else if (operand_equal_p (arg00, arg10, 0))
> >> >> >       same = arg00, alt0 = arg01, alt1 = arg11;
> >> >> >     else if (operand_equal_p (arg00, arg11, 0))
> >> >> >       same = arg00, alt0 = arg01, alt1 = arg10;
> >> >> >     else if (operand_equal_p (arg01, arg10, 0))
> >> >> > --- 6916,6926 ----
> >> >> >         }
> >> >> >     same = NULL_TREE;
> >> >> >
> >> >> > !   /* Prefer factoring a common non-constant.  */
> >> >> > !   if (operand_equal_p (arg00, arg10, 0))
> >> >> >       same = arg00, alt0 = arg01, alt1 = arg11;
> >> >> > +   else if (operand_equal_p (arg01, arg11, 0))
> >> >> > +     same = arg01, alt0 = arg00, alt1 = arg10;
> >> >> >     else if (operand_equal_p (arg00, arg11, 0))
> >> >> >       same = arg00, alt0 = arg01, alt1 = arg10;
> >> >> >     else if (operand_equal_p (arg01, arg10, 0))
> >> >> > *************** fold_plusminus_mult_expr (location_t loc
> >> >> > *** 6974,6987 ****
> >> >> >         }
> >> >> >     }
> >> >> >
> >> >> > !   if (same)
> >> >> >       return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> >                               fold_build2_loc (loc, code, type,
> >> >> >                                                fold_convert_loc (loc, type,
> >> >> >                                                                  alt0),
> >> >> >                                                fold_convert_loc (loc, type,
> >> >> >                                                                  alt1)),
> >> >> >                               fold_convert_loc (loc, type, same));
> >> >> >
> >> >> > !   return NULL_TREE;
> >> >> >   }
> >> >> >
> >> >> >   /* Subroutine of native_encode_expr.  Encode the INTEGER_CST
> >> >> > --- 6975,7010 ----
> >> >> >         }
> >> >> >     }
> >> >> >
> >> >> > !   if (!same)
> >> >> > !     return NULL_TREE;
> >> >> > !
> >> >> > !   if (! INTEGRAL_TYPE_P (type)
> >> >> > !       || TYPE_OVERFLOW_WRAPS (type)
> >> >> > !       /* We are neither factoring zero nor minus one.  */
> >> >> > !       || TREE_CODE (same) == INTEGER_CST)
> >> >> >       return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> >                               fold_build2_loc (loc, code, type,
> >> >> >                                                fold_convert_loc (loc, type,
> >> >> >                                                                  alt0),
> >> >> >                                                fold_convert_loc (loc, type,
> >> >> >                                                                  alt1)),
> >> >> >                               fold_convert_loc (loc, type, same));
> >> >> >
> >> >> > !   /* Same may be zero and thus the operation 'code' may overflow.  Likewise
> >> >> > !      same may be minus one and thus the multiplication may overflow.  Perform
> >> >> > !      the operations in an unsigned type.  */
> >> >> > !   tree utype = unsigned_type_for (type);
> >> >> > !   tree tem = fold_build2_loc (loc, code, utype,
> >> >> > !                               fold_convert_loc (loc, utype, alt0),
> >> >> > !                               fold_convert_loc (loc, utype, alt1));
> >> >> > !   /* If the sum evaluated to a constant that is not -INF the multiplication
> >> >> > !      cannot overflow.  */
> >> >> > !   if (TREE_CODE (tem) == INTEGER_CST
> >> >> > !       && ! wi::eq_p (tem, wi::min_value (TYPE_PRECISION (utype), SIGNED)))
> >> >> > !     return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> > !                             fold_convert (type, tem), same);
> >> >> > !
> >> >> > !   return fold_convert_loc (loc, type,
> >> >> > !                            fold_build2_loc (loc, MULT_EXPR, utype, tem,
> >> >> > !                                             fold_convert_loc (loc, utype, same)));
> >> >> >   }
> >> >> >
> >> >> >   /* Subroutine of native_encode_expr.  Encode the INTEGER_CST
> >> >>
> >> >> Sorry if you already know, but this part means that we can no longer
> >> >> vectorise:
> >> >>
> >> >> int
> >> >> f (int *x, int b1, int b2, int b3)
> >> >> {
> >> >>   int foo = 0;
> >> >>   for (int i1 = 0; i1 < b1; ++i1)
> >> >>     for (int i2 = 0; i2 < b2; ++i2)
> >> >>       for (int i3 = 0; i3 < b3; ++i3)
> >> >>         foo += x[i1 * b2 * b3 + i2 * b3 + (i3 - 1)];
> >> >>   return foo;
> >> >> }
> >> >>
> >> >> We now convert all the arithmetic in the [...] to unsigned int
> >> >> and reassociate it so that the "- 1" is applied last.  We then assume
> >> >> that the overflow in this subtraction is well-defined and that the
> >> >> &x[...] might not be linear for the inner loop.
> >> >
> >> > Can you file a PR?
> >>
> >> OK, filed as PR81082.
> >>
> >> > # i3_44 = PHI <i3_32(6), 0(4)>
> >> > i3.2_7 = (unsigned int) i3_44;
> >> > _9 = i3.2_7 + _34;
> >> > _10 = (int) _9;
> >> > _11 = (long unsigned int) _10;
> >> > _12 = _11 * 4;
> >> > _13 = x_30(D) + _12;
> >> > _14 = *_13;
> >> > ...
> >> > i3_32 = i3_44 + 1;
> >> >
> >> > so we have *(x_30(D) + 4 * ((sizetype)(int)((unsigned){ 0, +, 1 } + X)))
> >> >
> >> > It might mean that we need to avoid this re-association.  Not sure how
> >> > to detect worthwhile vs. not worthwhile cases during folding.  At least
> >> > I see no easy way to recover from it in SCEV analysis unless the
> >> > number of iterations is constrained more than in your example.
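For reference, a minimal hand-written example of the overflow hazard that
motivates doing the inner operation in an unsigned type in the fold above.
This is not the c-c++-common/ubsan/pr66313.c testcase from the patch; the
values in the comment are invented just to make the wrap visible.

    #include <limits.h>

    /* With c == 0 both products are 0, so this expression never
       overflows as written.  Folding it to (a + b) * c in the signed
       type would compute INT_MAX + 1 for a == INT_MAX, b == 1 and
       invoke signed-overflow undefined behavior; performing the
       addition in unsigned int first, as the patch does, keeps the
       fold value-preserving.  */
    int
    f (int a, int b, int c)
    {
      return a * c + b * c;
    }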
> >>
> >> Yeah, in the example that this was reduced from, not reassociating
> >> is far preferable to changing the types.  But like you say, I've
> >> no idea how we'd know that at this stage.
> >
> > In the past I played with not obfuscating things so early during FE
> > lowering, namely not exposing the * 4 but keeping the original types
> > of array / pointer indexes.  On the original mem-ref branch I had an
> > IDX_EXPR that allowed chaining those (a widen-multiply-accumulate op).
> >
> > That said, the context where this re-association is not interesting
> > is address context and the multiplication to the byte offset.
> >
> > Of course in your case there's also b3 factored out, which looks
> > like a generally interesting transform (but eventually harmful in
> > address context again).
> >
> > From the FE we get
> >
> > *(x + (sizetype) ((long unsigned int) (((i1 * b2) * b3 + i2 * b3) + (i3 +
> > -1)) * 4))
> >
> > and SCEV uses the fact that the mults/adds have undefined behavior
> > on overflow.  The same missed optimization occurs if you make all
> > those vars unsigned (with or without the folding in question):
>
> But with unsigned types it's not a missed optimization, because the
> computation could overflow and result in a non-scev address.
Correct.

> In this case the loop needs to be versioned under overflow/non-overflow
> conditions to be parallelized/vectorized, right?  Or split for cases
> where we can easily infer the boundary condition of overflow.

Yes.

Richard.

> Thanks,
> bin
>
> > #define U unsigned
> > int
> > f (int *x, U int b1, U int b2, U int b3)
> > {
> >   int foo = 0;
> >   for (U int i1 = 0; i1 < b1; ++i1)
> >     for (U int i2 = 0; i2 < b2; ++i2)
> >       for (U int i3 = 0; i3 < b3; ++i3)
> >         foo += x[i1 * b2 * b3 + i2 * b3 + (i3 - 1)];
> >   return foo;
> > }
> >
> > It works again for unsigned long as there can be no valid object
> > where the computation could wrap around.
> >
> > Richard.
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
> > HRB 21284 (AG Nuernberg)

--
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
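As a rough illustration of the versioning idea discussed above: a
hand-written sketch, not the output of any existing pass.  The function
name f_versioned and the guard are invented, and the "- 1" from the
example is dropped so that a single bound check rules out wrapping.

    #include <limits.h>

    int
    f_versioned (int *x, unsigned b1, unsigned b2, unsigned b3)
    {
      int foo = 0;
      /* If b1 * b2 * b3 fits in unsigned int, none of the index
         computations below can wrap, so in the guarded copy the
         address of x[...] is linear in i3 and the inner loop is a
         vectorization candidate.  Checking tmp <= UINT_MAX first
         keeps the second multiplication from overflowing the wider
         type.  */
      unsigned long long tmp = (unsigned long long) b1 * b2;
      if (tmp <= UINT_MAX && tmp * b3 <= UINT_MAX)
        {
          for (unsigned i1 = 0; i1 < b1; ++i1)
            for (unsigned i2 = 0; i2 < b2; ++i2)
              for (unsigned i3 = 0; i3 < b3; ++i3)
                foo += x[i1 * b2 * b3 + i2 * b3 + i3];  /* no wrap possible */
        }
      else
        {
          for (unsigned i1 = 0; i1 < b1; ++i1)
            for (unsigned i2 = 0; i2 < b2; ++i2)
              for (unsigned i3 = 0; i3 < b3; ++i3)
                foo += x[i1 * b2 * b3 + i2 * b3 + i3];  /* may wrap */
        }
      return foo;
    }

The two loop-nest copies are textually identical; the point is that a
compiler emitting this form could assume no wrapping in the guarded copy
and vectorize it, while keeping the possibly-wrapping scalar fallback.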