Richard Biener <rguent...@suse.de> writes: > On Tue, 25 Jan 2022, Jiufu Guo wrote: > >> Jiufu Guo <guoji...@linux.ibm.com> writes: >> >> > Richard Biener <rguent...@suse.de> writes: >> > >> >> On Thu, 13 Jan 2022, Jiufu Guo wrote: >> > ... >> >> >> >>> - /* No need to check sign of the new step since below code takes >> >>> care >> >>> - of this well. */ >> >>> + /* Like cases shown in PR100740/102131, negtive step is not safe. >> >>> */ >> >>> + if (tree_int_cst_sign_bit (step)) >> >>> + return false; >> >>> + >> >>> if (code != NE_EXPR >> >>> && (TREE_CODE (step) != INTEGER_CST >> >>> || !iv0->no_overflow || !iv1->no_overflow)) >> >> >> >> I think for NE_EXPR the sign is not relevant. I think the key is >> >> that we require that iv0->no_overflow and for non-equality checks >> >> adjusting X + C1 < Y + C2 to X + C1 - C2 < Y is only valid if that >> >> does not cause any overflow on either side. On the LHS this is >> > >> > Hi Richard, >> > >> > Thanks a lot for your comments and ideas! >> > >> > Right! The adjusting is safe only if we can make sure there is >> > no overflow/wrap on either side or say there is no overflow/wrap >> > on three 'iv's: {X,C1}, {Y,C2} and {X, C1 - C2}. Hi Richard,
Thanks for your quickly reply, and patient review! > > The point is that we may not change the iteration number at which > overflow occurs since that alters the result of the < compare. > Only if we know there is no overflow with the present expression > during the loop evaluation we can do the transform and then only > if we do not introduce overflow. Exactly, this is also what I mean :) > > We are basically doing the transform that fold_comparison > in fold-const.cc does: > > /* Transform comparisons of the form X +- C1 CMP Y +- C2 to > X CMP Y +- C2 +- C1 for signed X, Y. This is valid if > the resulting offset is smaller in absolute value than the > original one and has the same sign. */ > if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (arg0)) > && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (arg0)) > ... > This transform seems not the scenario which we are care about in number_of_iterations_cond. For example, 'X + 1 < Y + 4' ==> 'X < Y + 3' would be correct if no overflow happen. But for loop, the niter for '{X, 1} < {Y, 4}' would be totally different with niter for '{X, 0} < {Y, 3}'. for (iv0 = X, iv1 = Y; iv0 < iv1; iv0 += 1, iv1 += 4) in this loop, iv1 walks to overflow faster, step is 4. vs. for (iv0 = X, iv1 = Y; iv0 < iv1; iv1 += 3) (iv1 overflow slow) in this loop, iv1 overflows slower, step is 3. Actually, the transformation 'X + 1 < Y + 4' ==> 'X < Y + 3', may not always correct. e.g. for below code, X=6, and Y=2147483645 it may output "b0 + 1 < b1 + 4 is true". ```c int __attribute__ ((noinline)) foo (int b0, int b1) { return __builtin_printf ("b0 + 1 < b1 + 4 is %s\n", b0 + 1 < b1 + 4 ? "true" : "false"); } int main(int argc, char **argv) { if (argc < 2) return -1; int b0 = atoi(argv[1]); int b1 = atoi(argv[2]); return foo (b0, b1); } ``` >> > Or it may also ok if we can compute an assumption, under which >> > all three ivs are not overflowed/wrapped. >> > >> >> only guaranteed if the absolute value of C1 - C2 is smaller than >> >> that of C1 and if it has the same sign. >> > I'm thinking this in another way: >> > When trying to do this transform in number_of_iterations_cond, >> > GT/GE is inverted to LT/LE, then the compare code would be: >> > LT/LE or NE. >> > >> > For LT/LE, like {X, C1} < {Y, C2}, we can look it as iv0 is >> > chasing iv1. We would able to assume X < Y (may_be_zero would >> > be set later via number_of_iterations_lt/le). >> > 1. If C1 < C2, iv0 can never catch up iv1. For examples: >> > {X, 1} < {Y, 2}; {X, -2} < {Y, -1}; {X, -2} < {Y, 1}. >> > And there must be at least one overflow/wrap in iv0,iv1, or iv. >> > This indicates, if the sign of (C1 - C1) is negative, then the >> > transform would be incorrect. >> > 2. If C1 > C2, we still need to make sure all the ivs (iv0, >> > iv1 and combined iv) are not wrapped. >> > For C2 > 0, {Y,C2} should not cross MAX before {X, C1} catch up. >> > the assumption may like : (MAX-Y)/C2 > (Y-X)/(C1-C1) >> There is still some cases: iv0 step is too large, then iv0 wraps >> first, e.g. {MAX-5, 10} < {MAX-3, 1}. For this, the assumption >> would need to and with (MAX-X)/C1 > (Y-X)/(C1-C1). >> >> > For C1 < 0, {X,C1} should not down cross MIN >> > the assumption may like : (X-MIN)/-C1 > (Y-X)/(C1-C1) >> Also add the assumption: (Y-MIN)/-C2 > (Y-X)/(C1-C1) >> >> > For C1 > 0 and C2 < 0, iv0 and iv1 are walking to each other, >> > it would be almost safe. >> For this case, we may still add the assumption to avoid wraping >> at the first iteration. >> >> BR, >> Jiufu >> >> > >> > For NE, it seems more interesting. The transformation depends >> > on 3 things: 1. the relation between X and Y; 2 the sign >> > of (C1-C2); 3. if iv0 and iv1 can be equal finally. >> > The 3rd one may be more special. >> > The good news is, number_of_iterations_ne seems able to handle NE. >> > >> >> >> >> With the same reasoning we then know the new IV0 doesn't overflow. >> >> >> >> So something like the following. IIRC I've proposed sth similar >> >> a while back. I'm going to give it some testing, collecting >> >> testcases from related PRs. >> >> >> >> diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc >> >> index b767056aeb0..74fa4f66ee2 100644 ... >> >> + if (TREE_CODE (step) != INTEGER_CST >> >> + || !iv0->no_overflow || !iv1->no_overflow) >> >> + { >> > I was also trying to leverage no_overflow of iv0 and iv1. While it seems >> > the computation logic of no_overflow is related to the type of IV. If the >> > type of IV is signed, the C semantics may be used, overflow in signed >> > IV are treated UB, and then no_overflow would be true. >> > >> > For unsigned IV, no_overflow would be false, even for the cases which >> > looks like: >> > "{10, 2} < {20, 1}", which would be ok to compute niter. > > IIRC no_overflow is determined by SCEV which might also use niter > analysis. For the case of {10, +2} < {20, +1} there is no need to > compute it as {10, +1} < 20 and we hopefully deal with this in > other code paths (in fact with base and step all constant we > can simply solve the linear equation for 'n' - maybe that's a > capability we should add to number_of_iterations_cond). Thanks for point this out. Yes, for const base(s) and step(s), we have other code path to deal with (e.g. loop_niter_by_eval). For {10, +2}, what I really mean is about the no_overflow iv(s) on unsigned. Sorry the misleading words. For no_overflow, it is set at some places, including number_of_iterations_xxx. :), Before number_of_iterations_xxx, no_overflow could be calculated in simple_iv_with_niters: ```c iv->no_overflow = !folded_casts && nowrap_type_p (type); ``` nowrap_type_p checks if overflow is UB on type through macro TYPE_OVERFLOW_UNDEFINED. For signed, it is UB; for unsigned it is different. For example as below code, no_overflow is set as false for iv0 and iv1, and then niter was not computed quickly. ```c unsigned __attribute__ ((noinline)) foo (unsigned b0, unsigned b1) { unsigned n = 0; for (; b0 < b1; b0 += 2, b1 += 1) n++; return n; } int main() { return foo (10, 20); } ``` > > Running a small example through the debugger I can indeed see > no_overflow = true for this case (after a few times == false), > so we do have some ways of determining there is no overflow for this > case. > > I do not think we should try to add special-casings to this very > generic transform. > > I will look at the regression that was pointed out now. You may talking about pr81196.c, it seems relate to "{p0, 1} < {p1, -1}". It is not transformed to "{p0, 2} < {p1,0}" anymore, and niter is not computed, then vectorization does not happen. BR, Jiufu > > Richard.