On Tue, Feb 21, 2012 at 1:27 AM, Richard Guenther <richard.guent...@gmail.com> wrote:
> On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
>> Hello,
>>
>> we've noticed that the loop optimizer sometimes inserts weirdly
>> inefficient code to compute the value of an induction variable
>> at the end of the loop.
>>
>> As a test case stripped down to the core of the problem, consider:
>>
>> int test (int start, int end)
>> {
>>   int i;
>>
>>   for (i = start; i < end; i++)
>>     ;
>>
>>   return i;
>> }
>>
>> That function really ought to be equivalent to
>>
>> int test2 (int start, int end)
>> {
>>   return start < end ? end : start;
>> }
>>
>> And indeed on i386-linux-gnu, we get mostly identical code
>> (except for the branch direction prediction).
>>
>> However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
>> we see for the first function:
>>
>> test:
>>         cmp     r0, r1
>>         bge     .L2
>>         adds    r3, r0, #1
>>         mvns    r0, r0
>>         adds    r1, r1, r0
>>         adds    r0, r3, r1
>> .L2:
>>         bx      lr
>>
>> instead of simply (as for the second function):
>>
>> test2:
>>         cmp     r1, r0
>>         it      ge
>>         movge   r0, r1
>>         bx      lr
>>
>>
>> Where does that weird addition-and-negation sequence come from?
>> In 100t.dceloop we still have:
>>
>>   <bb 4>:
>>   # i_9 = PHI <i_5(5), start_2(D)(3)>
>>   i_5 = i_9 + 1;
>>   if (end_4(D) > i_5)
>>     goto <bb 5>;
>>   else
>>     goto <bb 6>;
>>
>>   <bb 5>:
>>   goto <bb 4>;
>>
>>   <bb 6>:
>>   # i_1 = PHI <i_5(4)>
>>
>> But 102t.sccp has:
>>
>>   <bb 4>:
>>   # i_9 = PHI <i_5(5), start_2(D)(3)>
>>   i_5 = i_9 + 1;
>>   if (end_4(D) > i_5)
>>     goto <bb 5>;
>>   else
>>     goto <bb 6>;
>>
>>   <bb 5>:
>>   goto <bb 4>;
>>
>>   <bb 6>:
>>   D.4123_10 = start_2(D) + 1;
>>   D.4124_11 = ~start_2(D);
>>   D.4125_12 = (unsigned int) D.4124_11;
>>   D.4126_13 = (unsigned int) end_4(D);
>>   D.4127_14 = D.4125_12 + D.4126_13;
>>   D.4128_15 = (int) D.4127_14;
>>   i_1 = D.4123_10 + D.4128_15;
>>
>> This is the sequence that ends up in the final assembler code.
>>
>> Note that this computes:
>>
>>   i_1 = (start_2(D) + 1)
>>         + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>>
>> which is (since those casts are no-ops on the assembler level):
>>
>>   i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>>
>> which, due to two's-complement arithmetic, is:
>>
>>   i_1 = start_2(D) - start_2(D) + end_4(D)
>>
>> that is, simply:
>>
>>   i_1 = end_4(D)
>>
>> Unfortunately, no tree-based optimization pass after 102t.sccp is able
>> to clean up that mess. This is true even on *i386*: the only reason we
>> get nice assembler there is due to *RTX* optimization, notably in
>> combine. This hits on i386 because of an intermediate stage that adds
>> two registers and the constant 1 (which matches the lea pattern). On
>> arm, there is no instruction that would handle that intermediate stage,
>> and combine is not powerful enough to perform the whole simplification
>> in a single step.
>>
>> Now, that sequence of gimple operations is generated in scev_const_prop
>> in tree-scalar-evolution.c. First, the phi node corresponding to the
>> loop exit value is evaluated:
>>
>>   def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>>
>> which returns a chrec of the form
>>
>>   { (start + 1), +, 1 }
>>
>> This is then further simplified by
>>
>>   def = compute_overall_effect_of_inner_loop (ex_loop, def);
>>
>> which first computes the number of loop iterations:
>>
>>   tree nb_iter = number_of_latch_executions (loop);
>>
>> which returns a tree representing
>>
>>   (unsigned int) ~start + (unsigned int) end
>>
>> (which at this point seems to be the most efficient way to validly
>> compute end - start - 1). The chrec is then evaluated via chrec_apply,
>> which basically computes
>>
>>   (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>>
>> which ends up being converted to the long gimple sequence seen above.
>>
>> Now I guess my questions are:
>>
>> - In the computation shown above, should that tree have been folded
>>   into just "end" to start with? The tree is constructed at its
>
> The issue is that (start + 1) + 1 * (int) ... is computed using signed
> integer arithmetic, which may invoke undefined behavior when it wraps.
> Thus we cannot re-associate it. I suppose if you'd arrange the code
> to do the arithmetic in unsigned and only cast the result back to the
> desired type, we might fold it ...
>
>>   outermost level in chrec_fold_plus, which simply calls fold_build2.
>>   While simplify-rtx.c has code to detect sequences of operations
>>   that cancel out while folding a PLUS or MINUS RTX, there does
>>   not appear to be corresponding code in fold-const.c to handle
>>   the equivalent optimization at the tree level.
>
> There is: fold-const.c has some re-association logic to catch this,
> and we have tree-ssa-reassoc.c. All of those only handle same-sign
> operands, though, I think.
>
>> - If not, should there be another tree pass later on that ought to
>>   clean this up? I notice there is code to handle plus/minus
>>   sequences in tree-ssa-reassoc.c. But this doesn't seem to be
>>   able to handle this particular case, possibly because of the
>>   presence of the casts (nop_exprs) ...
>>
>> - Anywhere else I'm overlooking right now?
>>
>> I'd appreciate any tips / suggestions on how to fix this ...
>
> Look at doing the computation in unsigned arithmetic.
I have some patches which improve this, but it is more of a cleanup and it
is not fully done yet. Right now we catch the case in
gfortran.dg/reassoc_6.f, which has the above pattern, but only in VRP and
only during the simplifying phase. I am working on improving that to get
to the point where we can recognize this while getting the range in VRP.

Thanks,
Andrew Pinski