Re: Inefficient end-of-loop value computation - missed optimization somewhere?

Andrew Pinski Tue, 21 Feb 2012 12:16:14 -0800

On Tue, Feb 21, 2012 at 1:27 AM, Richard Guenther
<richard.guent...@gmail.com> wrote:
> On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
>> Hello,
>>
>> we've noticed that the loop optimizer sometimes inserts weirdly
>> inefficient code to compute the value of an induction variable
>> at the end of the loop.
>>
>> As a test case stripped down to the core of the problem, consider:
>>
>> int test (int start, int end)
>> {
>>  int i;
>>
>>  for (i = start; i < end; i++)
>>    ;
>>
>>  return i;
>> }
>>
>> That function really ought to be equivalent to
>>
>> int test2 (int start, int end)
>> {
>>  return start < end ? end : start;
>> }
>>
>> And indeed on i386-linux-gnu, we get mostly identical code
>> (except for the branch direction prediction).
>>
>> However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
>> we see for the first function:
>>
>> test:
>>        cmp     r0, r1
>>        bge     .L2
>>        adds    r3, r0, #1
>>        mvns    r0, r0
>>        adds    r1, r1, r0
>>        adds    r0, r3, r1
>> .L2:
>>        bx      lr
>>
>> instead of simply (as for the second function):
>>
>> test2:
>>        cmp     r1, r0
>>        it      ge
>>        movge   r0, r1
>>        bx      lr
>>
>>
>>
>> Where does that weird addition-and-negation sequence come from?
>> In 100t.dceloop we still have:
>>
>> <bb 4>:
>>  # i_9 = PHI <i_5(5), start_2(D)(3)>
>>  i_5 = i_9 + 1;
>>  if (end_4(D) > i_5)
>>    goto <bb 5>;
>>  else
>>    goto <bb 6>;
>>
>> <bb 5>:
>>  goto <bb 4>;
>>
>> <bb 6>:
>>  # i_1 = PHI <i_5(4)>
>>
>>
>> But 102t.sccp has:
>>
>> <bb 4>:
>>  # i_9 = PHI <i_5(5), start_2(D)(3)>
>>  i_5 = i_9 + 1;
>>  if (end_4(D) > i_5)
>>    goto <bb 5>;
>>  else
>>    goto <bb 6>;
>>
>> <bb 5>:
>>  goto <bb 4>;
>>
>> <bb 6>:
>>  D.4123_10 = start_2(D) + 1;
>>  D.4124_11 = ~start_2(D);
>>  D.4125_12 = (unsigned int) D.4124_11;
>>  D.4126_13 = (unsigned int) end_4(D);
>>  D.4127_14 = D.4125_12 + D.4126_13;
>>  D.4128_15 = (int) D.4127_14;
>>  i_1 = D.4123_10 + D.4128_15;
>>
>> This is the sequence that ends up in the final assembler code.
>>
>> Note that this computes:
>>
>>  i_1 = (start_2(D) + 1)
>>         + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>>
>> which is (since those casts are no-ops on the assembler level):
>>
>>  i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>>
>> which is due to two's-complement arithmetic:
>>
>>  i_1 = start_2(D) - start_2(D) + end_4(D)
>>
>> that is simply:
>>
>>  i_1 = end_4(D)
>>
>>
>> Unfortunately, no tree-based optimization pass after 102t.sccp is able to
>> clean up that mess.  This is true even on *i386*: the only reason we get
>> nice assembler there is due to *RTX* optimization, notably in combine.
>> This hits on i386 because of an intermediate stage that adds two registers
>> and the constant 1 (which matches the lea pattern).  On arm, there is no
>> instruction that would handle that intermediate stage, and combine is not
>> powerful enough to perform the whole simplification in a single step.
>>
>>
>> Now that sequence of gimple operations is generated in scev_const_prop
>> in tree-scalar-evolution.c.  First, the phi node corresponding to the
>> loop exit value is evaluated:
>>
>>  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>>
>> which returns a chrec of the form
>>
>>  { 1, +, (start + 1) }
>>
>> This is then further simplified by
>>
>>  def = compute_overall_effect_of_inner_loop (ex_loop, def);
>>
>> which first computes the number of loop iterations:
>>
>>  tree nb_iter = number_of_latch_executions (loop);
>>
>> which returns a tree representing
>>
>>  (unsigned int) ~start + (unsigned int) end
>>
>> (which at this point seems to be the validly most efficient way to
>> compute end - start - 1).
>>
>> The chrec is then evaluated via chrec_apply which basically computes
>>
>>  (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>>
>> which ends up being converted to the long gimple sequence seen above.
>>
>>
>> Now I guess my questions are:
>>
>> - In the computation shown above, should that tree have been folded
>>  into just "end" to start with?  The tree is constructed at its
>
> The issue is that (start + 1) + 1 * (int) ... is computed using signed
> integer arithmetic which may invoke undefined behavior when it wraps.
> Thus we cannot re-associate it.  I suppose if you'd arrange the code
> to do the arithmetic in unsigned and only cast the result back to the
> desired type we might fold it ...
>
>>  outermost level in chrec_fold_plus, which simply calls fold_build2.
>>  While simplify-rtx.c has code to detect sequences of operations
>>  that cancel out while folding a PLUS or MINUS RTX, there does
>>  not appear to be corresponding code in fold-const.c to handle
>>  the equivalent optimization at the tree level.
>
> There are.  fold-const.c has some re-association logic to catch this
> and we have tree-ssa-reassoc.c.  All of those only handle same-sign
> operands though I think.
>
>> - If not, should there be another tree pass later on that ought to
>>  clean this up?  I notice there is code to handle plus/minus
>>  sequences in tree-ssa-reassoc.c.  But this doesn't seem to be
>>  able to handle this particular case, possibly because the presence
>>  of the casts (nop_exprs) ...
>>
>> - Anywhere else I'm overlooking right now?
>>
>>
>> I'd appreciate any tips / suggestions how to fix this ...
>
> Look at doing the computation in unsigned arithmetic.



I have some patches which improve this but it is more of a cleanup and
it is not fully done yet.
Right now we catch the case in gfortran.dg/reassoc_6.f which has the
above pattern but only in VPR and only during the simplifing phase, I
am working on improving that to get to the point where we can
recognize this while getting the range in VPR.

Thanks,
Andrew Pinski

Re: Inefficient end-of-loop value computation - missed optimization somewhere?

Reply via email to