others compared to gcc-5.3.0

rguenther at suse dot de Thu, 15 Mar 2018 04:52:26 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359


--- Comment #38 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Hi Richi.
> 
> (In reply to rguent...@suse.de from comment #31)
> 
> > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > instead did the transform
> > 
> > +   LOOP:
> > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > +     ...
> > +     p_INC = p_8 - 1;
> > +     goto LOOP;
> > +     ... p_8 uses ...
> > 
> > to
> > 
> > +   LOOP:
> > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > +     ...
> > +     p_INC = p_8 - 1;
> > +     goto LOOP;
> >       newtem_12 = p_INC + 1; // undo IV increment
> >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > 
> > so it would always work if we can undo the IV increment.
> > 
> > The disadvantage might be that we then rely on RTL optimizations
> > to combine the original out-of-loop constant add with the
> > newtem computation but I guess that's not too much to ask ;)
> > k
> 
> It looks like RTL optimizations have a harder time optimizing things when I
> take the above approach.
> 
> Doing what you suggest, we end up optimizing this (simplified for brevity):
> 
>   <bb 3>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   ...
> 
>   <bb 5>
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> into this:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 4>:
>   _25 = p_19 + 1;          ;; undo the increment
>   ...
> 
>   <bb 5>:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)_25 + 4294967294B] = 45;

shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
(just asking...)

> I haven't dug into the RTL optimizations, but the end result is that we only
> get one auto-dec inside the loop, and some -2 indexing outside of it:
> 
>         strb    r1, [r4, #-1]!
>         lsr     r3, r3, #3
>         bhi     .L4
>         cmp     r6, #0
>         movlt   r2, #45
>         add     r3, r4, #1
>         strblt  r2, [r3, #-2]
>         sublt   r4, r4, #1
> 
> as opposed to mine:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 5>:
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> which gives us two auto-dec, and much tighter code:
> 
>         strb    r1, [r4, #-1]!
>         lsr     r3, r3, #3
>         bhi     .L4
>         cmp     r6, #0
>         movlt   r3, #45
>         strblt  r3, [r4, #-1]!
> 
> Would it be OK to go with my approach, or is worth looking into the rtl
> optimizers and seeing what can be done (boo! :)).

Would be interesting to see what is causing things to break.  If it's
only combine doing the trick then it will obviously be multi-uses
of the adjustment pseudo.  But eventually I thought fwprop would
come to the rescue...

I think a good approach would be to instead of just replacing
all out-of-loop uses with the new SSA name doing the adjustment
check if the use is in a "simple" (thus SSA name +- cst) stmt
and there forward the adjustment.

So have the best of both worlds.  I would have hoped this
extra complication isn't needed but as you figured...

[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0

Reply via email to