[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0

aldyh at gcc dot gnu.org Thu, 15 Mar 2018 04:26:27 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359


--- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
Hi Richi.

(In reply to rguent...@suse.de from comment #31)

> I'd have not restricted the out-of-loop IV use to IV +- CST but
> instead did the transform
> 
> +   LOOP:
> +     # p_8 = PHI <p_16(2), p_INC(3)>
> +     ...
> +     p_INC = p_8 - 1;
> +     goto LOOP;
> +     ... p_8 uses ...
> 
> to
> 
> +   LOOP:
> +     # p_8 = PHI <p_16(2), p_INC(3)>
> +     ...
> +     p_INC = p_8 - 1;
> +     goto LOOP;
>       newtem_12 = p_INC + 1; // undo IV increment
>       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> 
> so it would always work if we can undo the IV increment.
> 
> The disadvantage might be that we then rely on RTL optimizations
> to combine the original out-of-loop constant add with the
> newtem computation but I guess that's not too much to ask ;)

It looks like the RTL optimizers have a harder time with the code produced
by the above approach.

Doing what you suggest, we end up transforming this (simplified for brevity):

  <bb 3>
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  ...

  <bb 5>
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

into this:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 4>:
  _25 = p_19 + 1;          ;; undo the increment
  ...

  <bb 5>:
  p_22 = _25 + 4294967294;
  MEM[(char *)_25 + 4294967294B] = 45;
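
For anyone checking the constants: assuming 32-bit pointers (which the
offsets in the dump suggest), 4294967295 is -1 and 4294967294 is -2 modulo
2^32, so the rewritten store still hits the same byte as before. A minimal
sketch of that arithmetic in C follows; the function names are made up for
illustration and plain uint32_t values stand in for the pointers:

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as printed in the dumps, read as 32-bit unsigned constants. */
#define MINUS_ONE 4294967295u /* 0xFFFFFFFF, i.e. -1 mod 2^32 */
#define MINUS_TWO 4294967294u /* 0xFFFFFFFE, i.e. -2 mod 2^32 */

/* Store address in <bb 5> before the rewrite: p_19 - 1. */
static uint32_t store_addr_original(uint32_t p_8)
{
    uint32_t p_19 = p_8 + MINUS_ONE;
    return p_19 + MINUS_ONE;
}

/* Store address after the rewrite: _25 = p_19 + 1, then _25 - 2. */
static uint32_t store_addr_undo(uint32_t p_8)
{
    uint32_t p_19 = p_8 + MINUS_ONE;
    uint32_t t_25 = p_19 + 1u;      /* undo the increment: equals p_8 */
    return t_25 + MINUS_TWO;
}

/* Both forms agree for a few sample values, including wraparound. */
static void check_equivalence(void)
{
    const uint32_t samples[] = { 0u, 1u, 2u, 0x1000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(store_addr_original(samples[i]) == store_addr_undo(samples[i]));
}
```

Both forms compute p_8 - 2; the difference is only in which SSA name the
address is based on, which is what the later passes get to work with.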

I haven't dug into the RTL optimizations, but the end result is that we only
get one auto-dec inside the loop, and some -2 indexing outside of it:

        strb    r1, [r4, #-1]!
        lsr     r3, r3, #3
        bhi     .L4
        cmp     r6, #0
        movlt   r2, #45
        add     r3, r4, #1
        strblt  r2, [r3, #-2]
        sublt   r4, r4, #1

as opposed to mine:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 5>:
  p_22 = p_19 + 4294967295;
  *p_22 = 45;

which gives us two auto-dec, and much tighter code:

        strb    r1, [r4, #-1]!
        lsr     r3, r3, #3
        bhi     .L4
        cmp     r6, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]!
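
To make the two shapes easier to compare at the source level, here is a
rough C sketch. The loop shape is only an assumption inferred from the
GIMPLE and assembly above (digits stored backwards, then a '-' stored for
negative input); the names emit_digits, fmt_undo and fmt_direct are
hypothetical. It shows only that the two forms are semantically equivalent;
a compiler is free to generate identical code for both, so it does not by
itself reproduce the code-size difference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the decimal digits of ui backwards, ending just before `end`.
   Returns a pointer to the last byte written (p_19 in the dumps). */
static char *emit_digits(unsigned int ui, char *end)
{
    char *p = end;
    do {
        p = p - 1;                       /* p_19 = p_8 - 1 */
        *p = (char)('0' + ui % 10);
        ui /= 10;
    } while (ui != 0);
    return p;
}

/* Variant A: undo the last decrement, then index relative to that. */
static char *fmt_undo(int i, char *buf, size_t len)
{
    unsigned int ui = (i < 0) ? 0u - (unsigned int)i : (unsigned int)i;
    buf[len - 1] = '\0';
    char *p = emit_digits(ui, buf + len - 1);
    if (i < 0) {
        char *t = p + 1;                 /* _25 = p_19 + 1 */
        t[-2] = '-';                     /* store at _25 - 2 */
        p = t - 2;
    }
    return p;
}

/* Variant B: keep using the in-loop pointer directly. */
static char *fmt_direct(int i, char *buf, size_t len)
{
    unsigned int ui = (i < 0) ? 0u - (unsigned int)i : (unsigned int)i;
    buf[len - 1] = '\0';
    char *p = emit_digits(ui, buf + len - 1);
    if (i < 0)
        *--p = '-';                      /* p_22 = p_19 - 1; *p_22 = 45 */
    return p;
}
```

Calling either function with a 16-byte buffer and -123 yields a pointer to
the string "-123" inside that buffer.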

Would it be OK to go with my approach, or is it worth looking into the RTL
optimizers to see what can be done (boo! :))?

Thanks.

[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
