On 04/12/14 11:07, Richard Biener wrote:

On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.w...@arm.com> wrote:


which means re-associate the constant imm with the virtual frame pointer.

transform

      RA <- fixed_reg + RC
      RD <- MEM (RA + const_offset)

   into:

      RA <- fixed_reg + const_offset
      RD <- MEM (RA + RC)

then RA <- fixed_reg + const_offset is actually loop invariant, so the later
RTL GCSE PRE pass could catch it and do the hoisting, and thus ameliorate
what tree
level ivopts could not sort out.
There is a LIM pass after gimple ivopts - if the invariantness is already
visible there why not handle it there similar to the special-cases in
rewrite_bittest and rewrite_reciprocal?

maybe, needs further check.


And of course similar tricks could be applied on the RTL level to
RTL invariant motion?

Thanks. I just checked the code, yes, loop invariant motion pass
is the natural place to integrate such multi-insns invariant analysis trick.

those code could be integrated into loop-invariant.c cleanly, but
then I found although the invariant detected successfully but it's not moved
out of loop because of cost issue, and looks like the patch committed to fix 
PR33928
is trying to prevent such cheap address be hoisted to reduce register pressure.


 805       /* ??? Try to determine cheapness of address computation.  
Unfortunately
 806          the address cost is only a relative measure, we can't really 
compare
 807          it with any absolute number, but only with other address costs.
 808          But here we don't have any other addresses, so compare with a 
magic
 809          number anyway.  It has to be large enough to not regress PR33928
 810          (by avoiding to move reg+8,reg+16,reg+24 invariants), but small
 811          enough to not regress 410.bwaves either (by still moving reg+reg
 812          invariants).
 813          See http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
 814       inv->cheap_address = address_cost (SET_SRC (set), word_mode,
 815                                          ADDR_SPACE_GENERIC, speed) < 3;


I think that maybe necessary for x86 which is short of register, while for RISC,
it may not be that necessary, especially the whole register pressure is not big.

currently, what I could think of is for this transformation below, we should
increase the costs:
A
==
     RA <- virtual_stack_var + RC
     RD <- MEM (RA + const_offset)

  into:

B
==
     RA <- virtual_stack_var + const_offset  <--B
     RD <- MEM (RA + RC)

because the cost is not that cheap, if there is not re-assocation of 
virtual_stack_var
with const_offset, then lra elimination will create another instruction to hold 
the
elimination result, so format A will actually be

     RT <- real_stack_pointer + elimination_offset
     RA <- RT + RC
     RD <- MEM (RA + const_offset)

so, the re-assocation and later hoisting of invariant B could actually save two 
instructions
in the loop, this is why there are 15% perf gap for bzip2 under some situation.

On aarch64, for Seb's testcase,

before IV hoisting
===
        b       .L4
.L37:
        sub     w19, w19, #1
.L4:
        add     x1, x29, 48
        add     x0, x1, x0, sxtw
        ldrb    w0, [x0, -16]
        bl      foo
        mov     w0, w19
        cbnz    w19, .L37
        ldr     x19, [sp, 16]
        ldp     x29, x30, [sp], 48
        ret

after this transformation, and the IV hoisting
(need one extra callee saved, x20)
===
.L3:
        add     x20, x29, 32  <-- IV hoisted
        b       .L4
.L37:
        sub     w19, w19, #1
.L4:
        ldrb    w0, [x20, w0, sxtw]
        bl      foo
        mov     w0, w19
        cbnz    w19, .L37
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 48
        ret

Oh, and the patch misses a testcase.

It's at the start of the email, will include in the patch.

void bar(int i) {
  char A[10];
  int d = 0;
  while (i > 0)
  A[d++] = i--;

  while (d > 0)
  foo(A[d--]);
}


Thanks,
Richard.

and this patch only tries to re-shuffle instructions within single basic
block which
is a inner loop which is perf critical.

I am reusing the loop info in fwprop because there is loop info and it's run
before
GCSE.

verified on aarch64 and mips64, the array base address hoisted out of loop.

bootstrap ok on x86-64 and aarch64.

comments?

thanks.

gcc/
   PR62173
   fwprop.c (prepare_for_gcse_pre): New function.
   (fwprop_done): Call it.


Reply via email to