Code optimization with GCSE

2009-07-14 Thread Jean Christophe Beyler
Dear all,

As some might know, I've been concentrating on optimizing the handling
of loads for my port of GCC. I'm now considering this code:
uint64_t data[107];
uint64_t foo (void)
{
uint64_t x0, x1, x2, x3, x4, x5,x6,x7;
uint64_t i;

for(i=0;i<107;i++) {
data[i] = i;
}

return data[0] + data[1] + data[2];
}

However, the code I get is:

la  r9,data
mov r8, r9
#LOOP, of no consequence, it uses r8 for the stores...

la  r7,data+8#Loads data+8 in r7 instead of using r9 directly
ldd r9,0(r9)   #This loads from r9 and the two next from r7
ldd r6,0(r7)
ldd r8,8(r7)


As you can see, the compiler uses r9 to store data and then uses that
for data[0] but also loads in r7 data+8 instead of directly using r9.
If I remove the loop then it does not do this.

Any ideas where I can look for this, or is this going to be difficult to fix?
Thanks !
Jean Christophe Beyler


Re: Code optimization with GCSE

2009-07-14 Thread Paolo Bonzini



As you can see, the compiler uses r9 to store data and then uses that
for data[0] but also loads in r7 data+8 instead of directly using r9.
If I remove the loop then it does not do this.


This optimization is done by CSE only, currently.  That's why it cannot 
look through loops.


Paolo


Re: Code optimization with GCSE

2009-07-15 Thread Jean Christophe Beyler
Ah ok, so I can see why it would not be able to perform that
optimization around the loop but I changed the code to simply have
this:

uint64_t foo (void)
{
return data[0] + data[1] + data[2];
}

And this generates :

la  r9,data
la  r7,data+8
ldd r6,0(r7)
ldd r8,0(r9)
ldd r7,16(r9)

I'm trying to see if there is a problem with my rtx costs function
because again, I don't understand why it would generate 2 la instead
of using an offset of 8 and 16.

Thanks for any input,
Jc

On Wed, Jul 15, 2009 at 1:29 AM, Paolo Bonzini wrote:
>
>> As you can see, the compiler uses r9 to store data and then uses that
>> for data[0] but also loads in r7 data+8 instead of directly using r9.
>> If I remove the loop then it does not do this.
>
> This optimization is done by CSE only, currently.  That's why it cannot look
> through loops.
>
> Paolo
>


Re: Code optimization with GCSE

2009-07-15 Thread Adam Nemet
Jean Christophe Beyler  writes:
> uint64_t foo (void)
> {
> return data[0] + data[1] + data[2];
> }
>
> And this generates :
>
> la  r9,data
> la  r7,data+8
> ldd r6,0(r7)
> ldd r8,0(r9)
> ldd r7,16(r9)
>
> I'm trying to see if there is a problem with my rtx costs function
> because again, I don't understand why it would generate 2 la instead
> of using an offset of 8 and 16.

You probably want to look at the RTL dumps.  This code should have been
expanded with the correct offsets (at least that is what happens on
MIPS).  I don't see how later passes would modify the code other than
removing 2 of the 3 "la rX, data" insns.

Adam


Re: Code optimization with GCSE

2009-07-15 Thread Jean Christophe Beyler
The subreg pass has this :

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
(const:DI (plus:DI (symbol_ref:DI ("data") )
(const_int 8 [0x8] 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
(symbol_ref:DI ("data") )) 71
{movdi_internal} (nil))

...

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
(const:DI (plus:DI (symbol_ref:DI ("data") )
(const_int 16 [0x10] 71 {movdi_internal} (nil))


As we can see, all three are using the symbol_ref data before adding
their offset. But after cse, we get this:

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
(const:DI (plus:DI (symbol_ref:DI ("data") )
(const_int 8 [0x8] 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
(symbol_ref:DI ("data") )) 71
{movdi_internal} (nil))

...

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
(plus:DI (reg/f:DI 75)
(const_int 16 [0x10]))) 2 {adddi3_port}
(expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data")
table_size = 0, use_info->table_size = 0
;;  invalidated by call  2 [r2] 4 [r4] 5 [r5] 6 [r6] 7 [r7] 8 [r8] 9
[r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 16 [r16] 17
[r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22] 23 [r23] 24 [r24]
25 [r25] 26 [r26] 27 [r27] 28 [r28] 29 [r29] 30 [r30] 31 [r31] 32
[r32] 33 [r33] 34 [r34] 35 [r35] 36 [r36] 37 [r37] 38 [r38] 39 [r39]
40 [r40] 41 [r41] 42 [r42] 43 [r43] 44 [r44] 45 [r45] 46 [r46] 47
[r47] 63 [r63] 64 [$rap] 65 [cc] 66 [acc]
;;  hardware regs used   0 [r0] 1 [r1] 3 [r3]
;;  regular block artificial uses0 [r0] 1 [r1] 3 [r3] 62 [r62]
;;  eh block artificial uses 0 [r0] 1 [r1] 3 [r3] 62 [r62]
;;  entry block defs 0 [r0] 1 [r1] 3 [r3] 6 [r6] 8 [r8] 9 [r9] 10
[r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 62 [r62] 63 [r63]
;;  exit block uses  1 [r1] 3 [r3] 6 [r6] 62 [r62]
;;  regs ever live   6[r6]

( )->[0]->( 2 )
;; bb 0 artificial_defs: { d-1(0){ }d-1(1){ }d-1(3){ }d-1(6){ }d-1(8){
}d-1(9){ }d-1(10){ }d-1(11){ }d-1(12){ }d-1(13){ }d-1(14){ }d-1(15){
}d-1(62){ }d-1(63){ }}
;; bb 0 artificial_uses: { }

( 0 )->[2]->( 1 )
;; bb 2 artificial_defs: { }
;; bb 2 artificial_uses: { u-1(0){ }u-1(1){ }u-1(3){ }u-1(62){ }}

( 2 )->[1]->( )
;; bb 1 artificial_defs: { }
;; bb 1 artificial_uses: { u-1(1){ }u-1(3){ }u-1(6){ }u-1(62){ }}

Finding needed instructions:
  Adding insn 23 to worklist
Finished finding needed instructions:
processing block 2 live out =  0 [r0] 1 [r1] 3 [r3] 6 [r6] 62 [r62]
  Adding insn 17 to worklist
  Adding insn 13 to worklist
  Adding insn 12 to worklist
  Adding insn 11 to worklist
  Adding insn 10 to worklist
  Adding insn 9 to worklist
  Adding insn 8 to worklist
  Adding insn 7 to worklist
  Adding insn 6 to worklist
  Adding insn 5 to worklist
df_worklist_dataflow_overeager:n_basic_blocks 3 n_edges 2 count 3 (1)
;; Following path with 11 sets: 2
deferring rescan insn with uid = 10.
deferring rescan insn with uid = 17.


try_optimize_cfg iteration 1

(note 3 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

(note 2 3 5 2 NOTE_INSN_FUNCTION_BEG)

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
(const:DI (plus:DI (symbol_ref:DI ("data") )
(const_int 8 [0x8] 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
(symbol_ref:DI ("data") )) 71
{movdi_internal} (nil))

(insn 7 6 8 2 ex1b.c:8 (set (reg:DI 77 [ data+8 ])
(mem/s:DI (reg/f:DI 74) [2 data+8 S8 A64])) 71 {movdi_internal} (nil))

(insn 8 7 9 2 ex1b.c:8 (set (reg:DI 78 [ data ])
(mem/s:DI (reg/f:DI 75) [2 data+0 S8 A64])) 71 {movdi_internal} (nil))

(insn 9 8 10 2 ex1b.c:8 (set (reg:DI 76)
(plus:DI (reg:DI 77 [ data+8 ])
(reg:DI 78 [ data ]))) 2 {adddi3_port} (nil))

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
(plus:DI (reg/f:DI 75)
(const_int 16 [0x10]))) 2 {adddi3_port}
(expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data")
)
(const_int 16 [0x10])))
(nil)))

(insn 11 10 12 2 ex1b.c:8 (set (reg:DI 80 [ data+16 ])
(mem/s:DI (reg/f:DI 79) [2 data+16 S8 A64])) 71 {movdi_internal} (nil))

(insn 12 11 13 2 ex1b.c:8 (set (reg:DI 73)
(plus:DI (reg:DI 76)
(reg:DI 80 [ data+16 ]))) 2 {adddi3_port} (nil))

(insn 13 12 17 2 ex1b.c:8 (set (reg:DI 72 [  ])
(reg:DI 73)) 71 {movdi_internal} (nil))

(insn 17 13 23 2 ex1b.c:10 (set (reg/i:DI 6 r6)
(reg:DI 73)) 71 {movdi_internal} (nil))

(insn 23 17 0 2 ex1b.c:10 (use (reg/i:DI 6 r6)) -1 (nil))
starting the processing of deferred insns
rescanning insn with uid = 10.
deleting insn with uid = 10.
rescanning insn with uid = 17.
deleting insn with uid = 17.
ending the processing of deferred insns




On Wed, Jul 15, 2009 at 12:25 PM, Adam Nemet wrote:
> Jean Christophe Beyler  writes:
>> uint64_t foo (void)
>> {
>>     return data[0] + data[1] + data[2];
>> }
>>
>> And this generates :
>>
>>     la  r9,data
>>     la  r

Re: Code optimization with GCSE

2009-07-16 Thread Adam Nemet
Jean Christophe Beyler writes:
> As we can see, all three are using the symbol_ref data before adding
> their offset. But after cse, we get this:
> 
> (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
> (const:DI (plus:DI (symbol_ref:DI ("data") )
> (const_int 8 [0x8] 71 {movdi_internal} (nil))
> 
> (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
> (symbol_ref:DI ("data") )) 71
> {movdi_internal} (nil))

You will have to debug CSE.  The actual replacment is done in this loop in
cse_insn:

  /* Terminate loop when replacement made.  This must terminate since
 the current contents will be tested and will always be valid.  */
  while (1)
{

Adam