Code optimization with GCSE
Dear all, As some might know, I've been concentrating on optimizing the handling of loads for my port of GCC. I'm now considering this code: uint64_t data[107]; uint64_t foo (void) { uint64_t x0, x1, x2, x3, x4, x5,x6,x7; uint64_t i; for(i=0;i<107;i++) { data[i] = i; } return data[0] + data[1] + data[2]; } However, the code I get is: la r9,data mov r8, r9 #LOOP, of no consequence, it uses r8 for the stores... la r7,data+8#Loads data+8 in r7 instead of using r9 directly ldd r9,0(r9) #This loads from r9 and the two next from r7 ldd r6,0(r7) ldd r8,8(r7) As you can see, the compiler uses r9 to store data and then uses that for data[0] but also loads in r7 data+8 instead of directly using r9. If I remove the loop then it does not do this. Any ideas where I can look for this, or is this going to be difficult to fix? Thanks ! Jean Christophe Beyler
Re: Code optimization with GCSE
As you can see, the compiler uses r9 to store data and then uses that for data[0] but also loads in r7 data+8 instead of directly using r9. If I remove the loop then it does not do this. This optimization is done by CSE only, currently. That's why it cannot look through loops. Paolo
Re: Code optimization with GCSE
Ah ok, so I can see why it would not be able to perform that optimization around the loop but I changed the code to simply have this: uint64_t foo (void) { return data[0] + data[1] + data[2]; } And this generates : la r9,data la r7,data+8 ldd r6,0(r7) ldd r8,0(r9) ldd r7,16(r9) I'm trying to see if there is a problem with my rtx costs function because again, I don't understand why it would generate 2 la instead of using an offset of 8 and 16. Thanks for any input, Jc On Wed, Jul 15, 2009 at 1:29 AM, Paolo Bonzini wrote: > >> As you can see, the compiler uses r9 to store data and then uses that >> for data[0] but also loads in r7 data+8 instead of directly using r9. >> If I remove the loop then it does not do this. > > This optimization is done by CSE only, currently. That's why it cannot look > through loops. > > Paolo >
Re: Code optimization with GCSE
Jean Christophe Beyler writes: > uint64_t foo (void) > { > return data[0] + data[1] + data[2]; > } > > And this generates : > > la r9,data > la r7,data+8 > ldd r6,0(r7) > ldd r8,0(r9) > ldd r7,16(r9) > > I'm trying to see if there is a problem with my rtx costs function > because again, I don't understand why it would generate 2 la instead > of using an offset of 8 and 16. You probably want to look at the RTL dumps. This code should have been expanded with the correct offsets (at least that is what happens on MIPS). I don't see how later passes would modify the code other than removing 2 of the 3 "la rX, data" insns. Adam
Re: Code optimization with GCSE
The subreg pass has this : (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74) (const:DI (plus:DI (symbol_ref:DI ("data") ) (const_int 8 [0x8] 71 {movdi_internal} (nil)) (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75) (symbol_ref:DI ("data") )) 71 {movdi_internal} (nil)) ... (insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79) (const:DI (plus:DI (symbol_ref:DI ("data") ) (const_int 16 [0x10] 71 {movdi_internal} (nil)) As we can see, all three are using the symbol_ref data before adding their offset. But after cse, we get this: (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74) (const:DI (plus:DI (symbol_ref:DI ("data") ) (const_int 8 [0x8] 71 {movdi_internal} (nil)) (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75) (symbol_ref:DI ("data") )) 71 {movdi_internal} (nil)) ... (insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79) (plus:DI (reg/f:DI 75) (const_int 16 [0x10]))) 2 {adddi3_port} (expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data") table_size = 0, use_info->table_size = 0 ;; invalidated by call 2 [r2] 4 [r4] 5 [r5] 6 [r6] 7 [r7] 8 [r8] 9 [r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 16 [r16] 17 [r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22] 23 [r23] 24 [r24] 25 [r25] 26 [r26] 27 [r27] 28 [r28] 29 [r29] 30 [r30] 31 [r31] 32 [r32] 33 [r33] 34 [r34] 35 [r35] 36 [r36] 37 [r37] 38 [r38] 39 [r39] 40 [r40] 41 [r41] 42 [r42] 43 [r43] 44 [r44] 45 [r45] 46 [r46] 47 [r47] 63 [r63] 64 [$rap] 65 [cc] 66 [acc] ;; hardware regs used 0 [r0] 1 [r1] 3 [r3] ;; regular block artificial uses0 [r0] 1 [r1] 3 [r3] 62 [r62] ;; eh block artificial uses 0 [r0] 1 [r1] 3 [r3] 62 [r62] ;; entry block defs 0 [r0] 1 [r1] 3 [r3] 6 [r6] 8 [r8] 9 [r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 62 [r62] 63 [r63] ;; exit block uses 1 [r1] 3 [r3] 6 [r6] 62 [r62] ;; regs ever live 6[r6] ( )->[0]->( 2 ) ;; bb 0 artificial_defs: { d-1(0){ }d-1(1){ }d-1(3){ }d-1(6){ }d-1(8){ }d-1(9){ }d-1(10){ }d-1(11){ }d-1(12){ }d-1(13){ }d-1(14){ }d-1(15){ }d-1(62){ }d-1(63){ }} ;; bb 0 artificial_uses: { } ( 0 )->[2]->( 1 ) ;; bb 2 artificial_defs: { } ;; bb 2 artificial_uses: { u-1(0){ }u-1(1){ }u-1(3){ }u-1(62){ }} ( 2 )->[1]->( ) ;; bb 1 artificial_defs: { } ;; bb 1 artificial_uses: { u-1(1){ }u-1(3){ }u-1(6){ }u-1(62){ }} Finding needed instructions: Adding insn 23 to worklist Finished finding needed instructions: processing block 2 live out = 0 [r0] 1 [r1] 3 [r3] 6 [r6] 62 [r62] Adding insn 17 to worklist Adding insn 13 to worklist Adding insn 12 to worklist Adding insn 11 to worklist Adding insn 10 to worklist Adding insn 9 to worklist Adding insn 8 to worklist Adding insn 7 to worklist Adding insn 6 to worklist Adding insn 5 to worklist df_worklist_dataflow_overeager:n_basic_blocks 3 n_edges 2 count 3 (1) ;; Following path with 11 sets: 2 deferring rescan insn with uid = 10. deferring rescan insn with uid = 17. try_optimize_cfg iteration 1 (note 3 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) (note 2 3 5 2 NOTE_INSN_FUNCTION_BEG) (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74) (const:DI (plus:DI (symbol_ref:DI ("data") ) (const_int 8 [0x8] 71 {movdi_internal} (nil)) (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75) (symbol_ref:DI ("data") )) 71 {movdi_internal} (nil)) (insn 7 6 8 2 ex1b.c:8 (set (reg:DI 77 [ data+8 ]) (mem/s:DI (reg/f:DI 74) [2 data+8 S8 A64])) 71 {movdi_internal} (nil)) (insn 8 7 9 2 ex1b.c:8 (set (reg:DI 78 [ data ]) (mem/s:DI (reg/f:DI 75) [2 data+0 S8 A64])) 71 {movdi_internal} (nil)) (insn 9 8 10 2 ex1b.c:8 (set (reg:DI 76) (plus:DI (reg:DI 77 [ data+8 ]) (reg:DI 78 [ data ]))) 2 {adddi3_port} (nil)) (insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79) (plus:DI (reg/f:DI 75) (const_int 16 [0x10]))) 2 {adddi3_port} (expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data") ) (const_int 16 [0x10]))) (nil))) (insn 11 10 12 2 ex1b.c:8 (set (reg:DI 80 [ data+16 ]) (mem/s:DI (reg/f:DI 79) [2 data+16 S8 A64])) 71 {movdi_internal} (nil)) (insn 12 11 13 2 ex1b.c:8 (set (reg:DI 73) (plus:DI (reg:DI 76) (reg:DI 80 [ data+16 ]))) 2 {adddi3_port} (nil)) (insn 13 12 17 2 ex1b.c:8 (set (reg:DI 72 [ ]) (reg:DI 73)) 71 {movdi_internal} (nil)) (insn 17 13 23 2 ex1b.c:10 (set (reg/i:DI 6 r6) (reg:DI 73)) 71 {movdi_internal} (nil)) (insn 23 17 0 2 ex1b.c:10 (use (reg/i:DI 6 r6)) -1 (nil)) starting the processing of deferred insns rescanning insn with uid = 10. deleting insn with uid = 10. rescanning insn with uid = 17. deleting insn with uid = 17. ending the processing of deferred insns On Wed, Jul 15, 2009 at 12:25 PM, Adam Nemet wrote: > Jean Christophe Beyler writes: >> uint64_t foo (void) >> { >> return data[0] + data[1] + data[2]; >> } >> >> And this generates : >> >> la r9,data >> la r
Re: Code optimization with GCSE
Jean Christophe Beyler writes: > As we can see, all three are using the symbol_ref data before adding > their offset. But after cse, we get this: > > (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74) > (const:DI (plus:DI (symbol_ref:DI ("data") ) > (const_int 8 [0x8] 71 {movdi_internal} (nil)) > > (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75) > (symbol_ref:DI ("data") )) 71 > {movdi_internal} (nil)) You will have to debug CSE. The actual replacment is done in this loop in cse_insn: /* Terminate loop when replacement made. This must terminate since the current contents will be tested and will always be valid. */ while (1) { Adam