Re: Worse code generated by PRE
8 119 120 121
> ;; live in 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 113 114 115
> ;; live gen 110 118 119 120 121
> ;; live kill
>
> ;; Pred edge 2 [91.0%] (fallthru)
> (note 34 33 35 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
>
> (insn 35 34 53 3 tst.c:4 (set (reg/f:SI 123 [ pBuf ])
>         (plus:SI (reg/v/f:SI 114 [ pBuf ])
>             (const_int 2 [0x2]))) 273 {addsi3} (nil))
>
> (insn 53 35 36 3 tst.c:4 (set (reg/f:SI 118)
>         (reg/f:SI 123 [ pBuf ])) -1 (nil))
>
> (insn 36 53 37 3 tst.c:4 (set (reg:HI 119)
>         (plus:HI (reg:HI 113 [ D.3441 ])
>             (const_int -1 [0x]))) 276 {addhi3}
>     (expr_list:REG_DEAD (reg:HI 113 [ D.3441 ])
>         (nil)))
>
> (insn 37 36 38 3 tst.c:4 (set (reg:SI 120)
>         (zero_extend:SI (reg:HI 119))) 1056 {zero_extendhisi2}
>     (expr_list:REG_DEAD (reg:HI 119)
>         (nil)))
>
> (insn 38 37 39 3 tst.c:4 (set (reg:SI 121)
>         (ashift:SI (reg:SI 120)
>             (const_int 1 [0x1]))) 389 {ashlsi3} (expr_list:REG_DEAD (reg:SI 120)
>         (nil)))
>
> (insn 39 38 43 3 tst.c:4 (set (reg/f:SI 110 [ D.3464 ])
>         (plus:SI (reg/f:SI 118)
>             (reg:SI 121))) 273 {addsi3} (expr_list:REG_DEAD (reg:SI 121)
>         (expr_list:REG_DEAD (reg/f:SI 118)
>             (nil))))
> ;; End of basic block 3 -> ( 4)
> ;; lr out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
>
>
> ;; Succ edge 4 [100.0%] (fallthru)
>
> ;; Start of basic block ( 6 3) -> 4
> ;; bb 4 artificial_defs: { }
> ;; bb 4 artificial_uses: { u18(55){ }u19(57){ }u20(62){ }}
> ;; lr in 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; lr use 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; lr def 114 122
> ;; live in 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live gen 114 122
> ;; live kill
>
> ;; Pred edge 6 [100.0%] (fallthru)
> ;; Pred edge 3 [100.0%] (fallthru)
> (code_label 43 39 40 4 3 "" [0 uses])
>
> (note 40 43 41 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
>
> (insn 41 40 51 4 tst.c:14 (set (mem:HI (reg/v/f:SI 114 [ pBuf ]) [2 *pBuf+0 S2 A16])
>         (reg/v:HI 115 [ Val ])) 236 {*movhhi} (nil))
>
> (insn 51 41 44 4 tst.c:14 (set (reg/v/f:SI 114 [ pBuf ])
>         (reg/f:SI 123 [ pBuf ])) -1 (expr_list:REG_EQUAL (plus:SI (reg/v/f:SI 114 [ pBuf ])
>             (const_int 2 [0x2]))
>         (nil)))
>
> (insn 44 51 45 4 tst.c:13 (set (reg:BI 122)
>         (ne:BI (reg/v/f:SI 114 [ pBuf ])
>             (reg/f:SI 110 [ D.3464 ]))) 1006 {cmp_simode} (nil))
>
> (jump_insn 45 44 55 4 tst.c:13 (set (pc)
>         (if_then_else (ne (reg:BI 122)
>                 (const_int 0 [0x0]))
>             (label_ref:SI 55)
>             (pc))) 1085 {cbranchbi4} (expr_list:REG_DEAD (reg:BI 122)
>         (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
>             (expr_list:REG_PRED_WIDTH (const_int 4 [0x4])
>                 (nil))))
>  -> 55)
> ;; End of basic block 4 -> ( 6 5)
> ;; lr out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
>
>
> ;; Succ edge 6 [91.0%]
> ;; Succ edge 5 [9.0%] (fallthru)
>
> ;; Start of basic block ( 4) -> 6
> ;; bb 6 artificial_defs: { }
> ;; bb 6 artificial_uses: { u-1(55){ }u-1(57){ }u-1(62){ }}
>
> ;; Pred edge 4 [91.0%]
> (code_label 55 45 54 6 4 "" [1 uses])
>
> (note 54 55 52 6 [bb 6] NOTE_INSN_BASIC_BLOCK)
>
> (insn 52 54 48 6 (set (reg/f:SI 123 [ pBuf ])
>         (plus:SI (reg/v/f:SI 114 [ pBuf ])
>             (const_int 2 [0x2]))) 273 {addsi3} (nil))
> ;; End of basic block 6 -> ( 4)
>
>
> Thanks,
> Bingfeng
>
>> -----Original Message-----
>> From: Richard Guenther [mailto:richard.guent...@gmail.com]
>> Sent: 29 September 2010 13:33
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: Worse code generated by PRE
>>
>> On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei wrote:
>> > Hello,
>> > I have been examining a significant performance regression
>> > between 4.5 and 4.4 in our port. I found that Partial Redundancy
>> > Elimination introduced in 4.5 causes the issue. The following
>> > pseudo code explains the problem:
>> >
>> > BB 3:
>> >   r118 <- r114 + 2
>> >
>> > BB 4:
>> >   R114 <- r114 + 2
>> >   ...
>> >   Conditional jump to BB 4
>> >
>> > After PRE
>> >
>> >
RE: Worse code generated by PRE
n 38 37 39 3 tst.c:4 (set (reg:SI 121)
        (ashift:SI (reg:SI 120)
            (const_int 1 [0x1]))) 389 {ashlsi3} (expr_list:REG_DEAD (reg:SI 120)
        (nil)))

(insn 39 38 43 3 tst.c:4 (set (reg/f:SI 110 [ D.3464 ])
        (plus:SI (reg/f:SI 118)
            (reg:SI 121))) 273 {addsi3} (expr_list:REG_DEAD (reg:SI 121)
        (expr_list:REG_DEAD (reg/f:SI 118)
            (nil))))
;; End of basic block 3 -> ( 4)
;; lr out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115


;; Succ edge 4 [100.0%] (fallthru)

;; Start of basic block ( 6 3) -> 4
;; bb 4 artificial_defs: { }
;; bb 4 artificial_uses: { u18(55){ }u19(57){ }u20(62){ }}
;; lr in 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; lr use 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; lr def 114 122
;; live in 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live gen 114 122
;; live kill

;; Pred edge 6 [100.0%] (fallthru)
;; Pred edge 3 [100.0%] (fallthru)
(code_label 43 39 40 4 3 "" [0 uses])

(note 40 43 41 4 [bb 4] NOTE_INSN_BASIC_BLOCK)

(insn 41 40 51 4 tst.c:14 (set (mem:HI (reg/v/f:SI 114 [ pBuf ]) [2 *pBuf+0 S2 A16])
        (reg/v:HI 115 [ Val ])) 236 {*movhhi} (nil))

(insn 51 41 44 4 tst.c:14 (set (reg/v/f:SI 114 [ pBuf ])
        (reg/f:SI 123 [ pBuf ])) -1 (expr_list:REG_EQUAL (plus:SI (reg/v/f:SI 114 [ pBuf ])
            (const_int 2 [0x2]))
        (nil)))

(insn 44 51 45 4 tst.c:13 (set (reg:BI 122)
        (ne:BI (reg/v/f:SI 114 [ pBuf ])
            (reg/f:SI 110 [ D.3464 ]))) 1006 {cmp_simode} (nil))

(jump_insn 45 44 55 4 tst.c:13 (set (pc)
        (if_then_else (ne (reg:BI 122)
                (const_int 0 [0x0]))
            (label_ref:SI 55)
            (pc))) 1085 {cbranchbi4} (expr_list:REG_DEAD (reg:BI 122)
        (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
            (expr_list:REG_PRED_WIDTH (const_int 4 [0x4])
                (nil))))
 -> 55)
;; End of basic block 4 -> ( 6 5)
;; lr out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115


;; Succ edge 6 [91.0%]
;; Succ edge 5 [9.0%] (fallthru)

;; Start of basic block ( 4) -> 6
;; bb 6 artificial_defs: { }
;; bb 6 artificial_uses: { u-1(55){ }u-1(57){ }u-1(62){ }}

;; Pred edge 4 [91.0%]
(code_label 55 45 54 6 4 "" [1 uses])

(note 54 55 52 6 [bb 6] NOTE_INSN_BASIC_BLOCK)

(insn 52 54 48 6 (set (reg/f:SI 123 [ pBuf ])
        (plus:SI (reg/v/f:SI 114 [ pBuf ])
            (const_int 2 [0x2]))) 273 {addsi3} (nil))
;; End of basic block 6 -> ( 4)


Thanks,
Bingfeng

> -----Original Message-----
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 29 September 2010 13:33
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Worse code generated by PRE
>
> On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei wrote:
> > Hello,
> > I have been examining a significant performance regression
> > between 4.5 and 4.4 in our port. I found that Partial Redundancy
> > Elimination introduced in 4.5 causes the issue. The following
> > pseudo code explains the problem:
> >
> > BB 3:
> >   r118 <- r114 + 2
> >
> > BB 4:
> >   R114 <- r114 + 2
> >   ...
> >   Conditional jump to BB 4
> >
> > After PRE
> >
> > BB 3:
> >   r123 <- r114 + 2
> >   r118 <- r123
> >
> > BB 4:
> >   r114 <- r123
> >   conditional jump to BB 5
> >
> > BB5:
> >   r123 <- r114 + 2
> >   jump to BB 4
> >
> >
> > A simple loop (BB 4) is divided into two basic blocks (BB 4 & 5).
> > An extra jump instruction is introduced. On some targets, this
> > jump can be removed by bb-reorder pass. On our target, it cannot
> > be reordered due to complex doloop_end pattern we generate later.
> > Additionally, since bb-reorder is done in very late phase, the code
> > miss some optimization opportunity such as auto_inc_dec. I don't
> > see any benefit here to do PRE like this. Maybe we should exclude
> > such case in the first place? I read the relevant text in
> > "Advanced Compiler Design Implementation", the example used is linear
> > CFG and it doesn't mention how to handle loop case.
>
> PRE basically sinks the computation into the latch block (possibly
> creating that). Note that without a testcase it's hard to tell whether
> this is ok in general. PRE tries to avoid generation of new induction
> variables and cross-iteration data-dependences, see
> insert_into_preds_of_block.
> Note that 4.4 in principle performs the same optimization (you might
> figure that PRE in 4.4 is generally disabled for -Os but enabled in 4.5,
> but only for hot execution traces following existing practice to tune
> code-size/performance on a fine-grained basis).
>
> Richard.
>
> > Any suggestion is greatly appreciated.
> >
> > Thanks,
> > Bingfeng Mei
Re: Worse code generated by PRE
On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei wrote:
> Hello,
> I have been examining a significant performance regression
> between 4.5 and 4.4 in our port. I found that Partial Redundancy
> Elimination introduced in 4.5 causes the issue. The following
> pseudo code explains the problem:
>
> BB 3:
>   r118 <- r114 + 2
>
> BB 4:
>   R114 <- r114 + 2
>   ...
>   Conditional jump to BB 4
>
> After PRE
>
> BB 3:
>   r123 <- r114 + 2
>   r118 <- r123
>
> BB 4:
>   r114 <- r123
>   conditional jump to BB 5
>
> BB5:
>   r123 <- r114 + 2
>   jump to BB 4
>
>
> A simple loop (BB 4) is divided into two basic blocks (BB 4 & 5).
> An extra jump instruction is introduced. On some targets, this
> jump can be removed by bb-reorder pass. On our target, it cannot
> be reordered due to complex doloop_end pattern we generate later.
> Additionally, since bb-reorder is done in very late phase, the code
> miss some optimization opportunity such as auto_inc_dec. I don't
> see any benefit here to do PRE like this. Maybe we should exclude
> such case in the first place? I read the relevant text in
> "Advanced Compiler Design Implementation", the example used is linear
> CFG and it doesn't mention how to handle loop case.

PRE basically sinks the computation into the latch block (possibly
creating that). Note that without a testcase it's hard to tell whether
this is ok in general. PRE tries to avoid generation of new induction
variables and cross-iteration data-dependences, see
insert_into_preds_of_block.
Note that 4.4 in principle performs the same optimization (you might
figure that PRE in 4.4 is generally disabled for -Os but enabled in 4.5,
but only for hot execution traces following existing practice to tune
code-size/performance on a fine-grained basis).

Richard.

> Any suggestion is greatly appreciated.
>
> Thanks,
> Bingfeng Mei
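
For readers following the pseudo code quoted in this thread, here is a rough source-level sketch of the two loop shapes. It is not the original tst.c, which was never posted; the function names fill_before and fill_after, the parameter n, and the uint16_t element type are assumptions made only for illustration (the dump only shows a 16-bit store through pBuf and a pointer step of 2 bytes). The second function hand-writes what the "After PRE" pseudo code describes: the p + 1 computation is done once ahead of the loop and again in a separate latch block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop shape before PRE: one basic block that stores,
   increments the pointer, and branches back to itself. */
void fill_before(uint16_t *p, uint16_t val, size_t n)
{
    assert(n > 0);              /* the do-while form needs at least one element */
    uint16_t *end = p + n;
    do {
        *p = val;               /* BB 4: store through r114 */
        p = p + 1;              /* BB 4: r114 <- r114 + 2 (bytes) */
    } while (p != end);         /* BB 4: conditional jump to BB 4 */
}

/* Hand-written equivalent of the "After PRE" pseudo code: the increment
   is computed ahead of the loop and recomputed in a separate latch block. */
void fill_after(uint16_t *p, uint16_t val, size_t n)
{
    assert(n > 0);
    uint16_t *end = p + n;
    uint16_t *next = p + 1;     /* BB 3: r123 <- r114 + 2 */
    for (;;) {
        *p = val;               /* BB 4: store through r114 */
        p = next;               /* BB 4: r114 <- r123 */
        if (p == end)
            break;              /* loop exit */
        next = p + 1;           /* latch block: r123 <- r114 + 2 */
    }
}
```

Both functions write the same n elements. The difference is only in block structure: the second form has a separate latch block holding the add, which is where the extra jump in the report comes from unless a later pass lays the blocks out so that it falls through.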