Re: Worse code generated by PRE

2010-09-29 Thread Xinliang David Li
8 119 120 121
> ;; live  in      55 [r55] 57 [r57] 62 [__arg_pointer_register__] 113 114 115
> ;; live  gen     110 118 119 120 121
> ;; live  kill
>
> ;; Pred edge  2 [91.0%]  (fallthru)
> (note 34 33 35 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
>
> (insn 35 34 53 3 tst.c:4 (set (reg/f:SI 123 [ pBuf ])
>        (plus:SI (reg/v/f:SI 114 [ pBuf ])
>            (const_int 2 [0x2]))) 273 {addsi3} (nil))
>
> (insn 53 35 36 3 tst.c:4 (set (reg/f:SI 118)
>        (reg/f:SI 123 [ pBuf ])) -1 (nil))
>
> (insn 36 53 37 3 tst.c:4 (set (reg:HI 119)
>        (plus:HI (reg:HI 113 [ D.3441 ])
>            (const_int -1 [0x]))) 276 {addhi3} 
> (expr_list:REG_DEAD (reg:HI 113 [ D.3441 ])
>        (nil)))
>
> (insn 37 36 38 3 tst.c:4 (set (reg:SI 120)
>        (zero_extend:SI (reg:HI 119))) 1056 {zero_extendhisi2} 
> (expr_list:REG_DEAD (reg:HI 119)
>        (nil)))
>
> (insn 38 37 39 3 tst.c:4 (set (reg:SI 121)
>        (ashift:SI (reg:SI 120)
>            (const_int 1 [0x1]))) 389 {ashlsi3} (expr_list:REG_DEAD (reg:SI 
> 120)
>        (nil)))
>
> (insn 39 38 43 3 tst.c:4 (set (reg/f:SI 110 [ D.3464 ])
>        (plus:SI (reg/f:SI 118)
>            (reg:SI 121))) 273 {addsi3} (expr_list:REG_DEAD (reg:SI 121)
>        (expr_list:REG_DEAD (reg/f:SI 118)
>            (nil
> ;; End of basic block 3 -> ( 4)
> ;; lr  out       55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live  out     55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
>
>
> ;; Succ edge  4 [100.0%]  (fallthru)
>
> ;; Start of basic block ( 6 3) -> 4
> ;; bb 4 artificial_defs: { }
> ;; bb 4 artificial_uses: { u18(55){ }u19(57){ }u20(62){ }}
> ;; lr  in        55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; lr  use       55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; lr  def       114 122
> ;; live  in      55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live  gen     114 122
> ;; live  kill
>
> ;; Pred edge  6 [100.0%]  (fallthru)
> ;; Pred edge  3 [100.0%]  (fallthru)
> (code_label 43 39 40 4 3 "" [0 uses])
>
> (note 40 43 41 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
>
> (insn 41 40 51 4 tst.c:14 (set (mem:HI (reg/v/f:SI 114 [ pBuf ]) [2 *pBuf+0 
> S2 A16])
>        (reg/v:HI 115 [ Val ])) 236 {*movhhi} (nil))
>
> (insn 51 41 44 4 tst.c:14 (set (reg/v/f:SI 114 [ pBuf ])
>        (reg/f:SI 123 [ pBuf ])) -1 (expr_list:REG_EQUAL (plus:SI (reg/v/f:SI 
> 114 [ pBuf ])
>            (const_int 2 [0x2]))
>        (nil)))
>
> (insn 44 51 45 4 tst.c:13 (set (reg:BI 122)
>        (ne:BI (reg/v/f:SI 114 [ pBuf ])
>            (reg/f:SI 110 [ D.3464 ]))) 1006 {cmp_simode} (nil))
>
> (jump_insn 45 44 55 4 tst.c:13 (set (pc)
>        (if_then_else (ne (reg:BI 122)
>                (const_int 0 [0x0]))
>            (label_ref:SI 55)
>            (pc))) 1085 {cbranchbi4} (expr_list:REG_DEAD (reg:BI 122)
>        (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
>            (expr_list:REG_PRED_WIDTH (const_int 4 [0x4])
>                (nil
>  -> 55)
> ;; End of basic block 4 -> ( 6 5)
> ;; lr  out       55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
> ;; live  out     55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
>
>
> ;; Succ edge  6 [91.0%]
> ;; Succ edge  5 [9.0%]  (fallthru)
>
> ;; Start of basic block ( 4) -> 6
> ;; bb 6 artificial_defs: { }
> ;; bb 6 artificial_uses: { u-1(55){ }u-1(57){ }u-1(62){ }}
>
> ;; Pred edge  4 [91.0%]
> (code_label 55 45 54 6 4 "" [1 uses])
>
> (note 54 55 52 6 [bb 6] NOTE_INSN_BASIC_BLOCK)
>
> (insn 52 54 48 6 (set (reg/f:SI 123 [ pBuf ])
>        (plus:SI (reg/v/f:SI 114 [ pBuf ])
>            (const_int 2 [0x2]))) 273 {addsi3} (nil))
> ;; End of basic block 6 -> ( 4)
>
>
>
>
> Thanks,
> Bingfeng
>
>
>
>
>> -Original Message-
>> From: Richard Guenther [mailto:richard.guent...@gmail.com]
>> Sent: 29 September 2010 13:33
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: Worse code generated by PRE
>>
>> On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei  wrote:
>> > Hello,
>> > I have been examining a significant performance regression
>> > between 4.5 and 4.4 in our port. I found that Partial Redundancy
>> > Elimination introduced in 4.5 causes the issue. The following
>> > pseudo code explains the problem:
>> >
>> > BB 3:
>> > r118 <-  r114 + 2
>> >
>> > BB 4:
>> > R114 <-  r114 + 2
>> > ...
>> > Conditional jump to BB 4
>> >
>> > After PRE
>> >
>> > 

RE: Worse code generated by PRE

2010-09-29 Thread Bingfeng Mei
n 38 37 39 3 tst.c:4 (set (reg:SI 121)
(ashift:SI (reg:SI 120)
(const_int 1 [0x1]))) 389 {ashlsi3} (expr_list:REG_DEAD (reg:SI 120)
(nil)))

(insn 39 38 43 3 tst.c:4 (set (reg/f:SI 110 [ D.3464 ])
(plus:SI (reg/f:SI 118)
(reg:SI 121))) 273 {addsi3} (expr_list:REG_DEAD (reg:SI 121)
(expr_list:REG_DEAD (reg/f:SI 118)
(nil
;; End of basic block 3 -> ( 4)
;; lr  out   55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live  out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115


;; Succ edge  4 [100.0%]  (fallthru)

;; Start of basic block ( 6 3) -> 4
;; bb 4 artificial_defs: { }
;; bb 4 artificial_uses: { u18(55){ }u19(57){ }u20(62){ }}
;; lr  in55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; lr  use   55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; lr  def   114 122
;; live  in  55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live  gen 114 122
;; live  kill   

;; Pred edge  6 [100.0%]  (fallthru)
;; Pred edge  3 [100.0%]  (fallthru)
(code_label 43 39 40 4 3 "" [0 uses])

(note 40 43 41 4 [bb 4] NOTE_INSN_BASIC_BLOCK)

(insn 41 40 51 4 tst.c:14 (set (mem:HI (reg/v/f:SI 114 [ pBuf ]) [2 *pBuf+0 S2 
A16])
(reg/v:HI 115 [ Val ])) 236 {*movhhi} (nil))

(insn 51 41 44 4 tst.c:14 (set (reg/v/f:SI 114 [ pBuf ])
(reg/f:SI 123 [ pBuf ])) -1 (expr_list:REG_EQUAL (plus:SI (reg/v/f:SI 
114 [ pBuf ])
(const_int 2 [0x2]))
(nil)))

(insn 44 51 45 4 tst.c:13 (set (reg:BI 122)
(ne:BI (reg/v/f:SI 114 [ pBuf ])
(reg/f:SI 110 [ D.3464 ]))) 1006 {cmp_simode} (nil))

(jump_insn 45 44 55 4 tst.c:13 (set (pc)
(if_then_else (ne (reg:BI 122)
(const_int 0 [0x0]))
(label_ref:SI 55)
(pc))) 1085 {cbranchbi4} (expr_list:REG_DEAD (reg:BI 122)
(expr_list:REG_BR_PROB (const_int 9100 [0x238c])
(expr_list:REG_PRED_WIDTH (const_int 4 [0x4])
(nil
 -> 55)
;; End of basic block 4 -> ( 6 5)
;; lr  out   55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115
;; live  out 55 [r55] 57 [r57] 62 [__arg_pointer_register__] 110 114 115


;; Succ edge  6 [91.0%] 
;; Succ edge  5 [9.0%]  (fallthru)

;; Start of basic block ( 4) -> 6
;; bb 6 artificial_defs: { }
;; bb 6 artificial_uses: { u-1(55){ }u-1(57){ }u-1(62){ }}

;; Pred edge  4 [91.0%] 
(code_label 55 45 54 6 4 "" [1 uses])

(note 54 55 52 6 [bb 6] NOTE_INSN_BASIC_BLOCK)

(insn 52 54 48 6 (set (reg/f:SI 123 [ pBuf ])
(plus:SI (reg/v/f:SI 114 [ pBuf ])
(const_int 2 [0x2]))) 273 {addsi3} (nil))
;; End of basic block 6 -> ( 4)




Thanks,
Bingfeng




> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 29 September 2010 13:33
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Worse code generated by PRE
> 
> On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei  wrote:
> > Hello,
> > I have been examining a significant performance regression
> > between 4.5 and 4.4 in our port. I found that Partial Redundancy
> > Elimination introduced in 4.5 causes the issue. The following
> > pseudo code explains the problem:
> >
> > BB 3:
> > r118 <-  r114 + 2
> >
> > BB 4:
> > R114 <-  r114 + 2
> > ...
> > Conditional jump to BB 4
> >
> > After PRE
> >
> > BB 3:
> > r123 <-  r114 + 2
> > r118 <-  r123
> >
> > BB 4:
> > r114 <- r123
> > conditional jump to BB 5
> >
> > BB5:
> > r123 <- r114 + 2
> > jump to BB 4
> >
> >
> > A simple loop (BB 4) is divided into two basic blocks (BB 4 & 5).
> > An extra jump instruction is introduced. On some targets, this
> > jump can be removed by bb-reorder pass. On our target, it cannot
> > be reordered due to complex doloop_end pattern we generate later.
> > Additionally, since bb-reorder is done in very late phase, the code
> > miss some optimization opportunity such as auto_inc_dec. I don't
> > see any benefit here to do PRE like this. Maybe we should exclude
> > such case in the first place? I read the relevant text in
> > "Advanced Compiler Design Implementation", the example used is linear
> > CFG and it doesn't mention how to handle loop case.
> 
> PRE basically sinks the computation into the latch block (possibly
> creating that).  Note that without a testcase it's hard to tell whether
> this is ok in general.  PRE tries to avoid generation of new induction
> variables and cross-iteration data-dependences, see
> insert_into_preds_of_block.
> Note that 4.4 in principle performs the same optimization (you might
> figure that PRE in 4.4 is generally disabled for -Os but enabled in 4.5,
> but only for hot execution traces following existing practice to tune
> code-size/performance on a fine-grained basis).
> 
> Richard.
> 
> > Any suggestion is greatly appreciated.
> >
> > Thanks,
> > Bingfeng Mei
> >
> >
> >
> >
> >
> >
> >
> >




Re: Worse code generated by PRE

2010-09-29 Thread Richard Guenther
On Wed, Sep 29, 2010 at 2:16 PM, Bingfeng Mei  wrote:
> Hello,
> I have been examining a significant performance regression
> between 4.5 and 4.4 in our port. I found that Partial Redundancy
> Elimination introduced in 4.5 causes the issue. The following
> pseudo code explains the problem:
>
> BB 3:
> r118 <-  r114 + 2
>
> BB 4:
> R114 <-  r114 + 2
> ...
> Conditional jump to BB 4
>
> After PRE
>
> BB 3:
> r123 <-  r114 + 2
> r118 <-  r123
>
> BB 4:
> r114 <- r123
> conditional jump to BB 5
>
> BB5:
> r123 <- r114 + 2
> jump to BB 4
>
>
> A simple loop (BB 4) is divided into two basic blocks (BB 4 & 5).
> An extra jump instruction is introduced. On some targets, this
> jump can be removed by bb-reorder pass. On our target, it cannot
> be reordered due to complex doloop_end pattern we generate later.
> Additionally, since bb-reorder is done in very late phase, the code
> miss some optimization opportunity such as auto_inc_dec. I don't
> see any benefit here to do PRE like this. Maybe we should exclude
> such case in the first place? I read the relevant text in
> "Advanced Compiler Design Implementation", the example used is linear
> CFG and it doesn't mention how to handle loop case.

PRE basically sinks the computation into the latch block (possibly
creating that).  Note that without a testcase it's hard to tell whether
this is ok in general.  PRE tries to avoid generation of new induction
variables and cross-iteration data-dependences, see insert_into_preds_of_block.
Note that 4.4 in principle performs the same optimization (you might
figure that PRE in 4.4 is generally disabled for -Os but enabled in 4.5,
but only for hot execution traces following existing practice to tune
code-size/performance on a fine-grained basis).

Richard.

> Any suggestion is greatly appreciated.
>
> Thanks,
> Bingfeng Mei
>
>
>
>
>
>
>
>