Re: extend fwprop optimization

2013-04-03 Thread Jakub Jelinek
On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote:
 2013-03-27  Wei Mi  w...@google.com
 
   * config/i386/i386.md: Do shift truncation in define_insn
   instead of define_insn_and_split.
 
 Please write ChangeLog as:
 
   * config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn.
   Truncate operand 2 using %b asm operand modifier.
   (*shift_insnmode3_mask): Ditto.
   (*rotate_insnmode3_mask): Ditto.
 
 OK for mainline and all release branches with these changes.

This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6
branch.  Fixed thusly, committed as obvious after bootstrapping/regtesting
on those targets.

2013-04-03  Jakub Jelinek  ja...@redhat.com

* config/i386/i386.md (*shiftrt_insnmode3_mask): Use
shiftrt instead of shift.

--- gcc/config/i386/i386.md.jj  2013-04-03 16:11:07.0 +0200
+++ gcc/config/i386/i386.md 2013-04-03 17:42:15.034672014 +0200
@@ -9827,7 +9827,7 @@ (define_insn *shiftrt_insnmode3_mas
 (INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1
 {
-  return shift{imodesuffix}\t{%b2, %0|%0, %b2};
+  return shiftrt{imodesuffix}\t{%b2, %0|%0, %b2};
 }
   [(set_attr type ishift)
(set_attr mode MODE)])


Jakub


Re: extend fwprop optimization

2013-04-03 Thread Wei Mi
Thanks for helping fixing it. I will take care to verify regression
and bootstrap before checkin to release branches next time.

Regards,
Wei.

On Wed, Apr 3, 2013 at 11:08 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote:
 2013-03-27  Wei Mi  w...@google.com

   * config/i386/i386.md: Do shift truncation in define_insn
   instead of define_insn_and_split.

 Please write ChangeLog as:

   * config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn.
   Truncate operand 2 using %b asm operand modifier.
   (*shift_insnmode3_mask): Ditto.
   (*rotate_insnmode3_mask): Ditto.

 OK for mainline and all release branches with these changes.

 This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6
 branch.  Fixed thusly, committed as obvious after bootstrapping/regtesting
 on those targets.

 2013-04-03  Jakub Jelinek  ja...@redhat.com

 * config/i386/i386.md (*shiftrt_insnmode3_mask): Use
 shiftrt instead of shift.

 --- gcc/config/i386/i386.md.jj  2013-04-03 16:11:07.0 +0200
 +++ gcc/config/i386/i386.md 2013-04-03 17:42:15.034672014 +0200
 @@ -9827,7 +9827,7 @@ (define_insn *shiftrt_insnmode3_mas
  (INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
== GET_MODE_BITSIZE (MODEmode)-1
  {
 -  return shift{imodesuffix}\t{%b2, %0|%0, %b2};
 +  return shiftrt{imodesuffix}\t{%b2, %0|%0, %b2};
  }
[(set_attr type ishift)
 (set_attr mode MODE)])


 Jakub


Re: extend fwprop optimization

2013-04-02 Thread Uros Bizjak
On Tue, Apr 2, 2013 at 7:43 AM, Wei Mi w...@google.com wrote:
 I attached the patch.4 based on r197308. r197308 changes shift-and
 type truncation from define_insn_and_split to define_insn.  patch.4
 changes ix86_rtx_costs for shift-and type rtx to get the correct cost
 for the result after the shift-and truncation.

 With the patch.1 ~ patch.4, fwprop extension could handle the
 motivational case 1.c attached by removing all the redundent x  63
 operations.

 patch.1~patch.4 regression and bootstrap ok on
 x86_64-unknown-linux-gnu. Is it ok for trunk?

 2013-04-01  Wei Mi  w...@google.com

 * config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for
 ashlmode3_mask, *shift_insnmode3_mask and
*rotate_insnmode3_mask in i386.md.

Patch 4 is OK for mainline and also for release branches that were
changed by your previous i386.md patch.

Thanks,
Uros.


Re: extend fwprop optimization

2013-04-01 Thread Wei Mi
1.c attached.

On Mon, Apr 1, 2013 at 10:43 PM, Wei Mi w...@google.com wrote:
 I attached the patch.4 based on r197308. r197308 changes shift-and
 type truncation from define_insn_and_split to define_insn.  patch.4
 changes ix86_rtx_costs for shift-and type rtx to get the correct cost
 for the result after the shift-and truncation.

 With the patch.1 ~ patch.4, fwprop extension could handle the
 motivational case 1.c attached by removing all the redundent x  63
 operations.

 patch.1~patch.4 regression and bootstrap ok on
 x86_64-unknown-linux-gnu. Is it ok for trunk?

 Thanks,
 Wei.

 On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote:
 Hi,

 On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com 
 wrote:
 On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
 For the motivational case, I need insn splitting to get the cost
 right. insn splitting is not very intrusive. All I need is to call
 split_insns func.

 It may not look very intrusive, but there's a lot happening in the
 back ground. You're creating a lot of new RTL, and then just throw it
 away again. You fake the compiler into thinking you're much deeper in
 the pipeline than you really are. You're assuming there are no
 side-effects other than that some insn gets split, but there are back
 ends where splitters may have side-effects.

 Ok, then I will remove the split_insns call.


 Even though I've asked twice now, you still have not explained this
 motivational case, except to say that there is one. *What* are you
 trying to do, *what* is not happening without the splits, and *what*
 happens if you split. Only if you explain that in a lot more detail
 than I have a motivational case then we can look into what is a
 proper solution.

 :-). Sorry, I didn't say it clearly. The motivational case is the one
 mentioned in the following posts (split_insns changes a  (b  63) to
 a  b).
 http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
 http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html

 If I remove the split_insns call and related cost estimation
 adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop
 here looks like a reverse process of cse, the total cost after fwprop
 change is increased.

 Def insn 18:
 Use insn 23
 Use insn 22

 If we include the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
 not change after fwprop + insn splitting.
   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with 
 insn 22
 Total benefit is 5, fwprop will go on.

 If we remove the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
 insn 23 will increase after fwprop.
   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
 with insn 22
 Total benefit is -3, fwprop will punt.

 How about adding the (a  (b63) == a  b) transformation in
 simplify_binary_operation_1, becuase (a  (b63) == a  b) is a
 kind of architecture specific expr simplification? Then fwprop could
 do the propagation as I expect.


 The problem with some of the splitters is that they exist to break up
 RTL from 'expand' to initially keep some pattern together to allow the
 code transformation passes to handle the pattern as one instruction.
 This made sense when RTL was the only intermediate representation and
 splitting too early would inhibit some optimizations. But I would
 expect most (if not all) such cases to be less relevant because of the
 GIMPLE middle-end work. The only splitters you can trigger are the
 pre-reload splitters (all the reload_completed conditions obviously
 can't trigger if you're splitting from fwprop). Perhaps those
 splitters can/should run earlier, or be made obsolete by expanding
 directly to the post-splitting insns.

 Unfortunately, it's not possible to tell for your case, because you
 haven't explained it yet...


 So how about keep split_insns and remove peephole in the cost estimation 
 func?

 I'd strongly oppose this. I do not believe this is necessary, and I
 think it's conceptually wrong.


 What happens if you propagate into an insn that uses the same register
 twice? Will the DU chains still be valid (I don't think that's
 guaranteed)?

 I think the DU chains still be valid. If propagate into the insn that
 uses the same register twice, the two uses will be replaced when the
 first use is seen (propagate_rtx_1 will propagate all the occurrances
 of the same reg in the use insn).  When the second use is seen, the
 df_use and use insn in its insn_info are still available.
 forward_propagate_into will early return after check reg_mentioned_p
 (DF_REF_REG (use), parent) and find out no reg is used  any more.

 With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
 still valid.

 I think DF_REF_LOC of USE may be invalid if dangling rtx will be
 recycled by 

Re: extend fwprop optimization

2013-03-28 Thread Uros Bizjak
On Thu, Mar 28, 2013 at 5:34 AM, Wei Mi w...@google.com wrote:
 I am not familiar how to use define_subst, so I write a patch that
 changes define_insn_and_split to define_insn. bootstrapped and
 regression tested on x86_64-unknown-linux-gnu.

 A question is: after that change, Is there anyway I can make
 targetm.rtx_costs() aware about the truncation, .i.e the cost is only
 a shift instead of shift + and.

Please also change all operand 2 predicates to register_operand.

2013-03-27  Wei Mi  w...@google.com

* config/i386/i386.md: Do shift truncation in define_insn
instead of define_insn_and_split.

Please write ChangeLog as:

* config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn.
Truncate operand 2 using %b asm operand modifier.
(*shift_insnmode3_mask): Ditto.
(*rotate_insnmode3_mask): Ditto.

OK for mainline and all release branches with these changes.

Thanks,
Uros.


Re: extend fwprop optimization

2013-03-27 Thread Wei Mi
I am not familiar how to use define_subst, so I write a patch that
changes define_insn_and_split to define_insn. bootstrapped and
regression tested on x86_64-unknown-linux-gnu.

A question is: after that change, Is there anyway I can make
targetm.rtx_costs() aware about the truncation, .i.e the cost is only
a shift instead of shift + and.

Thanks,
Wei.

On Tue, Mar 26, 2013 at 11:23 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 I am trying to figure out a way not to lose the opportunity when shift
 truncation is not combined in a bit test pattern. Can we keep the
 explicit truncation in RTL, but generate truncation code in assembly?
 Then only shift truncation which not combined in a bit test
 pattershift truncationn will happen.

 (define_insn *shift_insn_andmode
   [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
 (any_shiftrt:SWI48
   (match_operand:SWI48 1 nonimmediate_operand 0)
   (subreg:QI
 (and:SI
   (match_operand:SI 2 nonimmediate_operand c)
   (match_operand:SI 3 const_int_operand n)) 0)))
(clobber (reg:CC FLAGS_REG))]
   ix86_binary_operator_ok (CODE, MODEmode, operands)
 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
else
   shift\t{%2, %0|%0, %2};
 }

 Sorry, rectify a mistake:

 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return shift\t{%2, %0|%0, %2};
else
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
 }

 I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
 is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
 differences.  So there is no inconsistency.  The i386 backend seems to
 try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
 (well, it's false, so it technically doesn't exist for i386) and recognizes
 the shift with truncate with the *shift_insnmode3_mask splitter.
 But I'm not sure why it bothers to do it with a splitter instead of just
 with a define_insn?  Because the split code,

   [(parallel [(set (match_dup 0)
(any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
   (clobber (reg:CC FLAGS_REG))])]

 is wrong and could be combined into a bit-test instruction.  No?

 That is, why not have define_insn variants for shift instructions with
 explicit truncation?

 You are right, the split is harmful in this case.

 It looks to me, that explicit truncation can be added to split
 patterns in the most elegant way using proposed define_subst
 infrastructure.

 Uros.


changelog
Description: Binary data


patch
Description: Binary data


Re: extend fwprop optimization

2013-03-26 Thread Richard Biener
On Mon, Mar 25, 2013 at 6:33 PM, Wei Mi w...@google.com wrote:
 I am trying to figure out a way not to lose the opportunity when shift
 truncation is not combined in a bit test pattern. Can we keep the
 explicit truncation in RTL, but generate truncation code in assembly?
 Then only shift truncation which not combined in a bit test
 pattershift truncationn will happen.

 (define_insn *shift_insn_andmode
   [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
 (any_shiftrt:SWI48
   (match_operand:SWI48 1 nonimmediate_operand 0)
   (subreg:QI
 (and:SI
   (match_operand:SI 2 nonimmediate_operand c)
   (match_operand:SI 3 const_int_operand n)) 0)))
(clobber (reg:CC FLAGS_REG))]
   ix86_binary_operator_ok (CODE, MODEmode, operands)
 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
else
   shift\t{%2, %0|%0, %2};
 }

 Sorry, rectify a mistake:

 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return shift\t{%2, %0|%0, %2};
else
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
 }

I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
differences.  So there is no inconsistency.  The i386 backend seems to
try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
(well, it's false, so it technically doesn't exist for i386) and recognizes
the shift with truncate with the *shift_insnmode3_mask splitter.
But I'm not sure why it bothers to do it with a splitter instead of just
with a define_insn?  Because the split code,

  [(parallel [(set (match_dup 0)
   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
  (clobber (reg:CC FLAGS_REG))])]

is wrong and could be combined into a bit-test instruction.  No?

That is, why not have define_insn variants for shift instructions with
explicit truncation?

Richard.


 Thanks,
 Wei.


Re: extend fwprop optimization

2013-03-26 Thread Uros Bizjak
On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
richard.guent...@gmail.com wrote:
 I am trying to figure out a way not to lose the opportunity when shift
 truncation is not combined in a bit test pattern. Can we keep the
 explicit truncation in RTL, but generate truncation code in assembly?
 Then only shift truncation which not combined in a bit test
 pattershift truncationn will happen.

 (define_insn *shift_insn_andmode
   [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
 (any_shiftrt:SWI48
   (match_operand:SWI48 1 nonimmediate_operand 0)
   (subreg:QI
 (and:SI
   (match_operand:SI 2 nonimmediate_operand c)
   (match_operand:SI 3 const_int_operand n)) 0)))
(clobber (reg:CC FLAGS_REG))]
   ix86_binary_operator_ok (CODE, MODEmode, operands)
 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
else
   shift\t{%2, %0|%0, %2};
 }

 Sorry, rectify a mistake:

 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return shift\t{%2, %0|%0, %2};
else
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
 }

 I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
 is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
 differences.  So there is no inconsistency.  The i386 backend seems to
 try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
 (well, it's false, so it technically doesn't exist for i386) and recognizes
 the shift with truncate with the *shift_insnmode3_mask splitter.
 But I'm not sure why it bothers to do it with a splitter instead of just
 with a define_insn?  Because the split code,

   [(parallel [(set (match_dup 0)
(any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
   (clobber (reg:CC FLAGS_REG))])]

 is wrong and could be combined into a bit-test instruction.  No?

 That is, why not have define_insn variants for shift instructions with
 explicit truncation?

You are right, the split is harmful in this case.

It looks to me, that explicit truncation can be added to split
patterns in the most elegant way using proposed define_subst
infrastructure.

Uros.


Re: extend fwprop optimization

2013-03-25 Thread Richard Biener
On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi w...@google.com wrote:
 This is the patch to add the shift truncation in
 simplify_binary_operation_1. I add a new hook
 TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
 whether we can do shift truncation. I didn't use
 TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
 uses the macro SHIFT_COUNT_TRUNCATED. If I change
 SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
 TARGET_SHIFT_TRUNCATION_MASK, I need to give
 TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
 trivial to get at many places in existing code.

 patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Doing this might prove dangerous in case some pass may later decide
to use an instruction that behaves in different ways.  Consider

   tem = 1 (n  255);   // count truncated
   x = y  tem;  // bittest instruction bit nr _not_ truncated

so if tem is expanded to use a shift instruction which truncates the shift
count the explicit and is dropped.  If later combine comes around and
combines the bit-test to use the bittest instruction which does not
implicitely truncate the cound you have generated wrong-code.

So we need to make sure any explicit truncation originally in place
is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
not exist at all, but instead there would be two patterns for shifts
with implicit truncation - one involving the truncation (canonicalized to
bitwise and) and one not involving the truncation.

Richard.

 Thanks,
 Wei.

 On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote:
 Hi,

 On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com 
 wrote:
 On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
 For the motivational case, I need insn splitting to get the cost
 right. insn splitting is not very intrusive. All I need is to call
 split_insns func.

 It may not look very intrusive, but there's a lot happening in the
 back ground. You're creating a lot of new RTL, and then just throw it
 away again. You fake the compiler into thinking you're much deeper in
 the pipeline than you really are. You're assuming there are no
 side-effects other than that some insn gets split, but there are back
 ends where splitters may have side-effects.

 Ok, then I will remove the split_insns call.


 Even though I've asked twice now, you still have not explained this
 motivational case, except to say that there is one. *What* are you
 trying to do, *what* is not happening without the splits, and *what*
 happens if you split. Only if you explain that in a lot more detail
 than I have a motivational case then we can look into what is a
 proper solution.

 :-). Sorry, I didn't say it clearly. The motivational case is the one
 mentioned in the following posts (split_insns changes a  (b  63) to
 a  b).
 http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
 http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html

 If I remove the split_insns call and related cost estimation
 adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop
 here looks like a reverse process of cse, the total cost after fwprop
 change is increased.

 Def insn 18:
 Use insn 23
 Use insn 22

 If we include the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
 not change after fwprop + insn splitting.
   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with 
 insn 22
 Total benefit is 5, fwprop will go on.

 If we remove the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
 insn 23 will increase after fwprop.
   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
 with insn 22
 Total benefit is -3, fwprop will punt.

 How about adding the (a  (b63) == a  b) transformation in
 simplify_binary_operation_1, becuase (a  (b63) == a  b) is a
 kind of architecture specific expr simplification? Then fwprop could
 do the propagation as I expect.


 The problem with some of the splitters is that they exist to break up
 RTL from 'expand' to initially keep some pattern together to allow the
 code transformation passes to handle the pattern as one instruction.
 This made sense when RTL was the only intermediate representation and
 splitting too early would inhibit some optimizations. But I would
 expect most (if not all) such cases to be less relevant because of the
 GIMPLE middle-end work. The only splitters you can trigger are the
 pre-reload splitters (all the reload_completed conditions obviously
 can't trigger if you're splitting from fwprop). Perhaps those
 splitters can/should run earlier, or be made obsolete by expanding
 directly to the post-splitting insns.

 Unfortunately, it's not possible to tell for your case, because you
 haven't explained it yet...


 So how 

Re: extend fwprop optimization

2013-03-25 Thread Wei Mi
On Mon, Mar 25, 2013 at 2:35 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi w...@google.com wrote:
 This is the patch to add the shift truncation in
 simplify_binary_operation_1. I add a new hook
 TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
 whether we can do shift truncation. I didn't use
 TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
 uses the macro SHIFT_COUNT_TRUNCATED. If I change
 SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
 TARGET_SHIFT_TRUNCATION_MASK, I need to give
 TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
 trivial to get at many places in existing code.

 patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

 Doing this might prove dangerous in case some pass may later decide
 to use an instruction that behaves in different ways.  Consider

tem = 1 (n  255);   // count truncated
x = y  tem;  // bittest instruction bit nr _not_ truncated

 so if tem is expanded to use a shift instruction which truncates the shift
 count the explicit and is dropped.  If later combine comes around and
 combines the bit-test to use the bittest instruction which does not
 implicitely truncate the cound you have generated wrong-code.


So it means the existing truncation pattern defined in insn split is
also incorrect because the truncated shift may be combined into a bit
test pattern?

// The following define_insn_and_split will do shift truncation.
(define_insn_and_split *shift_insnmode3_mask
  [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
(any_shiftrt:SWI48
  (match_operand:SWI48 1 nonimmediate_operand 0)
  (subreg:QI
(and:SI
  (match_operand:SI 2 nonimmediate_operand c)
  (match_operand:SI 3 const_int_operand n)) 0)))
   (clobber (reg:CC FLAGS_REG))]
  ix86_binary_operator_ok (CODE, MODEmode, operands)
(INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
  == GET_MODE_BITSIZE (MODEmode)-1
  #
   1
  [(parallel [(set (match_dup 0)
   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
  (clobber (reg:CC FLAGS_REG))])]
{
  if (can_create_pseudo_p ())
operands [2] = force_reg (SImode, operands[2]);

  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
}
  [(set_attr type ishift)
   (set_attr mode MODE)])

 So we need to make sure any explicit truncation originally in place
 is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
 not exist at all, but instead there would be two patterns for shifts
 with implicit truncation - one involving the truncation (canonicalized to
 bitwise and) and one not involving the truncation.

 Richard.


I am trying to figure out a way not to lose the opportunity when shift
truncation is not combined in a bit test pattern. Can we keep the
explicit truncation in RTL, but generate truncation code in assembly?
Then only shift truncation which not combined in a bit test
pattershift truncationn will happen.

(define_insn *shift_insn_andmode
  [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
(any_shiftrt:SWI48
  (match_operand:SWI48 1 nonimmediate_operand 0)
  (subreg:QI
(and:SI
  (match_operand:SI 2 nonimmediate_operand c)
  (match_operand:SI 3 const_int_operand n)) 0)))
   (clobber (reg:CC FLAGS_REG))]
  ix86_binary_operator_ok (CODE, MODEmode, operands)
{
   if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
  == GET_MODE_BITSIZE (MODEmode)-1)
  return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
   else
  shift\t{%2, %0|%0, %2};
}

Thanks,
Wei.


Re: extend fwprop optimization

2013-03-25 Thread Wei Mi
 I am trying to figure out a way not to lose the opportunity when shift
 truncation is not combined in a bit test pattern. Can we keep the
 explicit truncation in RTL, but generate truncation code in assembly?
 Then only shift truncation which not combined in a bit test
 pattershift truncationn will happen.

 (define_insn *shift_insn_andmode
   [(set (match_operand:SWI48 0 nonimmediate_operand =rm)
 (any_shiftrt:SWI48
   (match_operand:SWI48 1 nonimmediate_operand 0)
   (subreg:QI
 (and:SI
   (match_operand:SI 2 nonimmediate_operand c)
   (match_operand:SI 3 const_int_operand n)) 0)))
(clobber (reg:CC FLAGS_REG))]
   ix86_binary_operator_ok (CODE, MODEmode, operands)
 {
if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
   == GET_MODE_BITSIZE (MODEmode)-1)
   return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
else
   shift\t{%2, %0|%0, %2};
 }

Sorry, rectify a mistake:

{
   if ((INTVAL (operands[3])  (GET_MODE_BITSIZE (MODEmode)-1))
  == GET_MODE_BITSIZE (MODEmode)-1)
  return shift\t{%2, %0|%0, %2};
   else
  return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2};
}

Thanks,
Wei.


Re: extend fwprop optimization

2013-03-24 Thread Oleg Endo
Hi,

On Sat, 2013-03-23 at 21:18 -0700, Wei Mi wrote:
 This is the patch to add the shift truncation in
 simplify_binary_operation_1. I add a new hook
 TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
 whether we can do shift truncation. I didn't use
 TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
 uses the macro SHIFT_COUNT_TRUNCATED. If I change
 SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
 TARGET_SHIFT_TRUNCATION_MASK, I need to give
 TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
 trivial to get at many places in existing code.
 

During 4.8 development there was a similar issue with the
TARGET_CANONICALIZE_COMPARISON hook.  As a temporary solution the
rtx_code has been passed as int.  I think the story started here:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00379.html

The conclusion regarding rtx_code ...
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00646.html

Maybe this should be addressed first, since now there are at least two
cases where it's in the way.

Cheers,
Oleg



 patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.
 
 Thanks,
 Wei.
 
 On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote:
  Hi,
 
  On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com 
  wrote:
  On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
  For the motivational case, I need insn splitting to get the cost
  right. insn splitting is not very intrusive. All I need is to call
  split_insns func.
 
  It may not look very intrusive, but there's a lot happening in the
  back ground. You're creating a lot of new RTL, and then just throw it
  away again. You fake the compiler into thinking you're much deeper in
  the pipeline than you really are. You're assuming there are no
  side-effects other than that some insn gets split, but there are back
  ends where splitters may have side-effects.
 
  Ok, then I will remove the split_insns call.
 
 
  Even though I've asked twice now, you still have not explained this
  motivational case, except to say that there is one. *What* are you
  trying to do, *what* is not happening without the splits, and *what*
  happens if you split. Only if you explain that in a lot more detail
  than I have a motivational case then we can look into what is a
  proper solution.
 
  :-). Sorry, I didn't say it clearly. The motivational case is the one
  mentioned in the following posts (split_insns changes a  (b  63) to
  a  b).
  http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
  http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
 
  If I remove the split_insns call and related cost estimation
  adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop
  here looks like a reverse process of cse, the total cost after fwprop
  change is increased.
 
  Def insn 18:
  Use insn 23
  Use insn 22
 
  If we include the split_insns cost estimation adjustment.
extra benefit by removing def insn 18 = 5
change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
  not change after fwprop + insn splitting.
change[1]: benefit = 0, verified - ok  // The insn 23 is the same with 
  insn 22
  Total benefit is 5, fwprop will go on.
 
  If we remove the split_insns cost estimation adjustment.
extra benefit by removing def insn 18 = 5
change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
  insn 23 will increase after fwprop.
change[1]: benefit = -4, verified - ok   // The insn 23 is the same
  with insn 22
  Total benefit is -3, fwprop will punt.
 
  How about adding the (a  (b63) == a  b) transformation in
  simplify_binary_operation_1, becuase (a  (b63) == a  b) is a
  kind of architecture specific expr simplification? Then fwprop could
  do the propagation as I expect.
 
 
  The problem with some of the splitters is that they exist to break up
  RTL from 'expand' to initially keep some pattern together to allow the
  code transformation passes to handle the pattern as one instruction.
  This made sense when RTL was the only intermediate representation and
  splitting too early would inhibit some optimizations. But I would
  expect most (if not all) such cases to be less relevant because of the
  GIMPLE middle-end work. The only splitters you can trigger are the
  pre-reload splitters (all the reload_completed conditions obviously
  can't trigger if you're splitting from fwprop). Perhaps those
  splitters can/should run earlier, or be made obsolete by expanding
  directly to the post-splitting insns.
 
  Unfortunately, it's not possible to tell for your case, because you
  haven't explained it yet...
 
 
  So how about keep split_insns and remove peephole in the cost estimation 
  func?
 
  I'd strongly oppose this. I do not believe this is necessary, and I
  think it's conceptually wrong.
 
 
  What happens if you propagate into an insn that uses the same register
  twice? Will the DU chains still be valid (I don't think 

Re: extend fwprop optimization

2013-03-23 Thread Wei Mi
This is the patch to add the shift truncation in
simplify_binary_operation_1. I add a new hook
TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
whether we can do shift truncation. I didn't use
TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
uses the macro SHIFT_COUNT_TRUNCATED. If I change
SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
TARGET_SHIFT_TRUNCATION_MASK, I need to give
TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
trivial to get at many places in existing code.

patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Thanks,
Wei.

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote:
 Hi,

 On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com 
 wrote:
 On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
 For the motivational case, I need insn splitting to get the cost
 right. insn splitting is not very intrusive. All I need is to call
 split_insns func.

 It may not look very intrusive, but there's a lot happening in the
 back ground. You're creating a lot of new RTL, and then just throw it
 away again. You fake the compiler into thinking you're much deeper in
 the pipeline than you really are. You're assuming there are no
 side-effects other than that some insn gets split, but there are back
 ends where splitters may have side-effects.

 Ok, then I will remove the split_insns call.


 Even though I've asked twice now, you still have not explained this
 motivational case, except to say that there is one. *What* are you
 trying to do, *what* is not happening without the splits, and *what*
 happens if you split. Only if you explain that in a lot more detail
 than I have a motivational case then we can look into what is a
 proper solution.

 :-). Sorry, I didn't say it clearly. The motivational case is the one
 mentioned in the following posts (split_insns changes a  (b  63) to
 a  b).
 http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
 http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html

 If I remove the split_insns call and related cost estimation
 adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop
 here looks like a reverse process of cse, the total cost after fwprop
 change is increased.

 Def insn 18:
 Use insn 23
 Use insn 22

 If we include the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
 not change after fwprop + insn splitting.
   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 
 22
 Total benefit is 5, fwprop will go on.

 If we remove the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
 insn 23 will increase after fwprop.
   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
 with insn 22
 Total benefit is -3, fwprop will punt.

 How about adding the (a  (b63) == a  b) transformation in
 simplify_binary_operation_1, becuase (a  (b63) == a  b) is a
 kind of architecture specific expr simplification? Then fwprop could
 do the propagation as I expect.


 The problem with some of the splitters is that they exist to break up
 RTL from 'expand' to initially keep some pattern together to allow the
 code transformation passes to handle the pattern as one instruction.
 This made sense when RTL was the only intermediate representation and
 splitting too early would inhibit some optimizations. But I would
 expect most (if not all) such cases to be less relevant because of the
 GIMPLE middle-end work. The only splitters you can trigger are the
 pre-reload splitters (all the reload_completed conditions obviously
 can't trigger if you're splitting from fwprop). Perhaps those
 splitters can/should run earlier, or be made obsolete by expanding
 directly to the post-splitting insns.

 Unfortunately, it's not possible to tell for your case, because you
 haven't explained it yet...


 So how about keep split_insns and remove peephole in the cost estimation 
 func?

 I'd strongly oppose this. I do not believe this is necessary, and I
 think it's conceptually wrong.


 What happens if you propagate into an insn that uses the same register
 twice? Will the DU chains still be valid (I don't think that's
 guaranteed)?

 I think the DU chains still be valid. If propagate into the insn that
 uses the same register twice, the two uses will be replaced when the
 first use is seen (propagate_rtx_1 will propagate all the occurrances
 of the same reg in the use insn).  When the second use is seen, the
 df_use and use insn in its insn_info are still available.
 forward_propagate_into will early return after check reg_mentioned_p
 (DF_REF_REG (use), parent) and find out no reg is used  any more.

 With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
 still valid.

 I think DF_REF_LOC of USE may be 

Re: extend fwprop optimization

2013-03-17 Thread Andrew Pinski
On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote:
 Hi,

 On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com 
 wrote:
 On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
 For the motivational case, I need insn splitting to get the cost
 right. insn splitting is not very intrusive. All I need is to call
 split_insns func.

 It may not look very intrusive, but there's a lot happening in the
 back ground. You're creating a lot of new RTL, and then just throw it
 away again. You fake the compiler into thinking you're much deeper in
 the pipeline than you really are. You're assuming there are no
 side-effects other than that some insn gets split, but there are back
 ends where splitters may have side-effects.

 Ok, then I will remove the split_insns call.


 Even though I've asked twice now, you still have not explained this
 motivational case, except to say that there is one. *What* are you
 trying to do, *what* is not happening without the splits, and *what*
 happens if you split. Only if you explain that in a lot more detail
 than I have a motivational case then we can look into what is a
 proper solution.

 :-). Sorry, I didn't say it clearly. The motivational case is the one
 mentioned in the following posts (split_insns changes a  (b  63) to
 a  b).
 http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
 http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html




 If I remove the split_insns call and related cost estimation
 adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop
 here looks like a reverse process of cse, the total cost after fwprop
 change is increased.

 Def insn 18:
 Use insn 23
 Use insn 22

 If we include the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
 not change after fwprop + insn splitting.
   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 
 22
 Total benefit is 5, fwprop will go on.

 If we remove the split_insns cost estimation adjustment.
   extra benefit by removing def insn 18 = 5
   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
 insn 23 will increase after fwprop.
   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
 with insn 22
 Total benefit is -3, fwprop will punt.

 How about adding the (a  (b63) == a  b) transformation in
 simplify_binary_operation_1, becuase (a  (b63) == a  b) is a
 kind of architecture specific expr simplification? Then fwprop could
 do the propagation as I expect.


 The problem with some of the splitters is that they exist to break up
 RTL from 'expand' to initially keep some pattern together to allow the
 code transformation passes to handle the pattern as one instruction.
 This made sense when RTL was the only intermediate representation and
 splitting too early would inhibit some optimizations. But I would
 expect most (if not all) such cases to be less relevant because of the
 GIMPLE middle-end work. The only splitters you can trigger are the
 pre-reload splitters (all the reload_completed conditions obviously
 can't trigger if you're splitting from fwprop). Perhaps those
 splitters can/should run earlier, or be made obsolete by expanding
 directly to the post-splitting insns.

 Unfortunately, it's not possible to tell for your case, because you
 haven't explained it yet...


 So how about keep split_insns and remove peephole in the cost estimation 
 func?

 I'd strongly oppose this. I do not believe this is necessary, and I
 think it's conceptually wrong.


 What happens if you propagate into an insn that uses the same register
 twice? Will the DU chains still be valid (I don't think that's
 guaranteed)?

 I think the DU chains still be valid. If propagate into the insn that
 uses the same register twice, the two uses will be replaced when the
 first use is seen (propagate_rtx_1 will propagate all the occurrances
 of the same reg in the use insn).  When the second use is seen, the
 df_use and use insn in its insn_info are still available.
 forward_propagate_into will early return after check reg_mentioned_p
 (DF_REF_REG (use), parent) and find out no reg is used  any more.

 With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
 still valid.

 I think DF_REF_LOC of USE may be invalid if dangling rtx will be
 recycled by garbage collection very soon (I don't know when GC will
 happen). Although DF_REF_LOC of USE maybe invalid, the early return in
 forward_propagate_into ensure it will not cause any correctness
 problem.


 In any case, returning to the RD problem for DU/UD chains is probably
 a good idea, now that RD is not such a hog anymore. In effect fwprop.c
 would return to what it looked like before the patch of r149010.

 I remove MD problem and use DU/UD instead.


 As a way forward on all of this, I'd suggest the following steps, each
 with a separate patch:

 Thanks for the suggestion!

 1. 

Re: extend fwprop optimization

2013-03-16 Thread Steven Bosscher
On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
 For the motivational case, I need insn splitting to get the cost
 right. insn splitting is not very intrusive. All I need is to call
 split_insns func.

It may not look very intrusive, but there's a lot happening in the
back ground. You're creating a lot of new RTL, and then just throw it
away again. You fake the compiler into thinking you're much deeper in
the pipeline than you really are. You're assuming there are no
side-effects other than that some insn gets split, but there are back
ends where splitters may have side-effects.

Even though I've asked twice now, you still have not explained this
motivational case, except to say that there is one. *What* are you
trying to do, *what* is not happening without the splits, and *what*
happens if you split. Only if you explain that in a lot more detail
than I have a motivational case then we can look into what is a
proper solution.

The problem with some of the splitters is that they exist to break up
RTL from 'expand' to initially keep some pattern together to allow the
code transformation passes to handle the pattern as one instruction.
This made sense when RTL was the only intermediate representation and
splitting too early would inhibit some optimizations. But I would
expect most (if not all) such cases to be less relevant because of the
GIMPLE middle-end work. The only splitters you can trigger are the
pre-reload splitters (all the reload_completed conditions obviously
can't trigger if you're splitting from fwprop). Perhaps those
splitters can/should run earlier, or be made obsolete by expanding
directly to the post-splitting insns.

Unfortunately, it's not possible to tell for your case, because you
haven't explained it yet...


 So how about keep split_insns and remove peephole in the cost estimation func?

I'd strongly oppose this. I do not believe this is necessary, and I
think it's conceptually wrong.


 What happens if you propagate into an insn that uses the same register
 twice? Will the DU chains still be valid (I don't think that's
 guaranteed)?

 I think the DU chains still be valid. If propagate into the insn that
 uses the same register twice, the two uses will be replaced when the
 first use is seen (propagate_rtx_1 will propagate all the occurrances
 of the same reg in the use insn).  When the second use is seen, the
 df_use and use insn in its insn_info are still available.
 forward_propagate_into will early return after check reg_mentioned_p
 (DF_REF_REG (use), parent) and find out no reg is used  any more.

With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
still valid.

In any case, returning to the RD problem for DU/UD chains is probably
a good idea, now that RD is not such a hog anymore. In effect fwprop.c
would return to what it looked like before the patch of r149010.

As a way forward on all of this, I'd suggest the following steps, each
with a separate patch:
1. replace the MD problem with RD again, and build full DU/UD chains.
2. post all the recog changes separately, with minimum impact on the
parts of the compiler you don't really change. (For apply_change_group
you could even choose to overload it, or use a NUM argument with a
default value -- not sure if default argument values are OK for GCC
tho'.)
3. implement propagation into multiple USEs, but without the splitting
and peepholing.
4. see about fixing the back end to either split earlier or expand to
the desired patterns directly.

Ciao!
Steven


Re: extend fwprop optimization

2013-03-12 Thread Wei Mi
Thanks for the helpful comments! I have some replies inlined.

Regards,
Wei.

On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
 This is the fwprop extension patch which is put in order. Regression
 test and bootstrap pass. Please help to review its rationality. The
 following is a brief description what I have done in the patch.

 In order to make fwprop more effective in rtl optimization, we extend
 it to handle general expressions instead of the three cases listed in
 the head comment in fwprop.c. The major changes include a) We need to
 check propagation correctness for src exprs of def which contain mem
 references. Previous fwprop for the three cases above doesn't have the
 problem. b) We need a better cost model because the benefit is usually
 not so apparent as the three cases above.

 For a general fwprop problem, there are two possible sources where
 benefit comes from. The frist is the new use insn after propagation
 and simplification may have lower cost than itself before propagation,
 or propagation may create a new insn, that could be splitted or
 peephole optimized later and get a lower cost. The second is that if
 all the uses are replaced with the src of the def insn, the def insn
 could be deleted.

 So instead of check each def-use pair independently, we use DU chain
 to track all the uses for a def. For each def-use pair, we attempt the
 propagation, record the change candidate in changes[] array, but we
 wait to confirm the changes until all the pairs with the same def are
 iterated. The changes confirmation is done in the func
 confirm_change_group_by_cost. We only do this for fwprop. For
 fwprop_addr, the benefit of each change is ensured by
 propagation_rtx_1 using should_replace_address, so we just confirm all
 the changes without checking benefit again.

 Hello Wei Mi,

 So IIUC, in essence you are doing:

 main:
   FOR_EACH_BB:
 FOR_BB_INSNS, non-debug insns only:
   for each df_ref DEF operand on insn:
 iterate_def_uses

 iterate_def_uses:
   for each UD chain from DEF to USE(i):
 forward_propagate_into
   confirm changes by total benefit

 I still like the idea, but there are also still a few design issues
 to resolve.

 Some of the same comments as before apply: Do you really, really,
 reallyreally have to go so low-level as to insn splitting, peephole
 optimizations, and even register allocation, to get the cost right?
 That will almost certainly not be acceptable, and I for one would
 oppose such a change. It's IMHO a violation of proper engineering when
 your medium-to-high level code transformations have to do that. If you
 have strong reasons for your approach, it'd be helpful if you can
 explain them so that we can together look for a less intrusive
 solution (e.g. splitting earlier, adjusting the cost model, etc.).


For the motivational case, I need insn splitting to get the cost
right. insn splitting is not very intrusive. All I need is to call
split_insns func. I think split_insns is just a pattern matching func
just like recog(), which is called at many places. Peephole is not
necessary (I add it in order to find as many oppotunities as possible,
but from my trace analysis, it doesn't help much). To call
peephole2_insn() is indeed intrusive, because peephole assumes reg
allocation is completed, I have to insert the ugly workaround below.
peephole also requires setting DF_LR_RUN_DCE flag and some
initialization of peep2_insn_data array.

So how about keep split_insns and remove peephole in the cost estimation func?

 So things like:
 +  /* we may call peephole2_insns in fwprop phase to estimate how
 + peephole will affect the cost of the insn transformed by fwprop.
 + fwprop is done before ira phase. In that case, we simply return
 + a new pseudo register.  */
 +  if (!strncmp (current_pass-name, fwprop, 6))
 +return gen_reg_rtx (mode);

 and

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c(revision 196270)
 +++ config/i386/i386.c(working copy)
 @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
  {
rtx tmp;

 -  /* We play register width games, which are only valid after reload.  */
 -  gcc_assert (reload_completed);
 +  /* We play register width games, which are only valid after reload.
 + An exception: fwprop call peephole to estimate the change benefit,
 + and peephole will call this func. That is before reload complete.
 + It will not bring any problem because the peephole2_insns call is
 + only used for cost estimation in fwprop, and its change will be
 + abandoned immediately after the cost estimation.  */
 +  if (strncmp (current_pass-name, fwprop, 6))
 +gcc_assert (reload_completed);

 are IMHO not OK.


They are intrusive and inserted for peephole.

 Note that your patch is a bit difficult to read at some points because
 you 

Re: extend fwprop optimization

2013-03-11 Thread Jeff Law

On 03/10/2013 11:52 PM, Wei Mi wrote:

Hi,

This is the fwprop extension patch which is put in order. Regression
test and bootstrap pass. Please help to review its rationality. The
following is a brief description what I have done in the patch.

In order to make fwprop more effective in rtl optimization, we extend
it to handle general expressions instead of the three cases listed in
the head comment in fwprop.c. The major changes include a) We need to
check propagation correctness for src exprs of def which contain mem
references. Previous fwprop for the three cases above doesn't have the
problem. b) We need a better cost model because the benefit is usually
not so apparent as the three cases above.

For a general fwprop problem, there are two possible sources where
benefit comes from. The frist is the new use insn after propagation
and simplification may have lower cost than itself before propagation,
or propagation may create a new insn, that could be splitted or
peephole optimized later and get a lower cost. The second is that if
all the uses are replaced with the src of the def insn, the def insn
could be deleted.

So instead of check each def-use pair independently, we use DU chain
to track all the uses for a def. For each def-use pair, we attempt the
propagation, record the change candidate in changes[] array, but we
wait to confirm the changes until all the pairs with the same def are
iterated. The changes confirmation is done in the func
confirm_change_group_by_cost. We only do this for fwprop. For
fwprop_addr, the benefit of each change is ensured by
propagation_rtx_1 using should_replace_address, so we just confirm all
the changes without checking benefit again.
Can you please attach this to the 4.9 pending patches tracker bug. 
We're really focused on trying to get 4.8 out the door and this doesn't 
seem like suitable material for GCC 4.8.


I haven't looked at the details of the patch at all yet and doubt I 
would prior to GCC 4.8 going out the door.


Thanks,
jeff



Re: extend fwprop optimization

2013-03-11 Thread Steven Bosscher
On Mon, Mar 11, 2013 at 7:10 PM, Jeff Law l...@redhat.com wrote:
 On 03/10/2013 11:52 PM, Wei Mi wrote:

 Hi,

 This is the fwprop extension patch which is put in order. Regression
 test and bootstrap pass. Please help to review its rationality. The
 following is a brief description what I have done in the patch.

 In order to make fwprop more effective in rtl optimization, we extend
 it to handle general expressions instead of the three cases listed in
 the head comment in fwprop.c. The major changes include a) We need to
 check propagation correctness for src exprs of def which contain mem
 references. Previous fwprop for the three cases above doesn't have the
 problem. b) We need a better cost model because the benefit is usually
 not so apparent as the three cases above.

 For a general fwprop problem, there are two possible sources where
 benefit comes from. The frist is the new use insn after propagation
 and simplification may have lower cost than itself before propagation,
 or propagation may create a new insn, that could be splitted or
 peephole optimized later and get a lower cost. The second is that if
 all the uses are replaced with the src of the def insn, the def insn
 could be deleted.

 So instead of check each def-use pair independently, we use DU chain
 to track all the uses for a def. For each def-use pair, we attempt the
 propagation, record the change candidate in changes[] array, but we
 wait to confirm the changes until all the pairs with the same def are
 iterated. The changes confirmation is done in the func
 confirm_change_group_by_cost. We only do this for fwprop. For
 fwprop_addr, the benefit of each change is ensured by
 propagation_rtx_1 using should_replace_address, so we just confirm all
 the changes without checking benefit again.

 Can you please attach this to the 4.9 pending patches tracker bug. We're
 really focused on trying to get 4.8 out the door and this doesn't seem like
 suitable material for GCC 4.8.

 I haven't looked at the details of the patch at all yet and doubt I would
 prior to GCC 4.8 going out the door.

 Thanks,
 jeff


Jeff,

The world has more people than you, and with different interests. This
patch was posted here for comments on the idea, and while I'm sure
your feedback would be very valuable, it is no more required for
discussing this patch than it is for releasing GCC 4.8.

Ciao!
Steven


Re: extend fwprop optimization

2013-03-11 Thread Steven Bosscher
On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
 This is the fwprop extension patch which is put in order. Regression
 test and bootstrap pass. Please help to review its rationality. The
 following is a brief description what I have done in the patch.

 In order to make fwprop more effective in rtl optimization, we extend
 it to handle general expressions instead of the three cases listed in
 the head comment in fwprop.c. The major changes include a) We need to
 check propagation correctness for src exprs of def which contain mem
 references. Previous fwprop for the three cases above doesn't have the
 problem. b) We need a better cost model because the benefit is usually
 not so apparent as the three cases above.

 For a general fwprop problem, there are two possible sources where
 benefit comes from. The frist is the new use insn after propagation
 and simplification may have lower cost than itself before propagation,
 or propagation may create a new insn, that could be splitted or
 peephole optimized later and get a lower cost. The second is that if
 all the uses are replaced with the src of the def insn, the def insn
 could be deleted.

 So instead of check each def-use pair independently, we use DU chain
 to track all the uses for a def. For each def-use pair, we attempt the
 propagation, record the change candidate in changes[] array, but we
 wait to confirm the changes until all the pairs with the same def are
 iterated. The changes confirmation is done in the func
 confirm_change_group_by_cost. We only do this for fwprop. For
 fwprop_addr, the benefit of each change is ensured by
 propagation_rtx_1 using should_replace_address, so we just confirm all
 the changes without checking benefit again.

Hello Wei Mi,

So IIUC, in essence you are doing:

main:
  FOR_EACH_BB:
FOR_BB_INSNS, non-debug insns only:
  for each df_ref DEF operand on insn:
iterate_def_uses

iterate_def_uses:
  for each UD chain from DEF to USE(i):
forward_propagate_into
  confirm changes by total benefit

I still like the idea, but there are also still a few design issues
to resolve.

Some of the same comments as before apply: Do you really, really,
reallyreally have to go so low-level as to insn splitting, peephole
optimizations, and even register allocation, to get the cost right?
That will almost certainly not be acceptable, and I for one would
oppose such a change. It's IMHO a violation of proper engineering when
your medium-to-high level code transformations have to do that. If you
have strong reasons for your approach, it'd be helpful if you can
explain them so that we can together look for a less intrusive
solution (e.g. splitting earlier, adjusting the cost model, etc.).

So things like:
 +  /* we may call peephole2_insns in fwprop phase to estimate how
 + peephole will affect the cost of the insn transformed by fwprop.
 + fwprop is done before ira phase. In that case, we simply return
 + a new pseudo register.  */
 +  if (!strncmp (current_pass-name, fwprop, 6))
 +return gen_reg_rtx (mode);

and

 Index: config/i386/i386.c
 ===
 --- config/i386/i386.c(revision 196270)
 +++ config/i386/i386.c(working copy)
 @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
  {
rtx tmp;

 -  /* We play register width games, which are only valid after reload.  */
 -  gcc_assert (reload_completed);
 +  /* We play register width games, which are only valid after reload.
 + An exception: fwprop call peephole to estimate the change benefit,
 + and peephole will call this func. That is before reload complete.
 + It will not bring any problem because the peephole2_insns call is
 + only used for cost estimation in fwprop, and its change will be
 + abandoned immediately after the cost estimation.  */
 +  if (strncmp (current_pass-name, fwprop, 6))
 +gcc_assert (reload_completed);

are IMHO not OK.

Note that your patch is a bit difficult to read at some points because
you have included a bunch of non-changes (whitespaces fixes --
necessary cleanups but not relevant for your patch), see e.g. the
changed lines that contain lra_in_progress. Also the changes like:
  static bool
 -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
 +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)

which are quite distracting, making it harder to see what has *really* changed.

You should probably just a helper function apply_change_group_num()
and avoid all the apply_change_group use fixups.


In fwprop.c:
 +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
 + estimation in estimate_split_and_peephole.  */
 +  df_set_flags (DF_LR_RUN_DCE);
df_md_add_problem ();
df_note_add_problem ();
 -  df_analyze ();
 +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
 +  df_analyze ();

you add DU and UD 

Re: extend fwprop optimization

2013-02-27 Thread Wei Mi
On Tue, Feb 26, 2013 at 2:59 AM, Steven Bosscher stevenb@gmail.com wrote:
 On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote:
 But it is not a good transformation unless we know insn split will
 change a  (b  63) to a  b; Here we want to see what the rtl looks
 like after insn splitting in fwprop cost estimation (We call
 split_insns in estimate_split_and_peephole(), but not to do insn
 splitting actually in this phase).

 So you're splitting to find out that the shift is truncated to 5 or 6
 bits. That looks like what you really want is to have
 SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for
 i386:

 /* Define if shifts truncate the shift count which implies one can
omit a sign-extension or zero-extension of a shift count.

On i386, shifts do truncate the count.  But bit test instructions
take the modulo of the bit offset operand.  */

 /* #define SHIFT_COUNT_TRUNCATED */

 Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and
 take an rtx_code (or a pattern) to let the target decide whether a
 truncation is applicable or not.

 This is a target thing, so perhaps Uros has some ideas about this.

 I'm guessing cse.c would then handle your code transformation already,
 or can be made to do so without a lot of extra work, e.g. teach
 fold_rtx about such (shift (...) (and (...))) transformations that are
 really truncations.

Thanks for pointing out fold_rtx. I took a look at it and cse
yesterday, and I agreed with you fold_rtx could be extended to handle
the motivational case. But I still think fwprop extension could be
meaningful generally.

1. fold_rtx doesn't handling all the propagation-simplification tasks.
It only handles some typical cases. I think cse doesn't want to be
very cumbersome to include all the fwprop's functionality. fwprop
extension tries to generally handle the propagation-simplification
problem. I think cse contains fold_rtx partially because existing
fwprop and combine are not ideal. If fwprop could handle the general
case, cse could simply try to find common subexpression.

2. fold_rtx does the simplification only based on the current insn,
while fwprop extension tries to consider the def-uses group in a
whole. When all the uses could be propagated, we have the choices: a)
do all the propagations then delete the def insn, even if some
propagations may not be beneficial. b) only select beneficial
propagations and leave the def insn there. fwprop extension has a cost
model to choose which way to go.

What do you think?

Thanks,
Wei.


Re: extend fwprop optimization

2013-02-27 Thread Steven Bosscher
On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
 What do you think?

I think you'll not be able to teach fold_rtx to perform the
transformation you want it to do without having SHIFT_COUNT_TRUNCATED
set for i386. I already tried it the other day, but GCC won't do the
truncation without knowing the insn is really a shift insn and
shift_truncation_mask returns something useful.

Ciao!
Steven


Index: cse.c
===
--- cse.c   (revision 196182)
+++ cse.c   (working copy)
@@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)

switch (GET_CODE (folded_arg))
  {
+ case SUBREG:
+   /* If the SUBREG_REG comes in from an AND, and this is not a
+  paradoxical subreg, then try to fold the SUBREG.  */
+   if (REG_P (SUBREG_REG (folded_arg))
+! paradoxical_subreg_p (folded_arg))
+ {
+   rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
+   if (y != 0)
+ y = simplify_gen_binary(AND, GET_MODE (folded_arg),
+ XEXP(y, 0), XEXP(y, 1));
+   if (y != 0)
+ folded_arg = y;
+ }
+   /* ... fall through ...  */
  case MEM:
  case REG:
- case SUBREG:
const_arg = equiv_constant (folded_arg);
break;


Re: extend fwprop optimization

2013-02-27 Thread Wei Mi
Yes, I agree with you. fold_rtx also needs to be extended because now
it only handles the case similar as follows for shift insn:
  a = b op const1
  c = a  const2
for our motivational case, the second operand of the first insn is a
reg instead of a const. We also need to add the truncation support for
our case in simplify_binary_operation.

I will send out a more official patch about fwprop extension soon.
Then it may be easier to talk about its rationality.

Thanks,
Wei.

On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
 What do you think?

 I think you'll not be able to teach fold_rtx to perform the
 transformation you want it to do without having SHIFT_COUNT_TRUNCATED
 set for i386. I already tried it the other day, but GCC won't do the
 truncation without knowing the insn is really a shift insn and
 shift_truncation_mask returns something useful.

 Ciao!
 Steven


 Index: cse.c
 ===
 --- cse.c   (revision 196182)
 +++ cse.c   (working copy)
 @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)

 switch (GET_CODE (folded_arg))
   {
 + case SUBREG:
 +   /* If the SUBREG_REG comes in from an AND, and this is not a
 +  paradoxical subreg, then try to fold the SUBREG.  */
 +   if (REG_P (SUBREG_REG (folded_arg))
 +! paradoxical_subreg_p (folded_arg))
 + {
 +   rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
 +   if (y != 0)
 + y = simplify_gen_binary(AND, GET_MODE (folded_arg),
 + XEXP(y, 0), XEXP(y, 1));
 +   if (y != 0)
 + folded_arg = y;
 + }
 +   /* ... fall through ...  */
   case MEM:
   case REG:
 - case SUBREG:
 const_arg = equiv_constant (folded_arg);
 break;


Re: extend fwprop optimization

2013-02-26 Thread Steven Bosscher
On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote:
 But it is not a good transformation unless we know insn split will
 change a  (b  63) to a  b; Here we want to see what the rtl looks
 like after insn splitting in fwprop cost estimation (We call
 split_insns in estimate_split_and_peephole(), but not to do insn
 splitting actually in this phase).

So you're splitting to find out that the shift is truncated to 5 or 6
bits. That looks like what you really want is to have
SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for
i386:

/* Define if shifts truncate the shift count which implies one can
   omit a sign-extension or zero-extension of a shift count.

   On i386, shifts do truncate the count.  But bit test instructions
   take the modulo of the bit offset operand.  */

/* #define SHIFT_COUNT_TRUNCATED */

Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and
take an rtx_code (or a pattern) to let the target decide whether a
truncation is applicable or not.

This is a target thing, so perhaps Uros has some ideas about this.

I'm guessing cse.c would then handle your code transformation already,
or can be made to do so without a lot of extra work, e.g. teach
fold_rtx about such (shift (...) (and (...))) transformations that are
really truncations.

Ciao!
Steven


Re: extend fwprop optimization

2013-02-25 Thread Steven Bosscher
On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote:
 We also take insn splitting and peephole into consideration,
 .i.e, the cost of the change is the cost after insn splitting and
 peephole which may be applied to the insn changed. This is useful for
 the motivational case,

It also goes against everything ever done before in RTL land.

If you need early splitting like this, then these insns should
probably be splitted earlier. And trying to apply peephole2's at this
stage is, ehm, very creative... I'm surprised it works at all,
peephole2 is supposed to work on strict RTL (i.e. post-reload, all
constraints matched, hard regiters only, etc.). NB, you can't use
peephole2 unconditionally, not all targets have them. See
HAVE_peephole2.

Can you explain step-by-step what is going on, that you need this?

 -  break;
 +  /* Compare registers by number.  */
 +case REG:
 +  return REG_P (reg)  REGNO (in) == REGNO (reg);

This will not work for hard registers.

FWIW, en passant you've made fwprop quadratic in the number of insns
in a basic block, in initialize_before_estimate_peephole potentially
calling simulate_backwards_to_point repeatedly on every insn in a
basic block.

Also: no ChangeLog, not following code style conventions, no comments,
entire blocks of recently added code disappearing (e.g. the Do not
replace an existing REG_EQUAL note stuff)...

Don't mean to be too harsh, but in this form I'm not going to look at it.

Ciao!
Steven


Re: extend fwprop optimization

2013-02-25 Thread Wei Mi
Hi,

On Mon, Feb 25, 2013 at 4:08 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote:
 We also take insn splitting and peephole into consideration,
 .i.e, the cost of the change is the cost after insn splitting and
 peephole which may be applied to the insn changed. This is useful for
 the motivational case,

 It also goes against everything ever done before in RTL land.

 If you need early splitting like this, then these insns should
 probably be splitted earlier. And trying to apply peephole2's at this
 stage is, ehm, very creative... I'm surprised it works at all,
 peephole2 is supposed to work on strict RTL (i.e. post-reload, all
 constraints matched, hard regiters only, etc.). NB, you can't use
 peephole2 unconditionally, not all targets have them. See
 HAVE_peephole2.

 Can you explain step-by-step what is going on, that you need this?


I am not trying to actually do the transformation of split and
peephole. Just want to know how does the insn look like after the
split and peephole, then we can decide whether to do fwprop based on
more precise cost caculation.

For the motivational case
IR before fwprop:
(insn 18 17 19 2 (parallel [
(set (reg:SI 75 [ D.2322 ])
(and:SI (reg:SI 88 [ D.2325 ])
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) 1.c:25 402 {*andsi_1}
 (expr_list:REG_DEAD (reg:SI 88 [ D.2325 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil
(insn 22 21 23 2 (parallel [
(set (reg:DI 91 [ D.2324 ])
(ashift:DI (reg:DI 71 [ D.2324 ])
(subreg:QI (reg:SI 75 [ D.2322 ]) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 522 {*ashldi3_1}
 (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil
(insn 23 22 24 2 (parallel [
(set (reg:DI 92 [ D.2324 ])
(lshiftrt:DI (reg:DI 91 [ D.2324 ])
(subreg:QI (reg:SI 75 [ D.2322 ]) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 556 {*lshrdi3_1}
 (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ])
(expr_list:REG_DEAD (reg:SI 75 [ D.2322 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)

After propagation, we get
(insn 22 21 23 2 (parallel [
(set (reg:DI 91 [ D.2324 ])
(ashift:DI (reg:DI 71 [ D.2324 ])
(subreg:QI (and:SI (reg:SI 88 [ D.2325 ])
(const_int 63 [0x3f])) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 518 {*ashldi3_mask}
 (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil
(insn 23 22 24 2 (parallel [
(set (reg:DI 92 [ D.2324 ])
(lshiftrt:DI (reg:DI 91 [ D.2324 ])
(subreg:QI (and:SI (reg:SI 88 [ D.2325 ])
(const_int 63 [0x3f])) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 539 {*lshrdi3_mask}
 (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ])
(expr_list:REG_DEAD (reg:SI 75 [ D.2322 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)

But it is not a good transformation unless we know insn split will
change a  (b  63) to a  b; Here we want to see what the rtl looks
like after insn splitting in fwprop cost estimation (We call
split_insns in estimate_split_and_peephole(), but not to do insn
splitting actually in this phase). After checking the result of
split_insns, we decide it is beneficial to do the propagation here
because insn splitting will optimize the fwprop results.

After insn splitting.
(insn 39 21 40 2 (parallel [
(set (reg:DI 92 [ D.2324 ])
(ashift:DI (reg:DI 71 [ D.2324 ])
(subreg:QI (reg:SI 88 [ D.2325 ]) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 -1
 (nil))
(insn 40 39 24 2 (parallel [
(set (reg:DI 92 [ D.2324 ])
(lshiftrt:DI (reg:DI 92 [ D.2324 ])
(subreg:QI (reg:SI 88 [ D.2325 ]) 0)))
(clobber (reg:CC 17 flags))
]) 1.c:25 -1
 (nil))

Similarly I think if we can know what the insn looks like after
peephole, it will be better for fwprop. But I don't have a testcase to
prove it is better to consider peephole. Maybe I should do some
testing how much benefit we will get if consider peephole impact in
fwprop.

 -  break;
 +  /* Compare registers by number.  */
 +case REG:
 +  return REG_P (reg)  REGNO (in) == REGNO (reg);

 This will not work for hard registers.

 FWIW, en passant you've made fwprop quadratic in the number of insns
 in a basic block, in initialize_before_estimate_peephole potentially
 calling simulate_backwards_to_point repeatedly on every insn in a
 basic