Re: extend fwprop optimization
On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote: 2013-03-27 Wei Mi w...@google.com * config/i386/i386.md: Do shift truncation in define_insn instead of define_insn_and_split. Please write ChangeLog as: * config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn. Truncate operand 2 using %b asm operand modifier. (*shift_insnmode3_mask): Ditto. (*rotate_insnmode3_mask): Ditto. OK for mainline and all release branches with these changes. This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6 branch. Fixed thusly, committed as obvious after bootstrapping/regtesting on those targets. 2013-04-03 Jakub Jelinek ja...@redhat.com * config/i386/i386.md (*shiftrt_insnmode3_mask): Use shiftrt instead of shift. --- gcc/config/i386/i386.md.jj 2013-04-03 16:11:07.0 +0200 +++ gcc/config/i386/i386.md 2013-04-03 17:42:15.034672014 +0200 @@ -9827,7 +9827,7 @@ (define_insn *shiftrt_insnmode3_mas (INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1 { - return shift{imodesuffix}\t{%b2, %0|%0, %b2}; + return shiftrt{imodesuffix}\t{%b2, %0|%0, %b2}; } [(set_attr type ishift) (set_attr mode MODE)]) Jakub
Re: extend fwprop optimization
Thanks for helping fixing it. I will take care to verify regression and bootstrap before checkin to release branches next time. Regards, Wei. On Wed, Apr 3, 2013 at 11:08 AM, Jakub Jelinek ja...@redhat.com wrote: On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote: 2013-03-27 Wei Mi w...@google.com * config/i386/i386.md: Do shift truncation in define_insn instead of define_insn_and_split. Please write ChangeLog as: * config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn. Truncate operand 2 using %b asm operand modifier. (*shift_insnmode3_mask): Ditto. (*rotate_insnmode3_mask): Ditto. OK for mainline and all release branches with these changes. This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6 branch. Fixed thusly, committed as obvious after bootstrapping/regtesting on those targets. 2013-04-03 Jakub Jelinek ja...@redhat.com * config/i386/i386.md (*shiftrt_insnmode3_mask): Use shiftrt instead of shift. --- gcc/config/i386/i386.md.jj 2013-04-03 16:11:07.0 +0200 +++ gcc/config/i386/i386.md 2013-04-03 17:42:15.034672014 +0200 @@ -9827,7 +9827,7 @@ (define_insn *shiftrt_insnmode3_mas (INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1 { - return shift{imodesuffix}\t{%b2, %0|%0, %b2}; + return shiftrt{imodesuffix}\t{%b2, %0|%0, %b2}; } [(set_attr type ishift) (set_attr mode MODE)]) Jakub
Re: extend fwprop optimization
On Tue, Apr 2, 2013 at 7:43 AM, Wei Mi w...@google.com wrote: I attached the patch.4 based on r197308. r197308 changes shift-and type truncation from define_insn_and_split to define_insn. patch.4 changes ix86_rtx_costs for shift-and type rtx to get the correct cost for the result after the shift-and truncation. With the patch.1 ~ patch.4, fwprop extension could handle the motivational case 1.c attached by removing all the redundent x 63 operations. patch.1~patch.4 regression and bootstrap ok on x86_64-unknown-linux-gnu. Is it ok for trunk? 2013-04-01 Wei Mi w...@google.com * config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for ashlmode3_mask, *shift_insnmode3_mask and *rotate_insnmode3_mask in i386.md. Patch 4 is OK for mainline and also for release branches that were changed by your previous i386.md patch. Thanks, Uros.
Re: extend fwprop optimization
1.c attached. On Mon, Apr 1, 2013 at 10:43 PM, Wei Mi w...@google.com wrote: I attached the patch.4 based on r197308. r197308 changes shift-and type truncation from define_insn_and_split to define_insn. patch.4 changes ix86_rtx_costs for shift-and type rtx to get the correct cost for the result after the shift-and truncation. With the patch.1 ~ patch.4, fwprop extension could handle the motivational case 1.c attached by removing all the redundent x 63 operations. patch.1~patch.4 regression and bootstrap ok on x86_64-unknown-linux-gnu. Is it ok for trunk? Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote: Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Ok, then I will remove the split_insns call. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a (b 63) to a b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a (b63) == a b) transformation in simplify_binary_operation_1, becuase (a (b63) == a b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think that's guaranteed)? I think the DU chains still be valid. If propagate into the insn that uses the same register twice, the two uses will be replaced when the first use is seen (propagate_rtx_1 will propagate all the occurrances of the same reg in the use insn). When the second use is seen, the df_use and use insn in its insn_info are still available. forward_propagate_into will early return after check reg_mentioned_p (DF_REF_REG (use), parent) and find out no reg is used any more. With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is still valid. I think DF_REF_LOC of USE may be invalid if dangling rtx will be recycled by
Re: extend fwprop optimization
On Thu, Mar 28, 2013 at 5:34 AM, Wei Mi w...@google.com wrote: I am not familiar how to use define_subst, so I write a patch that changes define_insn_and_split to define_insn. bootstrapped and regression tested on x86_64-unknown-linux-gnu. A question is: after that change, Is there anyway I can make targetm.rtx_costs() aware about the truncation, .i.e the cost is only a shift instead of shift + and. Please also change all operand 2 predicates to register_operand. 2013-03-27 Wei Mi w...@google.com * config/i386/i386.md: Do shift truncation in define_insn instead of define_insn_and_split. Please write ChangeLog as: * config/i386/i386.md (*ashlmode3_mask): Rewrite as define_insn. Truncate operand 2 using %b asm operand modifier. (*shift_insnmode3_mask): Ditto. (*rotate_insnmode3_mask): Ditto. OK for mainline and all release branches with these changes. Thanks, Uros.
Re: extend fwprop optimization
I am not familiar how to use define_subst, so I write a patch that changes define_insn_and_split to define_insn. bootstrapped and regression tested on x86_64-unknown-linux-gnu. A question is: after that change, Is there anyway I can make targetm.rtx_costs() aware about the truncation, .i.e the cost is only a shift instead of shift + and. Thanks, Wei. On Tue, Mar 26, 2013 at 11:23 AM, Uros Bizjak ubiz...@gmail.com wrote: On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener richard.guent...@gmail.com wrote: I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn *shift_insn_andmode [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; else shift\t{%2, %0|%0, %2}; } Sorry, rectify a mistake: { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return shift\t{%2, %0|%0, %2}; else return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; } I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction differences. So there is no inconsistency. The i386 backend seems to try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist (well, it's false, so it technically doesn't exist for i386) and recognizes the shift with truncate with the *shift_insnmode3_mask splitter. But I'm not sure why it bothers to do it with a splitter instead of just with a define_insn? Because the split code, [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] is wrong and could be combined into a bit-test instruction. No? That is, why not have define_insn variants for shift instructions with explicit truncation? You are right, the split is harmful in this case. It looks to me, that explicit truncation can be added to split patterns in the most elegant way using proposed define_subst infrastructure. Uros. changelog Description: Binary data patch Description: Binary data
Re: extend fwprop optimization
On Mon, Mar 25, 2013 at 6:33 PM, Wei Mi w...@google.com wrote: I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn *shift_insn_andmode [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; else shift\t{%2, %0|%0, %2}; } Sorry, rectify a mistake: { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return shift\t{%2, %0|%0, %2}; else return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; } I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction differences. So there is no inconsistency. The i386 backend seems to try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist (well, it's false, so it technically doesn't exist for i386) and recognizes the shift with truncate with the *shift_insnmode3_mask splitter. But I'm not sure why it bothers to do it with a splitter instead of just with a define_insn? Because the split code, [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] is wrong and could be combined into a bit-test instruction. No? That is, why not have define_insn variants for shift instructions with explicit truncation? Richard. Thanks, Wei.
Re: extend fwprop optimization
On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener richard.guent...@gmail.com wrote: I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn *shift_insn_andmode [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; else shift\t{%2, %0|%0, %2}; } Sorry, rectify a mistake: { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return shift\t{%2, %0|%0, %2}; else return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; } I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction differences. So there is no inconsistency. The i386 backend seems to try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist (well, it's false, so it technically doesn't exist for i386) and recognizes the shift with truncate with the *shift_insnmode3_mask splitter. But I'm not sure why it bothers to do it with a splitter instead of just with a define_insn? Because the split code, [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] is wrong and could be combined into a bit-test instruction. No? That is, why not have define_insn variants for shift instructions with explicit truncation? You are right, the split is harmful in this case. It looks to me, that explicit truncation can be added to split patterns in the most elegant way using proposed define_subst infrastructure. Uros.
Re: extend fwprop optimization
On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi w...@google.com wrote: This is the patch to add the shift truncation in simplify_binary_operation_1. I add a new hook TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide whether we can do shift truncation. I didn't use TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it uses the macro SHIFT_COUNT_TRUNCATED. If I change SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in TARGET_SHIFT_TRUNCATION_MASK, I need to give TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't trivial to get at many places in existing code. patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Doing this might prove dangerous in case some pass may later decide to use an instruction that behaves in different ways. Consider tem = 1 (n 255); // count truncated x = y tem; // bittest instruction bit nr _not_ truncated so if tem is expanded to use a shift instruction which truncates the shift count the explicit and is dropped. If later combine comes around and combines the bit-test to use the bittest instruction which does not implicitely truncate the cound you have generated wrong-code. So we need to make sure any explicit truncation originally in place is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should not exist at all, but instead there would be two patterns for shifts with implicit truncation - one involving the truncation (canonicalized to bitwise and) and one not involving the truncation. Richard. Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote: Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Ok, then I will remove the split_insns call. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a (b 63) to a b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a (b63) == a b) transformation in simplify_binary_operation_1, becuase (a (b63) == a b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how
Re: extend fwprop optimization
On Mon, Mar 25, 2013 at 2:35 AM, Richard Biener richard.guent...@gmail.com wrote: On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi w...@google.com wrote: This is the patch to add the shift truncation in simplify_binary_operation_1. I add a new hook TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide whether we can do shift truncation. I didn't use TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it uses the macro SHIFT_COUNT_TRUNCATED. If I change SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in TARGET_SHIFT_TRUNCATION_MASK, I need to give TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't trivial to get at many places in existing code. patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Doing this might prove dangerous in case some pass may later decide to use an instruction that behaves in different ways. Consider tem = 1 (n 255); // count truncated x = y tem; // bittest instruction bit nr _not_ truncated so if tem is expanded to use a shift instruction which truncates the shift count the explicit and is dropped. If later combine comes around and combines the bit-test to use the bittest instruction which does not implicitely truncate the cound you have generated wrong-code. So it means the existing truncation pattern defined in insn split is also incorrect because the truncated shift may be combined into a bit test pattern? // The following define_insn_and_split will do shift truncation. (define_insn_and_split *shift_insnmode3_mask [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) (INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1 # 1 [(parallel [(set (match_dup 0) (any_shiftrt:SWI48 (match_dup 1) (match_dup 2))) (clobber (reg:CC FLAGS_REG))])] { if (can_create_pseudo_p ()) operands [2] = force_reg (SImode, operands[2]); operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0); } [(set_attr type ishift) (set_attr mode MODE)]) So we need to make sure any explicit truncation originally in place is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should not exist at all, but instead there would be two patterns for shifts with implicit truncation - one involving the truncation (canonicalized to bitwise and) and one not involving the truncation. Richard. I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn *shift_insn_andmode [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; else shift\t{%2, %0|%0, %2}; } Thanks, Wei.
Re: extend fwprop optimization
I am trying to figure out a way not to lose the opportunity when shift truncation is not combined in a bit test pattern. Can we keep the explicit truncation in RTL, but generate truncation code in assembly? Then only shift truncation which not combined in a bit test pattershift truncationn will happen. (define_insn *shift_insn_andmode [(set (match_operand:SWI48 0 nonimmediate_operand =rm) (any_shiftrt:SWI48 (match_operand:SWI48 1 nonimmediate_operand 0) (subreg:QI (and:SI (match_operand:SI 2 nonimmediate_operand c) (match_operand:SI 3 const_int_operand n)) 0))) (clobber (reg:CC FLAGS_REG))] ix86_binary_operator_ok (CODE, MODEmode, operands) { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; else shift\t{%2, %0|%0, %2}; } Sorry, rectify a mistake: { if ((INTVAL (operands[3]) (GET_MODE_BITSIZE (MODEmode)-1)) == GET_MODE_BITSIZE (MODEmode)-1) return shift\t{%2, %0|%0, %2}; else return and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}; } Thanks, Wei.
Re: extend fwprop optimization
Hi, On Sat, 2013-03-23 at 21:18 -0700, Wei Mi wrote: This is the patch to add the shift truncation in simplify_binary_operation_1. I add a new hook TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide whether we can do shift truncation. I didn't use TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it uses the macro SHIFT_COUNT_TRUNCATED. If I change SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in TARGET_SHIFT_TRUNCATION_MASK, I need to give TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't trivial to get at many places in existing code. During 4.8 development there was a similar issue with the TARGET_CANONICALIZE_COMPARISON hook. As a temporary solution the rtx_code has been passed as int. I think the story started here: http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00379.html The conclusion regarding rtx_code ... http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00646.html Maybe this should be addressed first, since now there are at least two cases where it's in the way. Cheers, Oleg patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote: Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Ok, then I will remove the split_insns call. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a (b 63) to a b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a (b63) == a b) transformation in simplify_binary_operation_1, becuase (a (b63) == a b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think
Re: extend fwprop optimization
This is the patch to add the shift truncation in simplify_binary_operation_1. I add a new hook TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide whether we can do shift truncation. I didn't use TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it uses the macro SHIFT_COUNT_TRUNCATED. If I change SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in TARGET_SHIFT_TRUNCATION_MASK, I need to give TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't trivial to get at many places in existing code. patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu. Thanks, Wei. On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote: Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Ok, then I will remove the split_insns call. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a (b 63) to a b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a (b63) == a b) transformation in simplify_binary_operation_1, becuase (a (b63) == a b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think that's guaranteed)? I think the DU chains still be valid. If propagate into the insn that uses the same register twice, the two uses will be replaced when the first use is seen (propagate_rtx_1 will propagate all the occurrances of the same reg in the use insn). When the second use is seen, the df_use and use insn in its insn_info are still available. forward_propagate_into will early return after check reg_mentioned_p (DF_REF_REG (use), parent) and find out no reg is used any more. With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is still valid. I think DF_REF_LOC of USE may be
Re: extend fwprop optimization
On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi w...@google.com wrote: Hi, On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Ok, then I will remove the split_insns call. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. :-). Sorry, I didn't say it clearly. The motivational case is the one mentioned in the following posts (split_insns changes a (b 63) to a b). http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html If I remove the split_insns call and related cost estimation adjustment, the fwprop 18--22 and 18--23 will punt, because fwprop here looks like a reverse process of cse, the total cost after fwprop change is increased. Def insn 18: Use insn 23 Use insn 22 If we include the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = 0, verified - ok // The cost of insn 22 will not change after fwprop + insn splitting. change[1]: benefit = 0, verified - ok // The insn 23 is the same with insn 22 Total benefit is 5, fwprop will go on. If we remove the split_insns cost estimation adjustment. extra benefit by removing def insn 18 = 5 change[0]: benefit = -4, verified - ok // The costs of insn 22 and insn 23 will increase after fwprop. change[1]: benefit = -4, verified - ok // The insn 23 is the same with insn 22 Total benefit is -3, fwprop will punt. How about adding the (a (b63) == a b) transformation in simplify_binary_operation_1, becuase (a (b63) == a b) is a kind of architecture specific expr simplification? Then fwprop could do the propagation as I expect. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think that's guaranteed)? I think the DU chains still be valid. If propagate into the insn that uses the same register twice, the two uses will be replaced when the first use is seen (propagate_rtx_1 will propagate all the occurrances of the same reg in the use insn). When the second use is seen, the df_use and use insn in its insn_info are still available. forward_propagate_into will early return after check reg_mentioned_p (DF_REF_REG (use), parent) and find out no reg is used any more. With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is still valid. I think DF_REF_LOC of USE may be invalid if dangling rtx will be recycled by garbage collection very soon (I don't know when GC will happen). Although DF_REF_LOC of USE maybe invalid, the early return in forward_propagate_into ensure it will not cause any correctness problem. In any case, returning to the RD problem for DU/UD chains is probably a good idea, now that RD is not such a hog anymore. In effect fwprop.c would return to what it looked like before the patch of r149010. I remove MD problem and use DU/UD instead. As a way forward on all of this, I'd suggest the following steps, each with a separate patch: Thanks for the suggestion! 1.
Re: extend fwprop optimization
On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote: For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. It may not look very intrusive, but there's a lot happening in the back ground. You're creating a lot of new RTL, and then just throw it away again. You fake the compiler into thinking you're much deeper in the pipeline than you really are. You're assuming there are no side-effects other than that some insn gets split, but there are back ends where splitters may have side-effects. Even though I've asked twice now, you still have not explained this motivational case, except to say that there is one. *What* are you trying to do, *what* is not happening without the splits, and *what* happens if you split. Only if you explain that in a lot more detail than I have a motivational case then we can look into what is a proper solution. The problem with some of the splitters is that they exist to break up RTL from 'expand' to initially keep some pattern together to allow the code transformation passes to handle the pattern as one instruction. This made sense when RTL was the only intermediate representation and splitting too early would inhibit some optimizations. But I would expect most (if not all) such cases to be less relevant because of the GIMPLE middle-end work. The only splitters you can trigger are the pre-reload splitters (all the reload_completed conditions obviously can't trigger if you're splitting from fwprop). Perhaps those splitters can/should run earlier, or be made obsolete by expanding directly to the post-splitting insns. Unfortunately, it's not possible to tell for your case, because you haven't explained it yet... So how about keep split_insns and remove peephole in the cost estimation func? I'd strongly oppose this. I do not believe this is necessary, and I think it's conceptually wrong. What happens if you propagate into an insn that uses the same register twice? Will the DU chains still be valid (I don't think that's guaranteed)? I think the DU chains still be valid. If propagate into the insn that uses the same register twice, the two uses will be replaced when the first use is seen (propagate_rtx_1 will propagate all the occurrances of the same reg in the use insn). When the second use is seen, the df_use and use insn in its insn_info are still available. forward_propagate_into will early return after check reg_mentioned_p (DF_REF_REG (use), parent) and find out no reg is used any more. With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is still valid. In any case, returning to the RD problem for DU/UD chains is probably a good idea, now that RD is not such a hog anymore. In effect fwprop.c would return to what it looked like before the patch of r149010. As a way forward on all of this, I'd suggest the following steps, each with a separate patch: 1. replace the MD problem with RD again, and build full DU/UD chains. 2. post all the recog changes separately, with minimum impact on the parts of the compiler you don't really change. (For apply_change_group you could even choose to overload it, or use a NUM argument with a default value -- not sure if default argument values are OK for GCC tho'.) 3. implement propagation into multiple USEs, but without the splitting and peepholing. 4. see about fixing the back end to either split earlier or expand to the desired patterns directly. Ciao! Steven
Re: extend fwprop optimization
Thanks for the helpful comments! I have some replies inlined. Regards, Wei. On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher stevenb@gmail.com wrote: On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote: This is the fwprop extension patch which is put in order. Regression test and bootstrap pass. Please help to review its rationality. The following is a brief description what I have done in the patch. In order to make fwprop more effective in rtl optimization, we extend it to handle general expressions instead of the three cases listed in the head comment in fwprop.c. The major changes include a) We need to check propagation correctness for src exprs of def which contain mem references. Previous fwprop for the three cases above doesn't have the problem. b) We need a better cost model because the benefit is usually not so apparent as the three cases above. For a general fwprop problem, there are two possible sources where benefit comes from. The frist is the new use insn after propagation and simplification may have lower cost than itself before propagation, or propagation may create a new insn, that could be splitted or peephole optimized later and get a lower cost. The second is that if all the uses are replaced with the src of the def insn, the def insn could be deleted. So instead of check each def-use pair independently, we use DU chain to track all the uses for a def. For each def-use pair, we attempt the propagation, record the change candidate in changes[] array, but we wait to confirm the changes until all the pairs with the same def are iterated. The changes confirmation is done in the func confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr, the benefit of each change is ensured by propagation_rtx_1 using should_replace_address, so we just confirm all the changes without checking benefit again. Hello Wei Mi, So IIUC, in essence you are doing: main: FOR_EACH_BB: FOR_BB_INSNS, non-debug insns only: for each df_ref DEF operand on insn: iterate_def_uses iterate_def_uses: for each UD chain from DEF to USE(i): forward_propagate_into confirm changes by total benefit I still like the idea, but there are also still a few design issues to resolve. Some of the same comments as before apply: Do you really, really, reallyreally have to go so low-level as to insn splitting, peephole optimizations, and even register allocation, to get the cost right? That will almost certainly not be acceptable, and I for one would oppose such a change. It's IMHO a violation of proper engineering when your medium-to-high level code transformations have to do that. If you have strong reasons for your approach, it'd be helpful if you can explain them so that we can together look for a less intrusive solution (e.g. splitting earlier, adjusting the cost model, etc.). For the motivational case, I need insn splitting to get the cost right. insn splitting is not very intrusive. All I need is to call split_insns func. I think split_insns is just a pattern matching func just like recog(), which is called at many places. Peephole is not necessary (I add it in order to find as many oppotunities as possible, but from my trace analysis, it doesn't help much). To call peephole2_insn() is indeed intrusive, because peephole assumes reg allocation is completed, I have to insert the ugly workaround below. peephole also requires setting DF_LR_RUN_DCE flag and some initialization of peep2_insn_data array. So how about keep split_insns and remove peephole in the cost estimation func? So things like: + /* we may call peephole2_insns in fwprop phase to estimate how + peephole will affect the cost of the insn transformed by fwprop. + fwprop is done before ira phase. In that case, we simply return + a new pseudo register. */ + if (!strncmp (current_pass-name, fwprop, 6)) +return gen_reg_rtx (mode); and Index: config/i386/i386.c === --- config/i386/i386.c(revision 196270) +++ config/i386/i386.c(working copy) @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest) { rtx tmp; - /* We play register width games, which are only valid after reload. */ - gcc_assert (reload_completed); + /* We play register width games, which are only valid after reload. + An exception: fwprop call peephole to estimate the change benefit, + and peephole will call this func. That is before reload complete. + It will not bring any problem because the peephole2_insns call is + only used for cost estimation in fwprop, and its change will be + abandoned immediately after the cost estimation. */ + if (strncmp (current_pass-name, fwprop, 6)) +gcc_assert (reload_completed); are IMHO not OK. They are intrusive and inserted for peephole. Note that your patch is a bit difficult to read at some points because you
Re: extend fwprop optimization
On 03/10/2013 11:52 PM, Wei Mi wrote: Hi, This is the fwprop extension patch which is put in order. Regression test and bootstrap pass. Please help to review its rationality. The following is a brief description what I have done in the patch. In order to make fwprop more effective in rtl optimization, we extend it to handle general expressions instead of the three cases listed in the head comment in fwprop.c. The major changes include a) We need to check propagation correctness for src exprs of def which contain mem references. Previous fwprop for the three cases above doesn't have the problem. b) We need a better cost model because the benefit is usually not so apparent as the three cases above. For a general fwprop problem, there are two possible sources where benefit comes from. The frist is the new use insn after propagation and simplification may have lower cost than itself before propagation, or propagation may create a new insn, that could be splitted or peephole optimized later and get a lower cost. The second is that if all the uses are replaced with the src of the def insn, the def insn could be deleted. So instead of check each def-use pair independently, we use DU chain to track all the uses for a def. For each def-use pair, we attempt the propagation, record the change candidate in changes[] array, but we wait to confirm the changes until all the pairs with the same def are iterated. The changes confirmation is done in the func confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr, the benefit of each change is ensured by propagation_rtx_1 using should_replace_address, so we just confirm all the changes without checking benefit again. Can you please attach this to the 4.9 pending patches tracker bug. We're really focused on trying to get 4.8 out the door and this doesn't seem like suitable material for GCC 4.8. I haven't looked at the details of the patch at all yet and doubt I would prior to GCC 4.8 going out the door. Thanks, jeff
Re: extend fwprop optimization
On Mon, Mar 11, 2013 at 7:10 PM, Jeff Law l...@redhat.com wrote: On 03/10/2013 11:52 PM, Wei Mi wrote: Hi, This is the fwprop extension patch which is put in order. Regression test and bootstrap pass. Please help to review its rationality. The following is a brief description what I have done in the patch. In order to make fwprop more effective in rtl optimization, we extend it to handle general expressions instead of the three cases listed in the head comment in fwprop.c. The major changes include a) We need to check propagation correctness for src exprs of def which contain mem references. Previous fwprop for the three cases above doesn't have the problem. b) We need a better cost model because the benefit is usually not so apparent as the three cases above. For a general fwprop problem, there are two possible sources where benefit comes from. The frist is the new use insn after propagation and simplification may have lower cost than itself before propagation, or propagation may create a new insn, that could be splitted or peephole optimized later and get a lower cost. The second is that if all the uses are replaced with the src of the def insn, the def insn could be deleted. So instead of check each def-use pair independently, we use DU chain to track all the uses for a def. For each def-use pair, we attempt the propagation, record the change candidate in changes[] array, but we wait to confirm the changes until all the pairs with the same def are iterated. The changes confirmation is done in the func confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr, the benefit of each change is ensured by propagation_rtx_1 using should_replace_address, so we just confirm all the changes without checking benefit again. Can you please attach this to the 4.9 pending patches tracker bug. We're really focused on trying to get 4.8 out the door and this doesn't seem like suitable material for GCC 4.8. I haven't looked at the details of the patch at all yet and doubt I would prior to GCC 4.8 going out the door. Thanks, jeff Jeff, The world has more people than you, and with different interests. This patch was posted here for comments on the idea, and while I'm sure your feedback would be very valuable, it is no more required for discussing this patch than it is for releasing GCC 4.8. Ciao! Steven
Re: extend fwprop optimization
On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote: This is the fwprop extension patch which is put in order. Regression test and bootstrap pass. Please help to review its rationality. The following is a brief description what I have done in the patch. In order to make fwprop more effective in rtl optimization, we extend it to handle general expressions instead of the three cases listed in the head comment in fwprop.c. The major changes include a) We need to check propagation correctness for src exprs of def which contain mem references. Previous fwprop for the three cases above doesn't have the problem. b) We need a better cost model because the benefit is usually not so apparent as the three cases above. For a general fwprop problem, there are two possible sources where benefit comes from. The frist is the new use insn after propagation and simplification may have lower cost than itself before propagation, or propagation may create a new insn, that could be splitted or peephole optimized later and get a lower cost. The second is that if all the uses are replaced with the src of the def insn, the def insn could be deleted. So instead of check each def-use pair independently, we use DU chain to track all the uses for a def. For each def-use pair, we attempt the propagation, record the change candidate in changes[] array, but we wait to confirm the changes until all the pairs with the same def are iterated. The changes confirmation is done in the func confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr, the benefit of each change is ensured by propagation_rtx_1 using should_replace_address, so we just confirm all the changes without checking benefit again. Hello Wei Mi, So IIUC, in essence you are doing: main: FOR_EACH_BB: FOR_BB_INSNS, non-debug insns only: for each df_ref DEF operand on insn: iterate_def_uses iterate_def_uses: for each UD chain from DEF to USE(i): forward_propagate_into confirm changes by total benefit I still like the idea, but there are also still a few design issues to resolve. Some of the same comments as before apply: Do you really, really, reallyreally have to go so low-level as to insn splitting, peephole optimizations, and even register allocation, to get the cost right? That will almost certainly not be acceptable, and I for one would oppose such a change. It's IMHO a violation of proper engineering when your medium-to-high level code transformations have to do that. If you have strong reasons for your approach, it'd be helpful if you can explain them so that we can together look for a less intrusive solution (e.g. splitting earlier, adjusting the cost model, etc.). So things like: + /* we may call peephole2_insns in fwprop phase to estimate how + peephole will affect the cost of the insn transformed by fwprop. + fwprop is done before ira phase. In that case, we simply return + a new pseudo register. */ + if (!strncmp (current_pass-name, fwprop, 6)) +return gen_reg_rtx (mode); and Index: config/i386/i386.c === --- config/i386/i386.c(revision 196270) +++ config/i386/i386.c(working copy) @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest) { rtx tmp; - /* We play register width games, which are only valid after reload. */ - gcc_assert (reload_completed); + /* We play register width games, which are only valid after reload. + An exception: fwprop call peephole to estimate the change benefit, + and peephole will call this func. That is before reload complete. + It will not bring any problem because the peephole2_insns call is + only used for cost estimation in fwprop, and its change will be + abandoned immediately after the cost estimation. */ + if (strncmp (current_pass-name, fwprop, 6)) +gcc_assert (reload_completed); are IMHO not OK. Note that your patch is a bit difficult to read at some points because you have included a bunch of non-changes (whitespaces fixes -- necessary cleanups but not relevant for your patch), see e.g. the changed lines that contain lra_in_progress. Also the changes like: static bool -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags) +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed) which are quite distracting, making it harder to see what has *really* changed. You should probably just a helper function apply_change_group_num() and avoid all the apply_change_group use fixups. In fwprop.c: + /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost + estimation in estimate_split_and_peephole. */ + df_set_flags (DF_LR_RUN_DCE); df_md_add_problem (); df_note_add_problem (); - df_analyze (); + df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN); df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES); + df_analyze (); you add DU and UD
Re: extend fwprop optimization
On Tue, Feb 26, 2013 at 2:59 AM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote: But it is not a good transformation unless we know insn split will change a (b 63) to a b; Here we want to see what the rtl looks like after insn splitting in fwprop cost estimation (We call split_insns in estimate_split_and_peephole(), but not to do insn splitting actually in this phase). So you're splitting to find out that the shift is truncated to 5 or 6 bits. That looks like what you really want is to have SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for i386: /* Define if shifts truncate the shift count which implies one can omit a sign-extension or zero-extension of a shift count. On i386, shifts do truncate the count. But bit test instructions take the modulo of the bit offset operand. */ /* #define SHIFT_COUNT_TRUNCATED */ Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and take an rtx_code (or a pattern) to let the target decide whether a truncation is applicable or not. This is a target thing, so perhaps Uros has some ideas about this. I'm guessing cse.c would then handle your code transformation already, or can be made to do so without a lot of extra work, e.g. teach fold_rtx about such (shift (...) (and (...))) transformations that are really truncations. Thanks for pointing out fold_rtx. I took a look at it and cse yesterday, and I agreed with you fold_rtx could be extended to handle the motivational case. But I still think fwprop extension could be meaningful generally. 1. fold_rtx doesn't handling all the propagation-simplification tasks. It only handles some typical cases. I think cse doesn't want to be very cumbersome to include all the fwprop's functionality. fwprop extension tries to generally handle the propagation-simplification problem. I think cse contains fold_rtx partially because existing fwprop and combine are not ideal. If fwprop could handle the general case, cse could simply try to find common subexpression. 2. fold_rtx does the simplification only based on the current insn, while fwprop extension tries to consider the def-uses group in a whole. When all the uses could be propagated, we have the choices: a) do all the propagations then delete the def insn, even if some propagations may not be beneficial. b) only select beneficial propagations and leave the def insn there. fwprop extension has a cost model to choose which way to go. What do you think? Thanks, Wei.
Re: extend fwprop optimization
On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote: What do you think? I think you'll not be able to teach fold_rtx to perform the transformation you want it to do without having SHIFT_COUNT_TRUNCATED set for i386. I already tried it the other day, but GCC won't do the truncation without knowing the insn is really a shift insn and shift_truncation_mask returns something useful. Ciao! Steven Index: cse.c === --- cse.c (revision 196182) +++ cse.c (working copy) @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn) switch (GET_CODE (folded_arg)) { + case SUBREG: + /* If the SUBREG_REG comes in from an AND, and this is not a + paradoxical subreg, then try to fold the SUBREG. */ + if (REG_P (SUBREG_REG (folded_arg)) +! paradoxical_subreg_p (folded_arg)) + { + rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND); + if (y != 0) + y = simplify_gen_binary(AND, GET_MODE (folded_arg), + XEXP(y, 0), XEXP(y, 1)); + if (y != 0) + folded_arg = y; + } + /* ... fall through ... */ case MEM: case REG: - case SUBREG: const_arg = equiv_constant (folded_arg); break;
Re: extend fwprop optimization
Yes, I agree with you. fold_rtx also needs to be extended because now it only handles the case similar as follows for shift insn: a = b op const1 c = a const2 for our motivational case, the second operand of the first insn is a reg instead of a const. We also need to add the truncation support for our case in simplify_binary_operation. I will send out a more official patch about fwprop extension soon. Then it may be easier to talk about its rationality. Thanks, Wei. On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher stevenb@gmail.com wrote: On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote: What do you think? I think you'll not be able to teach fold_rtx to perform the transformation you want it to do without having SHIFT_COUNT_TRUNCATED set for i386. I already tried it the other day, but GCC won't do the truncation without knowing the insn is really a shift insn and shift_truncation_mask returns something useful. Ciao! Steven Index: cse.c === --- cse.c (revision 196182) +++ cse.c (working copy) @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn) switch (GET_CODE (folded_arg)) { + case SUBREG: + /* If the SUBREG_REG comes in from an AND, and this is not a + paradoxical subreg, then try to fold the SUBREG. */ + if (REG_P (SUBREG_REG (folded_arg)) +! paradoxical_subreg_p (folded_arg)) + { + rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND); + if (y != 0) + y = simplify_gen_binary(AND, GET_MODE (folded_arg), + XEXP(y, 0), XEXP(y, 1)); + if (y != 0) + folded_arg = y; + } + /* ... fall through ... */ case MEM: case REG: - case SUBREG: const_arg = equiv_constant (folded_arg); break;
Re: extend fwprop optimization
On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote: But it is not a good transformation unless we know insn split will change a (b 63) to a b; Here we want to see what the rtl looks like after insn splitting in fwprop cost estimation (We call split_insns in estimate_split_and_peephole(), but not to do insn splitting actually in this phase). So you're splitting to find out that the shift is truncated to 5 or 6 bits. That looks like what you really want is to have SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for i386: /* Define if shifts truncate the shift count which implies one can omit a sign-extension or zero-extension of a shift count. On i386, shifts do truncate the count. But bit test instructions take the modulo of the bit offset operand. */ /* #define SHIFT_COUNT_TRUNCATED */ Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and take an rtx_code (or a pattern) to let the target decide whether a truncation is applicable or not. This is a target thing, so perhaps Uros has some ideas about this. I'm guessing cse.c would then handle your code transformation already, or can be made to do so without a lot of extra work, e.g. teach fold_rtx about such (shift (...) (and (...))) transformations that are really truncations. Ciao! Steven
Re: extend fwprop optimization
On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote: We also take insn splitting and peephole into consideration, .i.e, the cost of the change is the cost after insn splitting and peephole which may be applied to the insn changed. This is useful for the motivational case, It also goes against everything ever done before in RTL land. If you need early splitting like this, then these insns should probably be splitted earlier. And trying to apply peephole2's at this stage is, ehm, very creative... I'm surprised it works at all, peephole2 is supposed to work on strict RTL (i.e. post-reload, all constraints matched, hard regiters only, etc.). NB, you can't use peephole2 unconditionally, not all targets have them. See HAVE_peephole2. Can you explain step-by-step what is going on, that you need this? - break; + /* Compare registers by number. */ +case REG: + return REG_P (reg) REGNO (in) == REGNO (reg); This will not work for hard registers. FWIW, en passant you've made fwprop quadratic in the number of insns in a basic block, in initialize_before_estimate_peephole potentially calling simulate_backwards_to_point repeatedly on every insn in a basic block. Also: no ChangeLog, not following code style conventions, no comments, entire blocks of recently added code disappearing (e.g. the Do not replace an existing REG_EQUAL note stuff)... Don't mean to be too harsh, but in this form I'm not going to look at it. Ciao! Steven
Re: extend fwprop optimization
Hi, On Mon, Feb 25, 2013 at 4:08 PM, Steven Bosscher stevenb@gmail.com wrote: On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote: We also take insn splitting and peephole into consideration, .i.e, the cost of the change is the cost after insn splitting and peephole which may be applied to the insn changed. This is useful for the motivational case, It also goes against everything ever done before in RTL land. If you need early splitting like this, then these insns should probably be splitted earlier. And trying to apply peephole2's at this stage is, ehm, very creative... I'm surprised it works at all, peephole2 is supposed to work on strict RTL (i.e. post-reload, all constraints matched, hard regiters only, etc.). NB, you can't use peephole2 unconditionally, not all targets have them. See HAVE_peephole2. Can you explain step-by-step what is going on, that you need this? I am not trying to actually do the transformation of split and peephole. Just want to know how does the insn look like after the split and peephole, then we can decide whether to do fwprop based on more precise cost caculation. For the motivational case IR before fwprop: (insn 18 17 19 2 (parallel [ (set (reg:SI 75 [ D.2322 ]) (and:SI (reg:SI 88 [ D.2325 ]) (const_int 63 [0x3f]))) (clobber (reg:CC 17 flags)) ]) 1.c:25 402 {*andsi_1} (expr_list:REG_DEAD (reg:SI 88 [ D.2325 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 22 21 23 2 (parallel [ (set (reg:DI 91 [ D.2324 ]) (ashift:DI (reg:DI 71 [ D.2324 ]) (subreg:QI (reg:SI 75 [ D.2322 ]) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 522 {*ashldi3_1} (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 23 22 24 2 (parallel [ (set (reg:DI 92 [ D.2324 ]) (lshiftrt:DI (reg:DI 91 [ D.2324 ]) (subreg:QI (reg:SI 75 [ D.2322 ]) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 556 {*lshrdi3_1} (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ]) (expr_list:REG_DEAD (reg:SI 75 [ D.2322 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil) After propagation, we get (insn 22 21 23 2 (parallel [ (set (reg:DI 91 [ D.2324 ]) (ashift:DI (reg:DI 71 [ D.2324 ]) (subreg:QI (and:SI (reg:SI 88 [ D.2325 ]) (const_int 63 [0x3f])) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 518 {*ashldi3_mask} (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil (insn 23 22 24 2 (parallel [ (set (reg:DI 92 [ D.2324 ]) (lshiftrt:DI (reg:DI 91 [ D.2324 ]) (subreg:QI (and:SI (reg:SI 88 [ D.2325 ]) (const_int 63 [0x3f])) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 539 {*lshrdi3_mask} (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ]) (expr_list:REG_DEAD (reg:SI 75 [ D.2322 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil) But it is not a good transformation unless we know insn split will change a (b 63) to a b; Here we want to see what the rtl looks like after insn splitting in fwprop cost estimation (We call split_insns in estimate_split_and_peephole(), but not to do insn splitting actually in this phase). After checking the result of split_insns, we decide it is beneficial to do the propagation here because insn splitting will optimize the fwprop results. After insn splitting. (insn 39 21 40 2 (parallel [ (set (reg:DI 92 [ D.2324 ]) (ashift:DI (reg:DI 71 [ D.2324 ]) (subreg:QI (reg:SI 88 [ D.2325 ]) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 -1 (nil)) (insn 40 39 24 2 (parallel [ (set (reg:DI 92 [ D.2324 ]) (lshiftrt:DI (reg:DI 92 [ D.2324 ]) (subreg:QI (reg:SI 88 [ D.2325 ]) 0))) (clobber (reg:CC 17 flags)) ]) 1.c:25 -1 (nil)) Similarly I think if we can know what the insn looks like after peephole, it will be better for fwprop. But I don't have a testcase to prove it is better to consider peephole. Maybe I should do some testing how much benefit we will get if consider peephole impact in fwprop. - break; + /* Compare registers by number. */ +case REG: + return REG_P (reg) REGNO (in) == REGNO (reg); This will not work for hard registers. FWIW, en passant you've made fwprop quadratic in the number of insns in a basic block, in initialize_before_estimate_peephole potentially calling simulate_backwards_to_point repeatedly on every insn in a basic