[PATCH] fwprop: Allow UNARY_P and check register pressure.
Hi, originally inspired by the wish to transform vmv v3, a0 ; = vec_duplicate vadd.vv v1, v2, v3 into vadd.vx v1, v2, a0 via fwprop for riscv, this patch enables the forward propagation of UNARY_P sources. As this involves potentially replacing a vector register with a scalar register the ira_hoist_pressure machinery is used to calculate the change in register pressure. If the propagation would increase the pressure beyond the number of hard regs, we don't perform it. The regpressure commit this patch depends on is not yet pushed because I figured I'd post the code using it in case of further comments. The testsuite is unchanged on i386 and power10 but aarch64 has some new FAILs but I'm not terribly familiar with aarch64, hence some examples here: The following cases shrn-combine-1/2/3 seem worse as we emit one instruction more. Before: shrnv30.8b, v30.8h, 2 shrn2 v30.16b, v31.8h, 2 str q30, [x1, x3] After: ushrv30.8h, v30.8h, 2 ushrv31.8h, v31.8h, 2 uzp1v31.16b, v30.16b, v31.16b str q31, [x1, x3] Here, one optimization already happens in fwprop so combine cannot do the same work it did before. vec-init-22-speed.c changes from sxthw0, w0 sxthw1, w1 fmovd31, x0 fmovd0, x1 to: dup v31.4h, w0 dup v0.4h, w1 which I hope has the same semantics and is shorter. Apart from that there are numerous check-asm testcases where a new, earlier propagation prevents a later, supposedly better propagation. One such example is from ld4_s8.c: (insn 11 10 12 2 (set (reg:DI 100) (neg:DI (reg:DI 102))) 385 {negdi2} (expr_list:REG_EQUAL (const_poly_int:DI [-576, -576]) (nil))) (insn 12 11 13 2 (set (reg/f:DI 99) (plus:DI (reg/v/f:DI 97 [ x0 ]) (reg:DI 100))) 153 {*adddi3_aarch64} (expr_list:REG_EQUAL (plus:DI (reg/v/f:DI 97 [ x0 ]) (const_poly_int:DI [-576, -576])) (nil))) (insn 13 12 14 2 (set (reg/v:VNx64QI 94 [ z0 ]) (unspec:VNx64QI [ (reg/v:VNx16BI 96 [ p0 ]) (mem:VNx64QI (reg/f:DI 99) [0 MEM [(signed char *)_1]+0 S[64, 64] A8]) ] UNSPEC_LDN)) 5885 {vec_mask_load_lanesvnx64qivnx16qi} (nil)) where we now do: propagating insn 11 into insn 12, replacing: (set (reg/f:DI 99) (plus:DI (reg/v/f:DI 97 [ x0 ]) (reg:DI 100))) successfully matched this instruction to subdi3: (set (reg/f:DI 99) (minus:DI (reg/v/f:DI 97 [ x0 ]) (reg:DI 102))) vs before: cannot propagate from insn 11 into insn 12: would increase complexity of pattern propagating insn 12 into insn 13, replacing: (set (reg/v:VNx64QI 94 [ z0 ]) (unspec:VNx64QI [ (reg/v:VNx16BI 96 [ p0 ]) (mem:VNx64QI (reg/f:DI 99) [0 MEM [(signed char *)_1]+0 S[64, 64] A8]) ] UNSPEC_LDN)) successfully matched this instruction to vec_mask_load_lanesvnx64qivnx16qi: (set (reg/v:VNx64QI 94 [ z0 ]) (unspec:VNx64QI [ (reg/v:VNx16BI 96 [ p0 ]) (mem:VNx64QI (plus:DI (reg/v/f:DI 97 [ x0 ]) (reg:DI 100)) [0 MEM [(signed char *)_1]+0 S[64, 64] A8]) ] UNSPEC_LDN)) All in all this seems like a general problem with earlier optimizations preventing later ones and surely all of those could be fixed individually. Still, the question remains if the general approach is useful or desired or if we not rather prevent more optimizations that we gain. Suggestions welcome. I have some riscv tests for it but didn't attach them yet in order to focus on the main part first. Regards Robin gcc/ChangeLog: * fwprop.cc (fwprop_propagation::profitable_p): Add unary handling. (fwprop_propagation::update_register_pressure): New function. (fwprop_propagation::register_pressure_high_p): New function (reg_single_def_for_src_p): Look through unary expressions. (try_fwprop_subst_pattern): Check register pressure. (forward_propagate_into): Call new function. (fwprop_init): Init register pressure. (fwprop_done): Clean up register pressure. (fwprop_insn): Add comment. --- gcc/fwprop.cc | 307 -- 1 file changed, 300 insertions(+), 7 deletions(-) diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0707a234726..413fe4e7335 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3. If not see #include "tree-pass.h" #include "rtl-iter.h" #include "target.h" +#include "dominance.h" + +#include "ira.h" +#include "regpressure.h" /* This p
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Hi Richard, I did some testing with the attached v2 that does not restrict to UNARY anymore. As feared ;) there is some more fallout that I'm detailing below. On Power there is one guality fail (pr43051-1.c) that I would take the liberty of ignoring for now. On x86 there are four fails: - cond_op_addsubmuldiv__Float16-2.c: assembler error unsupported masking for `vmovsh'. I guess that's a latent backend problem. - ifcvt-3.c, pr49095.c: Here we propagate into a compare. Before, we had (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0). That looks like a costing problem and can hopefully solveable by making the second compare more expensive, preventing the propagation. i386 costing (or every costing?) is brittle so that could well break other things. - pr88873.c: This is interesting because even before this patch we propagated with different register classes (V2DF vs DI). With the patch we check the register pressure, find the class NO_REGS for V2DF and abort (because the patch assumes NO_REGS = high pressure). I'm thinking of keeping the old behavior for reg-reg propagations and only checking the pressure for more complex operations. aarch64 has the most fails: - One guality fail (same as Power). - shrn-combine-[123].c as before. - A class of (hopefully, I only checked some) similar cases where we propagate an unspec_whilelo into an unspec_ptest. Before we would only set a REG_EQUALS note. Before we managed to create a while_ultsivnx16bi_cc whereas now we have while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined. We create redundant whilelos and I'm not sure how to improve that. I guess a peephole is out of the question :) - pred-combine-and.c: Here the new propagation appears useful at first. We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the individual and registers remain live up to the propagation site (while being dead before the patch). With the registers dead, combine could create a single fcmgt before. Now it only manages a 2->2 combination because we still need the registers and end up with two fcmgts. The code is worse but this seems more bad luck than anything. - Addressing fails from before: I looked into these and suspect all of them are a similar. What happens is that we have a poly_int offset that we shift, negate and then add to x0. The result is used as load address. Before, we would pull (combine) the (plus x0 reg) into the load keeping the neg and shift. Now we propagate everything into a single (set (minus x0 offset)). The propagation itself seems worthwhile because we save one insn. However as we got rid of the base/offset split by lumping everything together, combine cannot pull the (plus) into the address load and we require an aarch64_split_add_offset. This will emit the longer sequence of ashiftl and subtract. The "base" address is x0 here so we cannot convert (minus x0 ...)) into neg. I didn't go through all of aarch64_split_add_offset. I suppose we could re-add the separation of base/offset there but that might be a loss when the result is not used as an address. Again, all in all no fatal problems but pretty annoying :) It's not much but just gradually worse than with just UNARY. Any idea on how/whether to continue? Regards Robin gcc/ChangeLog: * fwprop.cc (fwprop_propagation::profitable_p): Add unary handling. (fwprop_propagation::update_register_pressure): New function. (fwprop_propagation::register_pressure_high_p): New function (reg_single_def_for_src_p): Look through unary expressions. (try_fwprop_subst_pattern): Check register pressure. (forward_propagate_into): Call new function. (fwprop_init): Init register pressure. (fwprop_done): Clean up register pressure. (fwprop_insn): Add comment. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. --- gcc/fwprop.cc | 359 +- .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c | 64 2 files changed, 419 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0707a234726..ce6f5a74b00 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3. If not see #include "tree-pass.h" #include "rtl-iter.h" #include "target.h" +#include "dominance.h" + +#include "ira.h" +#include "regpressure.h" /* This pass does simple forward propagation and simplification when an operand of an insn can only come from a single def. This pass uses @@ -103,6 +107,10 @@ using namespace rtl_ssa; static int num_changes; +/* Keep track of which registers already increased the pressure to avoid double + booking. *
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Robin Dapp writes: > Hi Richard, > > I did some testing with the attached v2 that does not restrict to UNARY > anymore. As feared ;) there is some more fallout that I'm detailing below. > > On Power there is one guality fail (pr43051-1.c) that I would take > the liberty of ignoring for now. > > On x86 there are four fails: > > - cond_op_addsubmuldiv__Float16-2.c: assembler error >unsupported masking for `vmovsh'. I guess that's a latent backend >problem. > > - ifcvt-3.c, pr49095.c: Here we propagate into a compare. Before, we had >(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0). >That looks like a costing problem and can hopefully solveable by making >the second compare more expensive, preventing the propagation. >i386 costing (or every costing?) is brittle so that could well break other >things. > > - pr88873.c: This is interesting because even before this patch we >propagated with different register classes (V2DF vs DI). With the patch >we check the register pressure, find the class NO_REGS for V2DF and >abort (because the patch assumes NO_REGS = high pressure). I'm thinking >of keeping the old behavior for reg-reg propagations and only checking >the pressure for more complex operations. > > aarch64 has the most fails: > > - One guality fail (same as Power). > - shrn-combine-[123].c as before. > > - A class of (hopefully, I only checked some) similar cases where we >propagate an unspec_whilelo into an unspec_ptest. Before we would only >set a REG_EQUALS note. >Before we managed to create a while_ultsivnx16bi_cc whereas now we have >while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined. >We create redundant whilelos and I'm not sure how to improve that. I >guess a peephole is out of the question :) > > - pred-combine-and.c: Here the new propagation appears useful at first. >We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the >individual and registers remain live up to the propagation site (while >being dead before the patch). >With the registers dead, combine could create a single fcmgt before. >Now it only manages a 2->2 combination because we still need the registers >and end up with two fcmgts. >The code is worse but this seems more bad luck than anything. > > - Addressing fails from before: I looked into these and suspect all of >them are a similar. >What happens is that we have a poly_int offset that we shift, negate >and then add to x0. The result is used as load address. >Before, we would pull (combine) the (plus x0 reg) into the load keeping >the neg and shift. >Now we propagate everything into a single (set (minus x0 offset)). >The propagation itself seems worthwhile because we save one insn. >However as we got rid of the base/offset split by lumping everything >together, combine cannot pull the (plus) into the address load and >we require an aarch64_split_add_offset. This will emit the longer >sequence of ashiftl and subtract. The "base" address is x0 here so >we cannot convert (minus x0 ...)) into neg. >I didn't go through all of aarch64_split_add_offset. I suppose we >could re-add the separation of base/offset there but that might be >a loss when the result is not used as an address. > > Again, all in all no fatal problems but pretty annoying :) It's not much > but just gradually worse than with just UNARY. Any idea on how/whether to > continue? Thanks for giving it a go. Can you post the latest version of the regpressure patch too? The previous on-list version I could find seems to be too old. Thanks, Richard > Regards > Robin > > gcc/ChangeLog: > > * fwprop.cc (fwprop_propagation::profitable_p): Add unary > handling. > (fwprop_propagation::update_register_pressure): New function. > (fwprop_propagation::register_pressure_high_p): New function > (reg_single_def_for_src_p): Look through unary expressions. > (try_fwprop_subst_pattern): Check register pressure. > (forward_propagate_into): Call new function. > (fwprop_init): Init register pressure. > (fwprop_done): Clean up register pressure. > (fwprop_insn): Add comment. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. > --- > gcc/fwprop.cc | 359 +- > .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c | 64 > 2 files changed, 419 insertions(+), 4 deletions(-) > create mode 100644 > gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c > > diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc > index 0707a234726..ce6f5a74b00 100644 > --- a/gcc/fwprop.cc > +++ b/gcc/fwprop.cc > @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3. If not see > #include "tree-pass.h" > #include "rtl-iter.h" > #include "target.h" > +#include "dominance.h"
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
> Thanks for giving it a go. Can you post the latest version of the > regpressure patch too? The previous on-list version I could find > seems to be too old. Oh, sure, attached. Apologies, I added the regpressure_same_class convenience helper but forgot to re-send it. Regards Robin >From d3f87e4de7d7d05a2fcf8c948097b14eadf08c90 Mon Sep 17 00:00:00 2001 From: Robin Dapp Date: Mon, 24 Jul 2023 16:25:38 +0200 Subject: [PATCH] gcse: Extract reg pressure handling into separate file. This patch extracts the hoist-pressure handling from gcse and puts it into a separate file so it can be used by other passes in the future. No functional change. gcc/ChangeLog: * Makefile.in: Add regpressure.o. * gcse.cc (struct bb_data): Move to regpressure.cc. (BB_DATA): Ditto. (get_regno_pressure_class): Ditto. (get_pressure_class_and_nregs): Ditto. (record_set_data): Ditto. (update_bb_reg_pressure): Ditto. (should_hoist_expr_to_dom): Ditto. (hoist_code): Ditto. (change_pressure): Ditto. (calculate_bb_reg_pressure): Ditto. (one_code_hoisting_pass): Ditto. * gcse.h (single_set_gcse): Export single_set_gcse. * regpressure.cc: New file. * regpressure.h: New file. --- gcc/Makefile.in| 1 + gcc/gcse.cc| 304 ++- gcc/gcse.h | 2 + gcc/regpressure.cc | 391 + gcc/regpressure.h | 48 ++ 5 files changed, 459 insertions(+), 287 deletions(-) create mode 100644 gcc/regpressure.cc create mode 100644 gcc/regpressure.h diff --git a/gcc/Makefile.in b/gcc/Makefile.in index 5930b52462a..62768a84f81 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1610,6 +1610,7 @@ OBJS = \ reg-stack.o \ regcprop.o \ reginfo.o \ + regpressure.o \ regrename.o \ regstat.o \ reload.o \ diff --git a/gcc/gcse.cc b/gcc/gcse.cc index f689c0c2687..5bafef7970f 100644 --- a/gcc/gcse.cc +++ b/gcc/gcse.cc @@ -160,6 +160,7 @@ along with GCC; see the file COPYING3. If not see #include "gcse.h" #include "gcse-common.h" #include "function-abi.h" +#include "regpressure.h" /* We support GCSE via Partial Redundancy Elimination. PRE optimizations are a superset of those done by classic GCSE. @@ -419,30 +420,6 @@ static bool doing_code_hoisting_p = false; /* For available exprs */ static sbitmap *ae_kill; -/* Data stored for each basic block. */ -struct bb_data -{ - /* Maximal register pressure inside basic block for given register class - (defined only for the pressure classes). */ - int max_reg_pressure[N_REG_CLASSES]; - /* Recorded register pressure of basic block before trying to hoist - an expression. Will be used to restore the register pressure - if the expression should not be hoisted. */ - int old_pressure; - /* Recorded register live_in info of basic block during code hoisting - process. BACKUP is used to record live_in info before trying to - hoist an expression, and will be used to restore LIVE_IN if the - expression should not be hoisted. */ - bitmap live_in, backup; -}; - -#define BB_DATA(bb) ((struct bb_data *) (bb)->aux) - -static basic_block curr_bb; - -/* Current register pressure for each pressure class. */ -static int curr_reg_pressure[N_REG_CLASSES]; - static void compute_can_copy (void); static void *gmalloc (size_t) ATTRIBUTE_MALLOC; @@ -494,8 +471,6 @@ static bool should_hoist_expr_to_dom (basic_block, struct gcse_expr *, enum reg_class, int *, bitmap, rtx_insn *); static bool hoist_code (void); -static enum reg_class get_regno_pressure_class (int regno, int *nregs); -static enum reg_class get_pressure_class_and_nregs (rtx_insn *insn, int *nregs); static bool one_code_hoisting_pass (void); static rtx_insn *process_insert_insn (struct gcse_expr *); static bool pre_edge_insert (struct edge_list *, struct gcse_expr **); @@ -2402,7 +2377,7 @@ record_set_data (rtx dest, const_rtx set, void *data) } } -static const_rtx +const_rtx single_set_gcse (rtx_insn *insn) { struct set_data s; @@ -2804,72 +2779,6 @@ compute_code_hoist_data (void) fprintf (dump_file, "\n"); } -/* Update register pressure for BB when hoisting an expression from - instruction FROM, if live ranges of inputs are shrunk. Also - maintain live_in information if live range of register referred - in FROM is shrunk. - - Return 0 if register pressure doesn't change, otherwise return - the number by which register pressure is decreased. - - NOTE: Register pressure won't be increased in this function. */ - -static int -update_bb_reg_pressure (basic_block bb, rtx_insn *from) -{ - rtx dreg; - rtx_insn *insn; - basic_block succ_bb; - df_ref use, op_ref; - edge succ; - edge_iterator ei; - int decreased_pressure = 0; - int nregs; -
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Robin Dapp writes: > Hi Richard, > > I did some testing with the attached v2 that does not restrict to UNARY > anymore. As feared ;) there is some more fallout that I'm detailing below. > > On Power there is one guality fail (pr43051-1.c) that I would take > the liberty of ignoring for now. > > On x86 there are four fails: > > - cond_op_addsubmuldiv__Float16-2.c: assembler error >unsupported masking for `vmovsh'. I guess that's a latent backend >problem. > > - ifcvt-3.c, pr49095.c: Here we propagate into a compare. Before, we had >(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0). >That looks like a costing problem and can hopefully solveable by making >the second compare more expensive, preventing the propagation. >i386 costing (or every costing?) is brittle so that could well break other >things. > > - pr88873.c: This is interesting because even before this patch we >propagated with different register classes (V2DF vs DI). With the patch >we check the register pressure, find the class NO_REGS for V2DF and >abort (because the patch assumes NO_REGS = high pressure). I'm thinking >of keeping the old behavior for reg-reg propagations and only checking >the pressure for more complex operations. > > aarch64 has the most fails: > > - One guality fail (same as Power). > - shrn-combine-[123].c as before. > > - A class of (hopefully, I only checked some) similar cases where we >propagate an unspec_whilelo into an unspec_ptest. Before we would only >set a REG_EQUALS note. >Before we managed to create a while_ultsivnx16bi_cc whereas now we have >while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined. >We create redundant whilelos and I'm not sure how to improve that. I >guess a peephole is out of the question :) Yeah, I think this is potentially a blocker for propagating A into B when A is used elsewhere. Combine is able to combine A and B while keeping A in parallel with the result. I think either fwprop would need to try that too, or it would need to be restricted to cases where A is only used in B. I don't think that's really a UNARY_P-or-not thing. The same problem would in principle apply to plain unary operations. > - pred-combine-and.c: Here the new propagation appears useful at first. I'm not sure about that, because... >We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the >individual and registers remain live up to the propagation site (while >being dead before the patch). >With the registers dead, combine could create a single fcmgt before. >Now it only manages a 2->2 combination because we still need the registers >and end up with two fcmgts. >The code is worse but this seems more bad luck than anything. ...the fwprop1 change is: (insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ]) (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ]) (reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3} (nil)) ... (insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ]) (and:VNx4BI (reg:VNx4BI 118 [ mask__25.17 ]) (reg:VNx4BI 102 [ vec_mask_and_64 ]))) 8420 {andvnx4bi3} (nil)) to: (insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ]) (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ]) (reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3} (expr_list:REG_DEAD (reg:VNx4BI 116 [ mask__30.13 ]) (expr_list:REG_DEAD (reg:VNx4BI 98 [ loop_mask_58 ]) (nil ... (insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ]) (and:VNx4BI (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ]) (reg:VNx4BI 98 [ loop_mask_58 ])) (reg:VNx4BI 118 [ mask__25.17 ]))) 8428 {aarch64_pred_andvnx4bi_z} (nil)) On its own this isn't worse. But it's also not a win. The before and after sequences have equal cost, but the after sequence is more complex. That would probably be OK for something that runs near the end of the pre-RA pipeline, since it could in principle increase parallelism. But it's probably a bad idea when we know that the main instruction combination pass is still to run. By making insn 31 more complex, we're making it a less likely combination candidate. So this isn't necesarily the wrong thing to do. But I think it is the wrong time to do it. > - Addressing fails from before: I looked into these and suspect all of >them are a similar. >What happens is that we have a poly_int offset that we shift, negate >and then add to x0. The result is used as load address. >Before, we would pull (combine) the (plus x0 reg) into the load keeping >the neg and shift. >Now we propagate everything into a single (set (minus x0 offset)). >The propagation itself seems worthwhile because we save one insn. >However as we got rid of the base/offset split by lumping everything >together, combine cannot pull the (plus) into the address
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Thanks for looking at it in detail. > Yeah, I think this is potentially a blocker for propagating A into B > when A is used elsewhere. Combine is able to combine A and B while > keeping A in parallel with the result. I think either fwprop would > need to try that too, or it would need to be restricted to cases where A > is only used in B. That seems a rather severe limitation and my original use case would not get optimized considerably anymore. The intention was to replace all uses (if register pressure allows). Of course the example is simple enough that a propagation is always useful if the costs allow it, so it might not be representative. I'm wondering if we could (my original misunderstanding) tentatively try to propagate into all uses of a definition and, when reaching a certain ratio, decide that it might be worth it, otherwise revert. Would be very crude though, and not driven by the actual problem we're trying to avoid. > I think the summary is: > > IMO, we have to be mindful that combine is still to run. We need to > avoid making equal-cost changes if the new form is more complex, or > otherwise likely to interfere with combine. I guess we don't have a good measure for complexity or "combinability" and even lower-cost changes could result in worse options later. Would it make sense to have a strict less-than cost policy for those more complex propagations? Or do you consider the approach in its current shape "hopeless", given the complications we discussed? > Alternatively, we could delay the optimisation until after combine > and have freer rein, since we're then just mopping up opportunities > that other passes left behind. > > A while back I was experimenting with a second combine pass. That was > the original motiviation for rtl-ssa. I never got chance to finish it > off though. This doesn't sound like something that would still materialize before the end of stage 1 :) Do you see any way of restricting the current approach to make it less intrusive and still worthwhile? Limiting to vec_duplicate might be much too arbitrary but would still help for my original example. Regards Robin
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Ping. I refined the code and some comments a bit and added a test case. My question in general would still be: Is this something we want given that we potentially move some of combine's work a bit towards the front of the RTL pipeline? Regards Robin Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure. This patch enables the forwarding of UNARY_P sources. As this involves potentially replacing a vector register with a scalar register the ira_hoist_pressure machinery is used to calculate the change in register pressure. If the propagation would increase the pressure beyond the number of hard regs, we don't perform it. gcc/ChangeLog: * fwprop.cc (fwprop_propagation::profitable_p): Add unary handling. (fwprop_propagation::update_register_pressure): New function. (fwprop_propagation::register_pressure_high_p): New function (reg_single_def_for_src_p): Look through unary expressions. (try_fwprop_subst_pattern): Check register pressure. (forward_propagate_into): Call new function. (fwprop_init): Init register pressure. (fwprop_done): Clean up register pressure. (fwprop_insn): Add comment. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. --- gcc/fwprop.cc | 314 +- .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c | 64 2 files changed, 371 insertions(+), 7 deletions(-) create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc index 0707a234726..b49d4e4ced4 100644 --- a/gcc/fwprop.cc +++ b/gcc/fwprop.cc @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3. If not see #include "tree-pass.h" #include "rtl-iter.h" #include "target.h" +#include "dominance.h" + +#include "ira.h" +#include "regpressure.h" /* This pass does simple forward propagation and simplification when an operand of an insn can only come from a single def. This pass uses @@ -103,6 +107,10 @@ using namespace rtl_ssa; static int num_changes; +/* Keep track of which registers already increased the pressure to avoid double + booking. */ +sbitmap pressure_accounted; + /* Do not try to replace constant addresses or addresses of local and argument slots. These MEM expressions are made only once and inserted in many instructions, as well as being used to control symbol table @@ -181,6 +189,8 @@ namespace bool changed_mem_p () const { return result_flags & CHANGED_MEM; } bool folded_to_constants_p () const; bool profitable_p () const; +bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const; +bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const; bool check_mem (int, rtx) final override; void note_simplification (int, uint16_t, rtx, rtx) final override; @@ -332,25 +342,247 @@ fwprop_propagation::profitable_p () const && (result_flags & PROFITABLE)) return true; - if (REG_P (to)) + /* Only continue with an unary operation if we consider register + pressure. */ + rtx what = copy_rtx (to); + if (UNARY_P (what) && flag_ira_hoist_pressure) +what = XEXP (what, 0); + + if (REG_P (what)) return true; - if (GET_CODE (to) == SUBREG - && REG_P (SUBREG_REG (to)) - && !paradoxical_subreg_p (to)) + if (GET_CODE (what) == SUBREG + && REG_P (SUBREG_REG (what)) + && !paradoxical_subreg_p (what)) return true; - if (CONSTANT_P (to)) + if (CONSTANT_P (what)) return true; return false; } -/* Check that X has a single def. */ +/* Check if the register pressure in any predecessor block of USE's block + until DEF's block is equal or higher to the number of hardregs in NU's + register class. */ +bool +fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def, + rtx_insn *use) const +{ + enum reg_class nu_class, old_class; + int nu_nregs, old_nregs; + nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs); + old_class += regpressure_get_regno_pressure_class (REGNO (old), &old_nregs); + + if (nu_class == NO_REGS && old_class == NO_REGS) +return true; + + if (nu_class == old_class) +return false; + + basic_block bbfrom = BLOCK_FOR_INSN (def); + basic_block bbto = BLOCK_FOR_INSN (use); + + basic_block bb; + + sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun)); + bitmap_clear (visited); + auto_vec q; + q.safe_push (bbto); + + while (!q.is_empty ()) +{ + bb = q.pop (); + + if (bitmap_bit_p (visited, bb->index)) + continue; + + /* Nothing to do if the register to be replaced is not live +in this BB. */ + if (bb !=
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote: Ping. I refined the code and some comments a bit and added a test case. My question in general would still be: Is this something we want given that we potentially move some of combine's work a bit towards the front of the RTL pipeline? Regards Robin Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure. This patch enables the forwarding of UNARY_P sources. As this involves potentially replacing a vector register with a scalar register the ira_hoist_pressure machinery is used to calculate the change in register pressure. If the propagation would increase the pressure beyond the number of hard regs, we don't perform it. gcc/ChangeLog: * fwprop.cc (fwprop_propagation::profitable_p): Add unary handling. (fwprop_propagation::update_register_pressure): New function. (fwprop_propagation::register_pressure_high_p): New function (reg_single_def_for_src_p): Look through unary expressions. (try_fwprop_subst_pattern): Check register pressure. (forward_propagate_into): Call new function. (fwprop_init): Init register pressure. (fwprop_done): Clean up register pressure. (fwprop_insn): Add comment. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. So I was hoping that Richard S. would chime in here as he knows this code better than anyone. This looks like a much better implementation of something I've done before :-) Basically imagine a target where a sign/zero extension can be folded into arithmetic for free. We put in various hacks to this code to encourage more propagations of extensions. I still think this is valuable. As we lower from gimple->RTL we're going to still have artifacts in the RTL that we're going to want to optimize away. fwprop has certain advantages over combine, including the fact that it runs earlier, pre-loop. It looks generally sensible to me. But give Richard S. another week to chime in. He seems to be around, but may be slammed with stuff right now. jeff
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Jeff Law writes: > On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote: >> Ping. I refined the code and some comments a bit and added a test >> case. >> >> My question in general would still be: Is this something we want >> given that we potentially move some of combine's work a bit towards >> the front of the RTL pipeline? >> >> Regards >> Robin >> >> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure. >> >> This patch enables the forwarding of UNARY_P sources. As this >> involves potentially replacing a vector register with a scalar register >> the ira_hoist_pressure machinery is used to calculate the change in >> register pressure. If the propagation would increase the pressure >> beyond the number of hard regs, we don't perform it. >> >> gcc/ChangeLog: >> >> * fwprop.cc (fwprop_propagation::profitable_p): Add unary >> handling. >> (fwprop_propagation::update_register_pressure): New function. >> (fwprop_propagation::register_pressure_high_p): New function >> (reg_single_def_for_src_p): Look through unary expressions. >> (try_fwprop_subst_pattern): Check register pressure. >> (forward_propagate_into): Call new function. >> (fwprop_init): Init register pressure. >> (fwprop_done): Clean up register pressure. >> (fwprop_insn): Add comment. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. > So I was hoping that Richard S. would chime in here as he knows this > code better than anyone. Heh, I'm not sure about that. I rewrote the code to use rtl-ssa, so in that sense I'm OK with the framework side. But I tried to preserve the decisions that the old pass made as closely as possible. I don't know why most of those decisions were made (which is why I just kept them). So I don't think I have a good feel for the advantages and disadvantages of doing this. Robin's analysis of the aarch64 changes was nice and detailed though. I think the one that worries me most is the addressing mode one. fwprop is probably the first chance we get to propagate adds into addresses, and virtual register elimination means that some of those opportunities won't show up in gimple. There again, virtual register elimination wouldn't be the reason for the ld4_s8.c failure. Perhaps there's something missing in expand. Other than that, I think my main question is: why just unary operations? Is the underlying assumption that we only want to propagate a maximum of one register? If so, then I think we should check for that directly, by iterating over subrtxes. That way we can handle things like binary operations involving a register and a constant, and unspecs with a single non-constant operand. I imagine the check would be something like: unsigned int nregs = 0; for (each subrtx x) { if (MEM_P (x)) return false; if (SUBREG_P (x) && .../*current conditions */...) return false; if (REG_P (x)) { nregs += 1; if (nregs > 1) return false; } } return true; Perhaps we should allow the optimisation without register-pressure information if (a) the source register and destination register are in the same pressure class and (b) all uses of the destination are being replaced. (FWIW, rtl-ssa should make it easier to try to replace all definitions at once, with an all-or-nothing choice, if we ever wanted to do that.) Thanks, Richard > > This looks like a much better implementation of something I've done > before :-) Basically imagine a target where a sign/zero extension can > be folded into arithmetic for free. We put in various hacks to this > code to encourage more propagations of extensions. > > I still think this is valuable. As we lower from gimple->RTL we're > going to still have artifacts in the RTL that we're going to want to > optimize away. fwprop has certain advantages over combine, including > the fact that it runs earlier, pre-loop. > > > It looks generally sensible to me. But give Richard S. another week to > chime in. He seems to be around, but may be slammed with stuff right now. > > jeff
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
> So I don't think I have a good feel for the advantages and disadvantages > of doing this. Robin's analysis of the aarch64 changes was nice and > detailed though. I think the one that worries me most is the addressing > mode one. fwprop is probably the first chance we get to propagate adds > into addresses, and virtual register elimination means that some of > those opportunities won't show up in gimple. > > There again, virtual register elimination wouldn't be the reason for > the ld4_s8.c failure. Perhaps there's something missing in expand. > > Other than that, I think my main question is: why just unary operations? > Is the underlying assumption that we only want to propagate a maximum of > one register? If so, then I think we should check for that directly, by > iterating over subrtxes. The main reason for stopping at unary operations was to limit the scope and change as little as possible (not restricting the change to one register). I'm currently testing a v2 that iterates over subrtxs. > Perhaps we should allow the optimisation without register-pressure > information if (a) the source register and destination register are > in the same pressure class and (b) all uses of the destination are > being replaced. (FWIW, rtl-ssa should make it easier to try to > replace all definitions at once, with an all-or-nothing choice, > if we ever wanted to do that.) I presume you're referring to replacing one register (dest) in all using insns? Source and destination are somewhat overloaded in fwprop context because I'm thinking of the "to be replaced" register as dest when it's actually the replacement register. AFAICT fwprop currently iterates over insns, going through all their uses and trying if an individual use can be substituted. Do you suggest to change this general iteration order to iterate over the defs of an insn and then try to replace all the uses at once (e.g. using ssa->change_insns)? When keeping the current order, wouldn't we need to store all potential changes instead of committing them and later apply them in bulk, e.g. grouped by use? This order would also help to pick the propagation with the most number of uses (i.e. propagation potential) but maybe I'm misunderstanding? Regards Robin
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Robin Dapp writes: >> So I don't think I have a good feel for the advantages and disadvantages >> of doing this. Robin's analysis of the aarch64 changes was nice and >> detailed though. I think the one that worries me most is the addressing >> mode one. fwprop is probably the first chance we get to propagate adds >> into addresses, and virtual register elimination means that some of >> those opportunities won't show up in gimple. >> >> There again, virtual register elimination wouldn't be the reason for >> the ld4_s8.c failure. Perhaps there's something missing in expand. >> >> Other than that, I think my main question is: why just unary operations? >> Is the underlying assumption that we only want to propagate a maximum of >> one register? If so, then I think we should check for that directly, by >> iterating over subrtxes. > > The main reason for stopping at unary operations was to limit the scope > and change as little as possible (not restricting the change to one > register). I'm currently testing a v2 that iterates over subrtxs. Thanks. Definitely no problem with doing things in small steps, but IMO it's better if each choice of step can still be justified in its own terms. >> Perhaps we should allow the optimisation without register-pressure >> information if (a) the source register and destination register are >> in the same pressure class and (b) all uses of the destination are >> being replaced. (FWIW, rtl-ssa should make it easier to try to >> replace all definitions at once, with an all-or-nothing choice, >> if we ever wanted to do that.) > > I presume you're referring to replacing one register (dest) in all using > insns? Source and destination are somewhat overloaded in fwprop context > because I'm thinking of the "to be replaced" register as dest when it's > actually the replacement register. Yeah. > AFAICT fwprop currently iterates over insns, going through all their uses > and trying if an individual use can be substituted. Do you suggest to > change this general iteration order to iterate over the defs of an insn > and then try to replace all the uses at once (e.g. using ssa->change_insns)? No, I was just noting in passing that we could try do that if we wanted to. The current code is a fairly mechanical conversion of the original DF-based code, but there's no reason why it has to continue to work the way it does now. > When keeping the current order, wouldn't we need to store all potential > changes instead of committing them and later apply them in bulk, e.g. > grouped by use? This order would also help to pick the propagation > with the most number of uses (i.e. propagation potential) but maybe > I'm misunderstanding? I imagine doing it in reverse postorder would still make sense. But my point was that, for the current fwprop limitation of substituting into exactly one use of a register, we can check whether that use is the *only* use of register. I.e. if we substitute: A: (set (reg R1) (foo (reg R2))) into: B: (set ... (reg R1) ...) if R1 and R2 are likely to be in the same register class, and if B is the only user of R2, then we don't need to calculate register pressure. The change is either neutral (if R2 died in A) or an improvement (if R2 doesn't die in A, and so R1 and R2 were previously live at the same time). Thanks, Richard
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
> I imagine doing it in reverse postorder would still make sense. > > But my point was that, for the current fwprop limitation of substituting > into exactly one use of a register, we can check whether that use is > the *only* use of register. > > I.e. if we substitute: > > A: (set (reg R1) (foo (reg R2))) > > into: > > B: (set ... (reg R1) ...) > > if R1 and R2 are likely to be in the same register class, and if B > is the only user of R2, then we don't need to calculate register > pressure. The change is either neutral (if R2 died in A) or an > improvement (if R2 doesn't die in A, and so R1 and R2 were previously > live at the same time). Ah, understood, thanks. Sure, that one I can include. Regards Robin
Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.
Robin Dapp via Gcc-patches writes: > Thanks for looking at it in detail. > >> Yeah, I think this is potentially a blocker for propagating A into B >> when A is used elsewhere. Combine is able to combine A and B while >> keeping A in parallel with the result. I think either fwprop would >> need to try that too, or it would need to be restricted to cases where A >> is only used in B. > > That seems a rather severe limitation and my original use case would > not get optimized considerably anymore. The intention was to replace > all uses (if register pressure allows). Of course the example is simple > enough that a propagation is always useful if the costs allow it, so > it might not be representative. > > I'm wondering if we could (my original misunderstanding) tentatively > try to propagate into all uses of a definition and, when reaching > a certain ratio, decide that it might be worth it, otherwise revert. > Would be very crude though, and not driven by the actual problem we're > trying to avoid. > >> I think the summary is: >> >> IMO, we have to be mindful that combine is still to run. We need to >> avoid making equal-cost changes if the new form is more complex, or >> otherwise likely to interfere with combine. > > I guess we don't have a good measure for complexity or "combinability" > and even lower-cost changes could result in worse options later. > Would it make sense to have a strict less-than cost policy for those > more complex propagations? Or do you consider the approach in its > current shape "hopeless", given the complications we discussed? > >> Alternatively, we could delay the optimisation until after combine >> and have freer rein, since we're then just mopping up opportunities >> that other passes left behind. >> >> A while back I was experimenting with a second combine pass. That was >> the original motiviation for rtl-ssa. I never got chance to finish it >> off though. > > This doesn't sound like something that would still materialize before > the end of stage 1 :) > Do you see any way of restricting the current approach to make it less > intrusive and still worthwhile? Limiting to vec_duplicate might be > much too arbitrary but would still help for my original example. FWIW, I sent an RFC for a late-combine pass that might help: https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631406.html I think it'll need some tweaking for your use case, but hopefully it's "just" a case of expanding the register pressure tests. Thanks, Richard