Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-26 Thread Richard Sandiford
Robin Dapp via Gcc-patches  writes:
> Thanks for looking at it in detail.
>
>> Yeah, I think this is potentially a blocker for propagating A into B
>> when A is used elsewhere.  Combine is able to combine A and B while
>> keeping A in parallel with the result.  I think either fwprop would
>> need to try that too, or it would need to be restricted to cases where A
>> is only used in B.
>
> That seems a rather severe limitation and my original use case would
> not get optimized considerably anymore.  The intention was to replace
> all uses (if register pressure allows).  Of course the example is simple
> enough that a propagation is always useful if the costs allow it, so
> it might not be representative.
>
> I'm wondering if we could (my original misunderstanding) tentatively
> try to propagate into all uses of a definition and, when reaching
> a certain ratio, decide that it might be worth it, otherwise revert.
> Would be very crude though, and not driven by the actual problem we're
> trying to avoid. 
>
>> I think the summary is:
>> 
>> IMO, we have to be mindful that combine is still to run.  We need to
>> avoid making equal-cost changes if the new form is more complex, or
>> otherwise likely to interfere with combine.
>
> I guess we don't have a good measure for complexity or "combinability"
> and even lower-cost changes could result in worse options later.
> Would it make sense to have a strict less-than cost policy for those
> more complex propagations?  Or do you consider the approach in its
> current shape "hopeless", given the complications we discussed?
>
>> Alternatively, we could delay the optimisation until after combine
>> and have freer rein, since we're then just mopping up opportunities
>> that other passes left behind.
>> 
>> A while back I was experimenting with a second combine pass.  That was
>> the original motiviation for rtl-ssa.  I never got chance to finish it
>> off though.
>
> This doesn't sound like something that would still materialize before
> the end of stage 1 :)
> Do you see any way of restricting the current approach to make it less
> intrusive and still worthwhile?  Limiting to vec_duplicate might be
> much too arbitrary but would still help for my original example.

FWIW, I sent an RFC for a late-combine pass that might help:

  https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631406.html

I think it'll need some tweaking for your use case, but hopefully
it's "just" a case of expanding the register pressure tests.

Thanks,
Richard



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Robin Dapp via Gcc-patches
Thanks for looking at it in detail.

> Yeah, I think this is potentially a blocker for propagating A into B
> when A is used elsewhere.  Combine is able to combine A and B while
> keeping A in parallel with the result.  I think either fwprop would
> need to try that too, or it would need to be restricted to cases where A
> is only used in B.

That seems a rather severe limitation and my original use case would
not get optimized considerably anymore.  The intention was to replace
all uses (if register pressure allows).  Of course the example is simple
enough that a propagation is always useful if the costs allow it, so
it might not be representative.

I'm wondering if we could (my original misunderstanding) tentatively
try to propagate into all uses of a definition and, when reaching
a certain ratio, decide that it might be worth it, otherwise revert.
Would be very crude though, and not driven by the actual problem we're
trying to avoid. 

> I think the summary is:
> 
> IMO, we have to be mindful that combine is still to run.  We need to
> avoid making equal-cost changes if the new form is more complex, or
> otherwise likely to interfere with combine.

I guess we don't have a good measure for complexity or "combinability"
and even lower-cost changes could result in worse options later.
Would it make sense to have a strict less-than cost policy for those
more complex propagations?  Or do you consider the approach in its
current shape "hopeless", given the complications we discussed?

> Alternatively, we could delay the optimisation until after combine
> and have freer rein, since we're then just mopping up opportunities
> that other passes left behind.
> 
> A while back I was experimenting with a second combine pass.  That was
> the original motiviation for rtl-ssa.  I never got chance to finish it
> off though.

This doesn't sound like something that would still materialize before
the end of stage 1 :)
Do you see any way of restricting the current approach to make it less
intrusive and still worthwhile?  Limiting to vec_duplicate might be
much too arbitrary but would still help for my original example.

Regards
 Robin



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>That looks like a costing problem and can hopefully solveable by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)

Yeah, I think this is potentially a blocker for propagating A into B
when A is used elsewhere.  Combine is able to combine A and B while
keeping A in parallel with the result.  I think either fwprop would
need to try that too, or it would need to be restricted to cases where A
is only used in B.

I don't think that's really a UNARY_P-or-not thing.  The same problem
would in principle apply to plain unary operations.

>  - pred-combine-and.c: Here the new propagation appears useful at first.

I'm not sure about that, because...

>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.

...the fwprop1 change is:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (nil))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (reg:VNx4BI 118 [ mask__25.17 ])
(reg:VNx4BI 102 [ vec_mask_and_64 ]))) 8420 {andvnx4bi3}
 (nil))

to:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (expr_list:REG_DEAD (reg:VNx4BI 116 [ mask__30.13 ])
(expr_list:REG_DEAD (reg:VNx4BI 98 [ loop_mask_58 ])
(nil
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))
(reg:VNx4BI 118 [ mask__25.17 ]))) 8428 {aarch64_pred_andvnx4bi_z}
 (nil))

On its own this isn't worse.  But it's also not a win.  The before and
after sequences have equal cost, but the after sequence is more complex.

That would probably be OK for something that runs near the end of
the pre-RA pipeline, since it could in principle increase parallelism.
But it's probably a bad idea when we know that the main instruction
combination pass is still to run.  By making insn 31 more complex,
we're making it a less likely combination candidate.

So this isn't necesarily the wrong thing to do.  But I think it is
the wrong time to do it.

>  - Addressing fails from before:  I looked into these and suspect all of
>them are a similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Robin Dapp via Gcc-patches
> Thanks for giving it a go.  Can you post the latest version of the
> regpressure patch too?  The previous on-list version I could find
> seems to be too old.

Oh, sure, attached.  Apologies, I added the regpressure_same_class
convenience helper but forgot to re-send it.

Regards
 Robin

>From d3f87e4de7d7d05a2fcf8c948097b14eadf08c90 Mon Sep 17 00:00:00 2001
From: Robin Dapp 
Date: Mon, 24 Jul 2023 16:25:38 +0200
Subject: [PATCH] gcse: Extract reg pressure handling into separate file.

This patch extracts the hoist-pressure handling from gcse and puts it
into a separate file so it can be used by other passes in the future.
No functional change.

gcc/ChangeLog:

* Makefile.in: Add regpressure.o.
* gcse.cc (struct bb_data): Move to regpressure.cc.
(BB_DATA): Ditto.
(get_regno_pressure_class): Ditto.
(get_pressure_class_and_nregs): Ditto.
(record_set_data): Ditto.
(update_bb_reg_pressure): Ditto.
(should_hoist_expr_to_dom): Ditto.
(hoist_code): Ditto.
(change_pressure): Ditto.
(calculate_bb_reg_pressure): Ditto.
(one_code_hoisting_pass): Ditto.
* gcse.h (single_set_gcse): Export single_set_gcse.
* regpressure.cc: New file.
* regpressure.h: New file.
---
 gcc/Makefile.in|   1 +
 gcc/gcse.cc| 304 ++-
 gcc/gcse.h |   2 +
 gcc/regpressure.cc | 391 +
 gcc/regpressure.h  |  48 ++
 5 files changed, 459 insertions(+), 287 deletions(-)
 create mode 100644 gcc/regpressure.cc
 create mode 100644 gcc/regpressure.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5930b52462a..62768a84f81 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1610,6 +1610,7 @@ OBJS = \
reg-stack.o \
regcprop.o \
reginfo.o \
+   regpressure.o \
regrename.o \
regstat.o \
reload.o \
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index f689c0c2687..5bafef7970f 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -160,6 +160,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gcse.h"
 #include "gcse-common.h"
 #include "function-abi.h"
+#include "regpressure.h"
 
 /* We support GCSE via Partial Redundancy Elimination.  PRE optimizations
are a superset of those done by classic GCSE.
@@ -419,30 +420,6 @@ static bool doing_code_hoisting_p = false;
 /* For available exprs */
 static sbitmap *ae_kill;
 
-/* Data stored for each basic block.  */
-struct bb_data
-{
-  /* Maximal register pressure inside basic block for given register class
- (defined only for the pressure classes).  */
-  int max_reg_pressure[N_REG_CLASSES];
-  /* Recorded register pressure of basic block before trying to hoist
- an expression.  Will be used to restore the register pressure
- if the expression should not be hoisted.  */
-  int old_pressure;
-  /* Recorded register live_in info of basic block during code hoisting
- process.  BACKUP is used to record live_in info before trying to
- hoist an expression, and will be used to restore LIVE_IN if the
- expression should not be hoisted.  */
-  bitmap live_in, backup;
-};
-
-#define BB_DATA(bb) ((struct bb_data *) (bb)->aux)
-
-static basic_block curr_bb;
-
-/* Current register pressure for each pressure class.  */
-static int curr_reg_pressure[N_REG_CLASSES];
-
 
 static void compute_can_copy (void);
 static void *gmalloc (size_t) ATTRIBUTE_MALLOC;
@@ -494,8 +471,6 @@ static bool should_hoist_expr_to_dom (basic_block, struct 
gcse_expr *,
  enum reg_class,
  int *, bitmap, rtx_insn *);
 static bool hoist_code (void);
-static enum reg_class get_regno_pressure_class (int regno, int *nregs);
-static enum reg_class get_pressure_class_and_nregs (rtx_insn *insn, int 
*nregs);
 static bool one_code_hoisting_pass (void);
 static rtx_insn *process_insert_insn (struct gcse_expr *);
 static bool pre_edge_insert (struct edge_list *, struct gcse_expr **);
@@ -2402,7 +2377,7 @@ record_set_data (rtx dest, const_rtx set, void *data)
 }
 }
 
-static const_rtx
+const_rtx
 single_set_gcse (rtx_insn *insn)
 {
   struct set_data s;
@@ -2804,72 +2779,6 @@ compute_code_hoist_data (void)
 fprintf (dump_file, "\n");
 }
 
-/* Update register pressure for BB when hoisting an expression from
-   instruction FROM, if live ranges of inputs are shrunk.  Also
-   maintain live_in information if live range of register referred
-   in FROM is shrunk.
-   
-   Return 0 if register pressure doesn't change, otherwise return
-   the number by which register pressure is decreased.
-   
-   NOTE: Register pressure won't be increased in this function.  */
-
-static int
-update_bb_reg_pressure (basic_block bb, rtx_insn *from)
-{
-  rtx dreg;
-  rtx_insn *insn;
-  basic_block succ_bb;
-  df_ref use, op_ref;
-  edge succ;
-  edge_iterator ei;
-  int decreased_pressure = 0;
-  int nregs;
- 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>That looks like a costing problem and can hopefully solveable by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)
>
>  - pred-combine-and.c: Here the new propagation appears useful at first.
>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.
>
>  - Addressing fails from before:  I looked into these and suspect all of
>them are a similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the address load and
>we require an aarch64_split_add_offset.  This will emit the longer
>sequence of ashiftl and subtract.  The "base" address is x0 here so
>we cannot convert (minus x0 ...)) into neg.
>I didn't go through all of aarch64_split_add_offset.  I suppose we
>could re-add the separation of base/offset there but that might be
>a loss when the result is not used as an address. 
>
> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?

Thanks for giving it a go.  Can you post the latest version of the
regpressure patch too?  The previous on-list version I could find
seems to be too old.

Thanks,
Richard

> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>   handling.
>   (fwprop_propagation::update_register_pressure): New function.
>   (fwprop_propagation::register_pressure_high_p): New function
>   (reg_single_def_for_src_p): Look through unary expressions.
>   (try_fwprop_subst_pattern): Check register pressure.
>   (forward_propagate_into): Call new function.
>   (fwprop_init): Init register pressure.
>   (fwprop_done): Clean up register pressure.
>   (fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc | 359 +-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Robin Dapp via Gcc-patches
Hi Richard,

I did some testing with the attached v2 that does not restrict to UNARY
anymore.  As feared ;) there is some more fallout that I'm detailing below.

On Power there is one guality fail (pr43051-1.c) that I would take
the liberty of ignoring for now.

On x86 there are four fails:

 - cond_op_addsubmuldiv__Float16-2.c: assembler error
   unsupported masking for `vmovsh'.  I guess that's a latent backend
   problem.

 - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
   (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
   That looks like a costing problem and can hopefully solveable by making
   the second compare more expensive, preventing the propagation.
   i386 costing (or every costing?) is brittle so that could well break other
   things. 

 - pr88873.c: This is interesting because even before this patch we
   propagated with different register classes (V2DF vs DI).  With the patch
   we check the register pressure, find the class NO_REGS for V2DF and
   abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
   of keeping the old behavior for reg-reg propagations and only checking
   the pressure for more complex operations.

aarch64 has the most fails:

 - One guality fail (same as Power).
 - shrn-combine-[123].c as before.

 - A class of (hopefully, I only checked some) similar cases where we
   propagate an unspec_whilelo into an unspec_ptest.  Before we would only
   set a REG_EQUALS note.
   Before we managed to create a while_ultsivnx16bi_cc whereas now we have
   while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
   We create redundant whilelos and I'm not sure how to improve that. I
   guess a peephole is out of the question :)

 - pred-combine-and.c: Here the new propagation appears useful at first.
   We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
   individual and registers remain live up to the propagation site (while
   being dead before the patch).
   With the registers dead, combine could create a single fcmgt before.
   Now it only manages a 2->2 combination because we still need the registers
   and end up with two fcmgts.
   The code is worse but this seems more bad luck than anything.

 - Addressing fails from before:  I looked into these and suspect all of
   them are a similar.
   What happens is that we have a poly_int offset that we shift, negate
   and then add to x0.  The result is used as load address.
   Before, we would pull (combine) the (plus x0 reg) into the load keeping
   the neg and shift.
   Now we propagate everything into a single (set (minus x0 offset)).
   The propagation itself seems worthwhile because we save one insn.
   However as we got rid of the base/offset split by lumping everything
   together, combine cannot pull the (plus) into the address load and
   we require an aarch64_split_add_offset.  This will emit the longer
   sequence of ashiftl and subtract.  The "base" address is x0 here so
   we cannot convert (minus x0 ...)) into neg.
   I didn't go through all of aarch64_split_add_offset.  I suppose we
   could re-add the separation of base/offset there but that might be
   a loss when the result is not used as an address. 
   
Again, all in all no fatal problems but pretty annoying :)  It's not much
but just gradually worse than with just UNARY.  Any idea on how/whether to
continue?

Regards
 Robin

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc | 359 +-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
 2 files changed, 419 insertions(+), 4 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..ce6f5a74b00 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Robin Dapp via Gcc-patches
> I imagine doing it in reverse postorder would still make sense.
> 
> But my point was that, for the current fwprop limitation of substituting
> into exactly one use of a register, we can check whether that use is
> the *only* use of register.
> 
> I.e. if we substitute:
> 
>   A: (set (reg R1) (foo (reg R2)))
> 
> into:
> 
>   B: (set ... (reg R1) ...)
> 
> if R1 and R2 are likely to be in the same register class, and if B
> is the only user of R2, then we don't need to calculate register
> pressure.  The change is either neutral (if R2 died in A) or an
> improvement (if R2 doesn't die in A, and so R1 and R2 were previously
> live at the same time).

Ah, understood, thanks.  Sure, that one I can include.

Regards
 Robin


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> So I don't think I have a good feel for the advantages and disadvantages
>> of doing this.  Robin's analysis of the aarch64 changes was nice and
>> detailed though.  I think the one that worries me most is the addressing
>> mode one.  fwprop is probably the first chance we get to propagate adds
>> into addresses, and virtual register elimination means that some of
>> those opportunities won't show up in gimple.
>> 
>> There again, virtual register elimination wouldn't be the reason for
>> the ld4_s8.c failure.  Perhaps there's something missing in expand.
>> 
>> Other than that, I think my main question is: why just unary operations?
>> Is the underlying assumption that we only want to propagate a maximum of
>> one register?  If so, then I think we should check for that directly, by
>> iterating over subrtxes.
>
> The main reason for stopping at unary operations was to limit the scope
> and change as little as possible (not restricting the change to one
> register).  I'm currently testing a v2 that iterates over subrtxs.

Thanks.  Definitely no problem with doing things in small steps, but IMO
it's better if each choice of step can still be justified in its own terms.

>> Perhaps we should allow the optimisation without register-pressure
>> information if (a) the source register and destination register are
>> in the same pressure class and (b) all uses of the destination are
>> being replaced.  (FWIW, rtl-ssa should make it easier to try to
>> replace all definitions at once, with an all-or-nothing choice,
>> if we ever wanted to do that.)
>
> I presume you're referring to replacing one register (dest) in all using
> insns?  Source and destination are somewhat overloaded in fwprop context
> because I'm thinking of the "to be replaced" register as dest when it's
> actually the replacement register.

Yeah.

> AFAICT fwprop currently iterates over insns, going through all their uses
> and trying if an individual use can be substituted.  Do you suggest to
> change this general iteration order to iterate over the defs of an insn
> and then try to replace all the uses at once (e.g. using ssa->change_insns)?

No, I was just noting in passing that we could try do that if we wanted to.
The current code is a fairly mechanical conversion of the original DF-based
code, but there's no reason why it has to continue to work the way it
does now.

> When keeping the current order, wouldn't we need to store all potential
> changes instead of committing them and later apply them in bulk, e.g.
> grouped by use?  This order would also help to pick the propagation
> with the most number of uses (i.e. propagation potential) but maybe
> I'm misunderstanding?

I imagine doing it in reverse postorder would still make sense.

But my point was that, for the current fwprop limitation of substituting
into exactly one use of a register, we can check whether that use is
the *only* use of register.

I.e. if we substitute:

  A: (set (reg R1) (foo (reg R2)))

into:

  B: (set ... (reg R1) ...)

if R1 and R2 are likely to be in the same register class, and if B
is the only user of R2, then we don't need to calculate register
pressure.  The change is either neutral (if R2 died in A) or an
improvement (if R2 doesn't die in A, and so R1 and R2 were previously
live at the same time).

Thanks,
Richard


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Robin Dapp via Gcc-patches
> So I don't think I have a good feel for the advantages and disadvantages
> of doing this.  Robin's analysis of the aarch64 changes was nice and
> detailed though.  I think the one that worries me most is the addressing
> mode one.  fwprop is probably the first chance we get to propagate adds
> into addresses, and virtual register elimination means that some of
> those opportunities won't show up in gimple.
> 
> There again, virtual register elimination wouldn't be the reason for
> the ld4_s8.c failure.  Perhaps there's something missing in expand.
> 
> Other than that, I think my main question is: why just unary operations?
> Is the underlying assumption that we only want to propagate a maximum of
> one register?  If so, then I think we should check for that directly, by
> iterating over subrtxes.

The main reason for stopping at unary operations was to limit the scope
and change as little as possible (not restricting the change to one
register).  I'm currently testing a v2 that iterates over subrtxs.

> Perhaps we should allow the optimisation without register-pressure
> information if (a) the source register and destination register are
> in the same pressure class and (b) all uses of the destination are
> being replaced.  (FWIW, rtl-ssa should make it easier to try to
> replace all definitions at once, with an all-or-nothing choice,
> if we ever wanted to do that.)

I presume you're referring to replacing one register (dest) in all using
insns?  Source and destination are somewhat overloaded in fwprop context
because I'm thinking of the "to be replaced" register as dest when it's
actually the replacement register.

AFAICT fwprop currently iterates over insns, going through all their uses
and trying if an individual use can be substituted.  Do you suggest to
change this general iteration order to iterate over the defs of an insn
and then try to replace all the uses at once (e.g. using ssa->change_insns)?
When keeping the current order, wouldn't we need to store all potential
changes instead of committing them and later apply them in bulk, e.g.
grouped by use?  This order would also help to pick the propagation
with the most number of uses (i.e. propagation potential) but maybe
I'm misunderstanding?

Regards
 Robin



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-29 Thread Richard Sandiford via Gcc-patches
Jeff Law  writes:
> On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:
>> Ping.  I refined the code and some comments a bit and added a test
>> case.
>> 
>> My question in general would still be:  Is this something we want
>> given that we potentially move some of combine's work a bit towards
>> the front of the RTL pipeline?
>> 
>> Regards
>>   Robin
>> 
>> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.
>> 
>> This patch enables the forwarding of UNARY_P sources.  As this
>> involves potentially replacing a vector register with a scalar register
>> the ira_hoist_pressure machinery is used to calculate the change in
>> register pressure.  If the propagation would increase the pressure
>> beyond the number of hard regs, we don't perform it.
>> 
>> gcc/ChangeLog:
>> 
>>  * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>>  handling.
>>  (fwprop_propagation::update_register_pressure): New function.
>>  (fwprop_propagation::register_pressure_high_p): New function
>>  (reg_single_def_for_src_p): Look through unary expressions.
>>  (try_fwprop_subst_pattern): Check register pressure.
>>  (forward_propagate_into): Call new function.
>>  (fwprop_init): Init register pressure.
>>  (fwprop_done): Clean up register pressure.
>>  (fwprop_insn): Add comment.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> So I was hoping that Richard S. would chime in here as he knows this 
> code better than anyone.

Heh, I'm not sure about that.  I rewrote the code to use rtl-ssa,
so in that sense I'm OK with the framework side.  But I tried to
preserve the decisions that the old pass made as closely as possible.
I don't know why most of those decisions were made (which is why I just
kept them).

So I don't think I have a good feel for the advantages and disadvantages
of doing this.  Robin's analysis of the aarch64 changes was nice and
detailed though.  I think the one that worries me most is the addressing
mode one.  fwprop is probably the first chance we get to propagate adds
into addresses, and virtual register elimination means that some of
those opportunities won't show up in gimple.

There again, virtual register elimination wouldn't be the reason for
the ld4_s8.c failure.  Perhaps there's something missing in expand.

Other than that, I think my main question is: why just unary operations?
Is the underlying assumption that we only want to propagate a maximum of
one register?  If so, then I think we should check for that directly, by
iterating over subrtxes.

That way we can handle things like binary operations involving a
register and a constant, and unspecs with a single non-constant operand.

I imagine the check would be something like:

  unsigned int nregs = 0;
  for (each subrtx x)
{
  if (MEM_P (x))
return false;
  if (SUBREG_P (x) && .../*current conditions */...)
return false;
  if (REG_P (x))
{
  nregs += 1;
  if (nregs > 1)
return false;
}
}
  return true;

Perhaps we should allow the optimisation without register-pressure
information if (a) the source register and destination register are
in the same pressure class and (b) all uses of the destination are
being replaced.  (FWIW, rtl-ssa should make it easier to try to
replace all definitions at once, with an all-or-nothing choice,
if we ever wanted to do that.)

Thanks,
Richard

>
> This looks like a much better implementation of something I've done 
> before :-)  Basically imagine a target where a sign/zero extension can 
> be folded into arithmetic for free.  We put in various hacks to this 
> code to encourage more propagations of extensions.
>
> I still think this is valuable.  As we lower from gimple->RTL we're 
> going to still have artifacts in the RTL that we're going to want to 
> optimize away.  fwprop has certain advantages over combine, including 
> the fact that it runs earlier, pre-loop.
>
>
> It looks generally sensible to me.  But give Richard S. another week to 
> chime in.  He seems to be around, but may be slammed with stuff right now.
>
> jeff


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-28 Thread Jeff Law via Gcc-patches




On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:

Ping.  I refined the code and some comments a bit and added a test
case.

My question in general would still be:  Is this something we want
given that we potentially move some of combine's work a bit towards
the front of the RTL pipeline?

Regards
  Robin

Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.

This patch enables the forwarding of UNARY_P sources.  As this
involves potentially replacing a vector register with a scalar register
the ira_hoist_pressure machinery is used to calculate the change in
register pressure.  If the propagation would increase the pressure
beyond the number of hard regs, we don't perform it.

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
So I was hoping that Richard S. would chime in here as he knows this 
code better than anyone.


This looks like a much better implementation of something I've done 
before :-)  Basically imagine a target where a sign/zero extension can 
be folded into arithmetic for free.  We put in various hacks to this 
code to encourage more propagations of extensions.


I still think this is valuable.  As we lower from gimple->RTL we're 
going to still have artifacts in the RTL that we're going to want to 
optimize away.  fwprop has certain advantages over combine, including 
the fact that it runs earlier, pre-loop.



It looks generally sensible to me.  But give Richard S. another week to 
chime in.  He seems to be around, but may be slammed with stuff right now.


jeff



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-24 Thread Robin Dapp via Gcc-patches
Ping.  I refined the code and some comments a bit and added a test
case.

My question in general would still be:  Is this something we want
given that we potentially move some of combine's work a bit towards
the front of the RTL pipeline?

Regards
 Robin

Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.

This patch enables the forwarding of UNARY_P sources.  As this
involves potentially replacing a vector register with a scalar register
the ira_hoist_pressure machinery is used to calculate the change in
register pressure.  If the propagation would increase the pressure
beyond the number of hard regs, we don't perform it.

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc | 314 +-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
 2 files changed, 371 insertions(+), 7 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..b49d4e4ced4 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  */
+sbitmap pressure_accounted;
+
 /* Do not try to replace constant addresses or addresses of local and
argument slots.  These MEM expressions are made only once and inserted
in many instructions, as well as being used to control symbol table
@@ -181,6 +189,8 @@ namespace
 bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
 bool folded_to_constants_p () const;
 bool profitable_p () const;
+bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
+bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
 
 bool check_mem (int, rtx) final override;
 void note_simplification (int, uint16_t, rtx, rtx) final override;
@@ -332,25 +342,247 @@ fwprop_propagation::profitable_p () const
   && (result_flags & PROFITABLE))
 return true;
 
-  if (REG_P (to))
+  /* Only continue with an unary operation if we consider register
+ pressure.  */
+  rtx what = copy_rtx (to);
+  if (UNARY_P (what) && flag_ira_hoist_pressure)
+what = XEXP (what, 0);
+
+  if (REG_P (what))
 return true;
 
-  if (GET_CODE (to) == SUBREG
-  && REG_P (SUBREG_REG (to))
-  && !paradoxical_subreg_p (to))
+  if (GET_CODE (what) == SUBREG
+  && REG_P (SUBREG_REG (what))
+  && !paradoxical_subreg_p (what))
 return true;
 
-  if (CONSTANT_P (to))
+  if (CONSTANT_P (what))
 return true;
 
   return false;
 }
 
-/* Check that X has a single def.  */
+/* Check if the register pressure in any predecessor block of USE's block
+   until DEF's block is equal or higher to the number of hardregs in NU's
+   register class.  */
+bool
+fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
+ rtx_insn *use) const
+{
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), _nregs);
+  old_class
+= regpressure_get_regno_pressure_class (REGNO (old), _nregs);
+
+  if (nu_class == NO_REGS && old_class == NO_REGS)
+return true;
+
+  if (nu_class == old_class)
+return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  basic_block bb;
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec q;
+  q.safe_push (bbto);
+
+  while (!q.is_empty ())
+{
+  bb = q.pop ();
+
+  if (bitmap_bit_p (visited, bb->index))
+   continue;
+
+  /* Nothing to do if the register to be replaced is not live
+in this BB.  */
+  if (bb != bbfrom && !regpressure_is

[PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-07 Thread Robin Dapp via Gcc-patches
Hi,

originally inspired by the wish to transform

 vmv v3, a0 ; = vec_duplicate
 vadd.vv v1, v2, v3

into

 vadd.vx v1, v2, a0

via fwprop for riscv, this patch enables the forward propagation
of UNARY_P sources.

As this involves potentially replacing a vector register with
a scalar register the ira_hoist_pressure machinery is used to
calculate the change in register pressure.  If the propagation
would increase the pressure beyond the number of hard regs, we
don't perform it.

The regpressure commit this patch depends on is not yet pushed
because I figured I'd post the code using it in case of further
comments.

The testsuite is unchanged on i386 and power10 but aarch64 has
some new FAILs but I'm not terribly familiar with aarch64, hence
some examples here:

The following cases shrn-combine-1/2/3 seem worse as we emit one
instruction more.

Before:
  shrnv30.8b, v30.8h, 2
  shrn2   v30.16b, v31.8h, 2
  str q30, [x1, x3]
After:
  ushrv30.8h, v30.8h, 2
  ushrv31.8h, v31.8h, 2
  uzp1v31.16b, v30.16b, v31.16b
  str q31, [x1, x3]

Here, one optimization already happens in fwprop so combine
cannot do the same work it did before.

vec-init-22-speed.c changes from
  sxthw0, w0
  sxthw1, w1
  fmovd31, x0
  fmovd0, x1
to:
  dup v31.4h, w0
  dup v0.4h, w1 
which I hope has the same semantics and is shorter.

Apart from that there are numerous check-asm testcases where
a new, earlier propagation prevents a later, supposedly better
propagation.  One such example is from ld4_s8.c:

  (insn 11 10 12 2 (set (reg:DI 100) 
  (neg:DI (reg:DI 102))) 385 {negdi2}
   (expr_list:REG_EQUAL (const_poly_int:DI [-576, -576])
   
  (nil)))
  (insn 12 11 13 2 (set (reg/f:DI 99)
  (plus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 100))) 153 {*adddi3_aarch64}
   (expr_list:REG_EQUAL (plus:DI (reg/v/f:DI 97 [ x0 ]) 
   
  (const_poly_int:DI [-576, -576])) 
   
  (nil)))
  (insn 13 12 14 2 (set (reg/v:VNx64QI 94 [ z0 ])   
   
  (unspec:VNx64QI [
  (reg/v:VNx16BI 96 [ p0 ])
  (mem:VNx64QI (reg/f:DI 99) [0 MEM  [(signed char 
*)_1]+0 S[64, 64] A8])
  ] UNSPEC_LDN)) 5885 {vec_mask_load_lanesvnx64qivnx16qi}   
   
   (nil)) 

where we now do:

  propagating insn 11 into insn 12, replacing:
  (set (reg/f:DI 99)
  (plus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 100)))
  successfully matched this instruction to subdi3:
  (set (reg/f:DI 99)
  (minus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 102)))

vs before:

  cannot propagate from insn 11 into insn 12: would increase complexity of 
pattern
  
  propagating insn 12 into insn 13, replacing:
  (set (reg/v:VNx64QI 94 [ z0 ])
  (unspec:VNx64QI [
  (reg/v:VNx16BI 96 [ p0 ])
  (mem:VNx64QI (reg/f:DI 99) [0 MEM  [(signed char 
*)_1]+0 S[64, 64] A8])
  ] UNSPEC_LDN))
  successfully matched this instruction to vec_mask_load_lanesvnx64qivnx16qi:
  (set (reg/v:VNx64QI 94 [ z0 ])
  (unspec:VNx64QI [
  (reg/v:VNx16BI 96 [ p0 ])
  (mem:VNx64QI (plus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 100)) [0 MEM  [(signed char *)_1]+0 
S[64, 64] A8])
  ] UNSPEC_LDN))

All in all this seems like a general problem with earlier
optimizations preventing later ones and surely all of those
could be fixed individually.  Still, the question remains if
the general approach is useful or desired or if we not rather
prevent more optimizations that we gain.  Suggestions welcome.

I have some riscv tests for it but didn't attach them yet
in order to focus on the main part first.

Regards
 Robin

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.
---
 gcc/fwprop.cc | 307 --
 1 file changed, 300 insertions(+), 7 deletions(-)

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..413fe4e7335 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This