On 6/6/24 4:10 AM, Manolis Tsamis wrote:
This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example, it can transform this:

      strb    w2, [x1, 1]
      ldr     x0, [x1]      # Expensive store forwarding to larger load.

To:

      ldr     x0, [x1]
      strb    w2, [x1, 1]
      bfi     x0, x2, 8, 8

Assembly like this can appear with bitfields or type punning / unions.
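
For instance, a hypothetical union-based snippet like the following (not taken
from the testsuite) can produce the pattern above on little-endian AArch64,
assuming the compiler keeps both accesses:

    /* Narrow store into one byte of a union, immediately followed by a
       full-width read of the overlapping member.  The store is smaller
       than the load, so the store buffer entry cannot be forwarded.  */
    union u
    {
      unsigned long whole;
      unsigned char bytes[8];
    };

    unsigned long
    set_byte_then_read (union u *p, unsigned char v)
    {
      p->bytes[1] = v;   /* byte store at offset 1             */
      return p->whole;   /* 8-byte load of the same object     */
    }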
On stress-ng, running the cpu-union microbenchmark, the following speedups
have been observed:

   Neoverse-N1:      +29.4%
   Intel Coffeelake: +13.1%
   AMD 5950X:        +17.5%

gcc/ChangeLog:

        * Makefile.in: Add avoid-store-forwarding.o.
        * common.opt: New option -favoid-store-forwarding.
        * params.opt: New param store-forwarding-max-distance.
        * doc/invoke.texi: Document new pass.
        * doc/passes.texi: Document new pass.
        * passes.def: Schedule a new pass.
        * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
        * avoid-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
So this is getting a lot more interesting. I think the first time I looked at this it was more concerned with stores feeding something like a load-pair and avoiding the store forwarding penalty for that case. Am I mis-remembering, or did it get significantly more general?





+
+static unsigned int stats_sf_detected = 0;
+static unsigned int stats_sf_avoided = 0;
+
+static rtx
+get_load_mem (rtx expr)
Needs a function comment. You should probably mention that EXPR must be a single_set in that comment.
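
Something along these lines, perhaps (assumed wording; the precise semantics
are of course the patch author's to state):

    /* If EXPR, which must be the PATTERN of a single_set insn, loads from
       memory, return the MEM rtx being read; otherwise return NULL_RTX.  */
    static rtx
    get_load_mem (rtx expr)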



 +
+  rtx dest;
+  if (eliminate_load)
+    dest = gen_reg_rtx (load_inner_mode);
+  else
+    dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+      rtx_insn *insns = NULL;
+
+      /* If we're eliminating the load then find the store with zero offset
+        and use it as the base register to avoid a bit insert.  */
+      if (eliminate_load && it->offset == 0)
So how often is this triggering? We have various bits of code in the gimple optimizers to detect a store followed by a load from the same address and do the forwarding. If these cases are happening with any frequency, that would be a good sign the code in DOM and elsewhere isn't working well.

The way those passes detect this case is to take the store, flip the operands around (i.e., it looks like a load) and enter that into the expression hash tables. After that, standard redundancy elimination approaches will work.
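
For contrast, the same-size case those tree passes already handle looks
something like this (hypothetical example):

    /* Same-size store then load from the same address: treating the store
       as if it were a load of X makes the second access redundant, so it
       is replaced by X and no load (and no stall) remains.  */
    int
    store_then_load (int *p, int x)
    {
      *p = x;
      return *p;
    }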


+       {
+         start_sequence ();
+
+         /* We can use a paradoxical subreg to force this to a wider mode, as
+            the only use will be inserting the bits (i.e., we don't care about
+            the value of the higher bits).  */
Which may be a good hint about the cases you're capturing -- if the modes/sizes differ, that would make more sense, since I don't think we're as likely to be capturing those cases.
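
A size-mismatched shape that the tree passes would generally leave alone might
look like this (hypothetical, not from the testsuite):

    /* One-byte store into a struct member followed by a four-byte copy of
       the whole object: the widths differ, so the tree-level forwarding
       does not kick in and the RTL pass still sees the store/load pair.  */
    struct packet
    {
      unsigned char flags;
      unsigned char kind;
      unsigned short len;
    };

    struct packet
    set_flags (struct packet *p, unsigned char f)
    {
      p->flags = f;   /* 1-byte store                  */
      return *p;      /* 4-byte load of the same bytes */
    }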


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
  This option is enabled by default at optimization levels @option{-O1},
  @option{-O2}, @option{-O3}, @option{-Os}.
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
Is there any particular reason why this would be off by default at -O1 or higher? It would seem to me that on modern cores this transformation should easily be a win. Even on an old in-order core, avoiding the load with the bit insert is likely profitable, just not as much so.




diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..b8115f5c27a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
+Maximum number of instruction distance that a small store forwarded to a larger load may stall.
I think you may need to run the update-urls script since you've added a new option.
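
For what it's worth, exercising both the new option and the new param would
look something like this (names taken from this patch):

    gcc -O2 -favoid-store-forwarding --param=store-forwarding-max-distance=20 -S test.c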


In general it seems pretty reasonable.

I've actually added it to my tester just to see if there's any fallout. It'll take a week to churn through the long running targets that bootstrap in QEMU, but the crosses should have data Monday.

jeff
