On 6/6/24 4:10 AM, Manolis Tsamis wrote:
This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example, it can transform this:

      strb    w2, [x1, 1]
      ldr     x0, [x1]      # Expensive store forwarding to larger load.

To:

      ldr     x0, [x1]
      strb    w2, [x1, 1]
      bfi     x0, x2, 8, 8

Assembly like this can appear with bitfields or type punning / unions.
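
For instance, a hypothetical union-based snippet like the following (not taken
from the testsuite) can produce the pattern above on little-endian AArch64,
assuming the compiler keeps both accesses:

    /* Narrow store into one byte of a union, immediately followed by a
       full-width read of the overlapping member.  The store is smaller
       than the load, so the store buffer entry cannot be forwarded.  */
    union u
    {
      unsigned long whole;
      unsigned char bytes[8];
    };

    unsigned long
    set_byte_then_read (union u *p, unsigned char v)
    {
      p->bytes[1] = v;   /* byte store at offset 1             */
      return p->whole;   /* 8-byte load of the same object     */
    }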
On stress-ng, running the cpu-union microbenchmark, the following speedups
have been observed:

   Neoverse-N1:      +29.4%
   Intel Coffeelake: +13.1%
   AMD 5950X:        +17.5%

gcc/ChangeLog:

        * Makefile.in: Add avoid-store-forwarding.o.
        * common.opt: New option -favoid-store-forwarding.
        * params.opt: New param store-forwarding-max-distance.
        * doc/invoke.texi: Document new pass.
        * doc/passes.texi: Document new pass.
        * passes.def: Schedule a new pass.
        * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
        * avoid-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
        * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
So this is getting a lot more interesting. I think the first time I looked at this it was more concerned with stores feeding something like a load-pair and avoiding the store forwarding penalty for that case. Am I mis-remembering, or did it get significantly more general?





+
+static unsigned int stats_sf_detected = 0;
+static unsigned int stats_sf_avoided = 0;
+
+static rtx
+get_load_mem (rtx expr)
Needs a function comment. You should probably mention that EXPR must be a single_set in that comment.
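
Something along these lines, perhaps (assumed wording; the precise semantics
are of course the patch author's to state):

    /* If EXPR, which must be the PATTERN of a single_set insn, loads from
       memory, return the MEM rtx being read; otherwise return NULL_RTX.  */
    static rtx
    get_load_mem (rtx expr)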



 +
+  rtx dest;
+  if (eliminate_load)
+    dest = gen_reg_rtx (load_inner_mode);
+  else
+    dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+      rtx_insn *insns = NULL;
+
+      /* If we're eliminating the load then find the store with zero offset
+        and use it as the base register to avoid a bit insert.  */
+      if (eliminate_load && it->offset == 0)
So how often is this triggering? We have various bits of code in the gimple optimizers to detect a store followed by a load from the same address and do the forwarding. If these cases are happening with any frequency, that would be a good sign the code in DOM and elsewhere isn't working well.

The way those passes detect this case is to take the store, flip the operands around (i.e., it looks like a load) and enter that into the expression hash tables. After that, standard redundancy elimination approaches will work.
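
For contrast, the same-size case those tree passes already handle looks
something like this (hypothetical example):

    /* Same-size store then load from the same address: treating the store
       as if it were a load of X makes the second access redundant, so it
       is replaced by X and no load (and no stall) remains.  */
    int
    store_then_load (int *p, int x)
    {
      *p = x;
      return *p;
    }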


+       {
+         start_sequence ();
+
+         /* We can use a paradoxical subreg to force this to a wider mode, as
+            the only use will be inserting the bits (i.e., we don't care about
+            the value of the higher bits).  */
Which may be a good hint about the cases you're capturing -- if the modes/sizes differ, that would make more sense, since I don't think we're as likely to be capturing those cases.
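
A size-mismatched shape that the tree passes would generally leave alone might
look like this (hypothetical, not from the testsuite):

    /* One-byte store into a struct member followed by a four-byte copy of
       the whole object: the widths differ, so the tree-level forwarding
       does not kick in and the RTL pass still sees the store/load pair.  */
    struct packet
    {
      unsigned char flags;
      unsigned char kind;
      unsigned short len;
    };

    struct packet
    set_flags (struct packet *p, unsigned char f)
    {
      p->flags = f;   /* 1-byte store                  */
      return *p;      /* 4-byte load of the same bytes */
    }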


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
  This option is enabled by default at optimization levels @option{-O1},
  @option{-O2}, @option{-O3}, @option{-Os}.
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
Is there any particular reason why this would be off by default at -O1 or higher? It would seem to me that on modern cores this transformation should easily be a win. Even on an old in-order core, avoiding the load with the bit insert is likely profitable, just not as much so.




diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..b8115f5c27a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
+Maximum number of instruction distance that a small store forwarded to a larger load may stall.
I think you may need to run the update-urls script since you've added a new option.
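
For what it's worth, exercising both the new option and the new param would
look something like this (names taken from this patch):

    gcc -O2 -favoid-store-forwarding --param=store-forwarding-max-distance=20 -S test.c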


In general it seems pretty reasonable.

I've actually added it to my tester just to see if there's any fallout. It'll take a week to churn through the long running targets that bootstrap in QEMU, but the crosses should have data Monday.

jeff
