On 6/6/24 4:10 AM, Manolis Tsamis wrote:
This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example, it can transform this:
strb w2, [x1, 1]
ldr x0, [x1] # Expensive store forwarding to larger load.
To:
ldr x0, [x1]
strb w2, [x1, 1]
bfi x0, x2, 8, 8
Assembly like this can appear with bitfields or type punning / unions.
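For illustration, a small C function of the sort that can produce the
assembly above (a hypothetical example, not one of the patch's testcases):

    union u { unsigned char bytes[8]; unsigned long word; };

    unsigned long
    punned_read (union u *p, unsigned char x)
    {
      p->bytes[1] = x;   /* small byte store at offset 1 (strb) */
      return p->word;    /* wider load overlapping it (ldr) */
    }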
On stress-ng, when running the cpu-union microbenchmark, the following
speedups have been observed:
Neoverse-N1: +29.4%
Intel Coffeelake: +13.1%
AMD 5950X: +17.5%
gcc/ChangeLog:
* Makefile.in: Add avoid-store-forwarding.o.
* common.opt: New option -favoid-store-forwarding.
* params.opt: New param store-forwarding-max-distance.
* doc/invoke.texi: Document new pass.
* doc/passes.texi: Document new pass.
* passes.def: Schedule a new pass.
* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
* avoid-store-forwarding.cc: New file.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
So this is getting a lot more interesting. I think the first time I
looked at this it was more concerned with stores feeding something like
a load-pair and avoiding the store forwarding penalty for that case. Am
I mis-remembering, or did it get significantly more general?
+
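+/* Statistics counters: number of store-forwarding cases detected and
+   number avoided by the transformation.  */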
+static unsigned int stats_sf_detected = 0;
+static unsigned int stats_sf_avoided = 0;
+
+static rtx
+get_load_mem (rtx expr)
Needs a function comment. You should probably mention that EXPR must be
a single_set in that comment.
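For instance (just a sketch; it assumes the function returns the MEM
being read, and NULL_RTX otherwise):

    /* Return the MEM rtx of the load contained in the single_set EXPR,
       or NULL_RTX if EXPR does not read from memory.  */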
+
+ rtx dest;
+ if (eliminate_load)
+ dest = gen_reg_rtx (load_inner_mode);
+ else
+ dest = SET_DEST (load);
+
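+  /* Index of a zero-offset store to move to the front and use as the
+     base of the reconstructed value (-1 if none), and the running cost
+     of the generated bit-insert sequences.  */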
+ int move_to_front = -1;
+ int total_cost = 0;
+
+ /* Check if we can emit bit insert instructions for all forwarded stores. */
+ FOR_EACH_VEC_ELT (stores, i, it)
+ {
+ it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+ rtx_insn *insns = NULL;
+
+ /* If we're eliminating the load then find the store with zero offset
+ and use it as the base register to avoid a bit insert. */
+ if (eliminate_load && it->offset == 0)
So how often is this triggering? We have various bits of code in the
gimple optimizers to detect a store followed by a load from the same
address and do the forwarding. If these cases are happening with any
frequency, that would be a good sign the code in DOM and elsewhere isn't
working well.
The way these passes detect this case is to take the store, flip the
operands around (i.e., so it looks like a load) and enter that into the
expression hash tables. After that, standard redundancy elimination
approaches will work.
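In other words, for a same-sized access the existing machinery already
eliminates the load entirely (a source-level sketch of the effect, not
actual pass code):

    *p = x;   /* store; entered into the tables as if it were  x = *p  */
    y = *p;   /* redundant load: DOM can rewrite this to  y = x        */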
+ {
+ start_sequence ();
+
+ /* We can use a paradoxical subreg to force this to a wider mode, as
+ the only use will be inserting the bits (i.e., we don't care about
+ the value of the higher bits). */
Which may be a good hint about the cases you're capturing -- if the
modes/sizes differ that would make more sense since I don't think we're
as likely to be capturing those cases.
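For reference, a sketch of the paradoxical-subreg trick being described
(assuming a DImode load; lowpart_subreg yields a paradoxical subreg when
the requested mode is wider than the register's own mode):

    /* (subreg:DI (reg:QI ...) 0) -- the upper bits are don't-care,
       which is fine since only the low bits get inserted.  */
    rtx wide = lowpart_subreg (DImode, it->mov_reg,
                               GET_MODE (it->mov_reg));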
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
This option is enabled by default at optimization levels @option{-O1},
@option{-O2}, @option{-O3}, @option{-Os}.
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores. This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
Is there any particular reason why this would be off by default at -O1
or higher? It would seem to me that on modern cores this
transformation should easily be a win. Even on an old in-order core,
avoiding the load with the bit insert is likely profitable, just not as
much so.
diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..b8115f5c27a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
+Maximum instruction distance over which a small store forwarded to a larger load may cause a stall.
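For reference, a hypothetical invocation combining the new option and
param:

    gcc -O2 -favoid-store-forwarding --param=store-forwarding-max-distance=20 test.c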
I think you may need to run the update-urls script since you've added a
new option.
In general it seems pretty reasonable.
I've actually added it to my tester just to see if there's any fallout.
It'll take a week to churn through the long running targets that
bootstrap in QEMU, but the crosses should have data Monday.
jeff