On 11/6/24 12:11, Vineet Gupta wrote:
> changes since v1
> * Changed target hook to --param
> * squash addon patch for RISC-V opting-in, testcase here
> * updated changelog with latest perf numbers
ping !
> ---
>
> sched1 computes ECC (Excess Change Cost) for each insn, which represents
> the register pressure attributed to the insn.
> Currently the pressure sensitive schduling algorithm deliberately ignores
> negative values (pressure reduction), making them 0 (neutral), leading
> to more spills. This happens due to the assumption that the compiler has
> a reasonably accurate processor pipeline scheduling model and thus tries
> to aggresively fill pipeline bubbles with spill slots.
>
> This however might not be true, as the model might not be available for
> certains uarches or even applicable especially for modern out-of-order cores.
>
> The existing heuristic induces spill frenzy on RISC-V, noticably so on
> SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
> total dynamic icounts for this workload are reduced in half from
> ~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns).
>
> This patch adds --param=cycle-accurate-model={0,1} to gate the spill
> behavior.
>
> - The default (1) preserves existing spill behavior.
>
> - targets/uarches sensitive to spilling can override the param to (0)
> to get the reverse effect. RISC-V backend does so too.
>
> The actual perf numbers are very promising.
>
> (1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:
>
> Before:
> ------
> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
> 4,917,712.97 msec task-clock:u # 1.000 CPUs
> utilized
> 5,314 context-switches:u # 1.081 /sec
> 3 cpu-migrations:u # 0.001 /sec
> 204,784 page-faults:u # 41.642 /sec
> 7,868,291,222,513 cycles:u # 1.600 GHz
> 2,615,069,866,153 instructions:u # 0.33 insn per
> cycle
> 10,799,381,890 branches:u # 2.196 M/sec
> 15,714,572 branch-misses:u # 0.15% of all
> branches
>
> After:
> -----
> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
> 4,552,979.58 msec task-clock:u # 0.998 CPUs
> utilized
> 205,020 context-switches:u # 45.030 /sec
> 2 cpu-migrations:u # 0.000 /sec
> 204,221 page-faults:u # 44.854 /sec
> 7,285,176,204,764 cycles:u (7.4% faster) # 1.600 GHz
> 2,145,284,345,397 instructions:u (17.96% fewer) # 0.29 insn per
> cycle
> 10,799,382,011 branches:u # 2.372 M/sec
> 16,235,628 branch-misses:u # 0.15% of all
> branches
>
> (2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
>
> gcc/ChangeLog:
> PR target/11472
> * params.opt (--param=cycle-accurate-model=): New opt.
> * doc/invoke.texi (cycle-accurate-model): Document.
> * haifa-sched.cc (model_excess_group_cost): Return negative
> delta if param_cycle_accurate_model is 0.
> (model_excess_cost): Ceil negative baseECC to 0 only if
> param_cycle_accurate_model is 1.
> Dump the actual ECC value.
> * config/riscv/riscv.cc (riscv_option_override): Set param
> to 0.
>
> gcc/testsuite/ChangeLog:
> PR target/114729
> * gcc.target/riscv/riscv.exp: Enable new tests to build.
> * gcc.target/riscv/sched1-spills/spill1.cpp: Add new test.
>
> Signed-off-by: Vineet Gupta <[email protected]>
> ---
> gcc/config/riscv/riscv.cc | 4 +++
> gcc/doc/invoke.texi | 7 ++++
> gcc/haifa-sched.cc | 32 ++++++++++++++-----
> gcc/params.opt | 4 +++
> gcc/testsuite/gcc.target/riscv/riscv.exp | 2 ++
> .../gcc.target/riscv/sched1-spills/spill1.cpp | 32 +++++++++++++++++++
> 6 files changed, 73 insertions(+), 8 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index 2e9ac280c8f2..75fcadfc3b58 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -10549,6 +10549,10 @@ riscv_option_override (void)
> param_sched_pressure_algorithm,
> SCHED_PRESSURE_MODEL);
>
> + SET_OPTION_IF_UNSET (&global_options, &global_options_set,
> + param_cycle_accurate_model,
> + 0);
> +
> /* Function to allocate machine-dependent function status. */
> init_machine_status = &riscv_init_machine_status;
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 7146163d66d0..c1e07e258b25 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -17084,6 +17084,13 @@ With @option{--param=openacc-privatization=quiet},
> don't diagnose.
> This is the current default.
> With @option{--param=openacc-privatization=noisy}, do diagnose.
>
> +@item cycle-accurate-model
> +Specifies whether GCC should assume that the scheduling description is mostly
> +a cycle-accurate model of the target processor, where the code is intended to
> +run on, in the absence of cache misses. Nonzero means that the selected
> scheduling
> +model is accuate and likely describes an in-order processor, and that
> scheduling
> +will aggressively spill to try and fill any pipeline bubbles.
> +
> @end table
>
> The following choices of @var{name} are available on AArch64 targets:
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 02c893ec5cd3..cd4b6baddcd2 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -2398,11 +2398,18 @@ model_excess_group_cost (struct model_pressure_group
> *group,
> int pressure, cl;
>
> cl = ira_pressure_classes[pci];
> - if (delta < 0 && point >= group->limits[pci].point)
> + if (delta < 0)
> {
> - pressure = MAX (group->limits[pci].orig_pressure,
> - curr_reg_pressure[cl] + delta);
> - return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
> + if (point >= group->limits[pci].point)
> + {
> + pressure = MAX (group->limits[pci].orig_pressure,
> + curr_reg_pressure[cl] + delta);
> + return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
> + }
> + /* if target prefers fewer spills, return the -ve delta indicating
> + pressure reduction. */
> + else if (!param_cycle_accurate_model)
> + return delta;
> }
>
> if (delta > 0)
> @@ -2453,7 +2460,7 @@ model_excess_cost (rtx_insn *insn, bool print_p)
> }
>
> if (print_p)
> - fprintf (sched_dump, "\n");
> + fprintf (sched_dump, " ECC %d\n", cost);
>
> return cost;
> }
> @@ -2489,8 +2496,9 @@ model_set_excess_costs (rtx_insn **insns, int count)
> bool print_p;
>
> /* Record the baseECC value for each instruction in the model schedule,
> - except that negative costs are converted to zero ones now rather than
> - later. Do not assign a cost to debug instructions, since they must
> + except that for targets which prefer wider schedules (more spills)
> + negative costs are converted to zero ones now rather than later.
> + Do not assign a cost to debug instructions, since they must
> not change code-generation decisions. Experiments suggest we also
> get better results by not assigning a cost to instructions from
> a different block.
> @@ -2512,7 +2520,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> print_p = true;
> }
> cost = model_excess_cost (insns[i], print_p);
> - if (cost <= 0)
> + if (param_cycle_accurate_model && cost <= 0)
> {
> priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
> priority_base = MAX (priority_base, priority);
> @@ -2523,6 +2531,14 @@ model_set_excess_costs (rtx_insn **insns, int count)
> if (print_p)
> fprintf (sched_dump, MODEL_BAR);
>
> + /* Typically in-order cores have a good pipeline scheduling model and the
> + algorithm would try to use that to minimize bubbles, favoring spills.
> + MAX (baseECC, 0) below changes negative baseECC (pressure reduction)
> + to 0 (pressure neutral) thus tending to more spills.
> + Otherwise return. */
> + if (!param_cycle_accurate_model)
> + return;
> +
> /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
> instruction. */
> for (i = 0; i < count; i++)
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 7c572774df24..6d73993cd089 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -66,6 +66,10 @@ Enable asan stack protection.
> Common Joined UInteger Var(param_asan_use_after_return) Init(1)
> IntegerRange(0, 1) Param Optimization
> Enable asan detection of use-after-return bugs.
>
> +-param=cycle-accurate-model
> +Common Joined UInteger Var(param_cycle_accurate_model) Init(1)
> IntegerRange(0, 1) Param Optimization
> +Whether the scheduling description is mostly a cycle-accurate model of the
> target processor and is likely to be spill aggressively to fill any pipeline
> bubbles.
> +
> -param=hwasan-instrument-stack=
> Common Joined UInteger Var(param_hwasan_instrument_stack) Init(1)
> IntegerRange(0, 1) Param Optimization
> Enable hwasan instrumentation of statically sized stack-allocated variables.
> diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp
> b/gcc/testsuite/gcc.target/riscv/riscv.exp
> index 187eb6640470..3cbbf63b9d0a 100644
> --- a/gcc/testsuite/gcc.target/riscv/riscv.exp
> +++ b/gcc/testsuite/gcc.target/riscv/riscv.exp
> @@ -38,6 +38,8 @@ dg-init
> # Main loop.
> gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*.\[cS\]]] \
> "" $DEFAULT_CFLAGS
> +gcc-dg-runtest [lsort [glob -nocomplain
> $srcdir/$subdir/sched1-spills/*.{\[cS\],cpp}]] \
> + "" $DEFAULT_CFLAGS
>
> # All done.
> dg-finish
> diff --git a/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
> b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
> new file mode 100644
> index 000000000000..8060ec245281
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
> @@ -0,0 +1,32 @@
> +/* { dg-options "-O2 -march=rv64gc -mabi=lp64d -save-temps -fverbose-asm" }
> */
> +/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */
> +
> +/* Reduced from SPEC2017 Cactu ML_BSSN_Advect.cpp
> + by comparing -fschedule-insn and -fno-schedule-insns builds.
> + Shows up one extra spill (pair of spill markers "sfp") in verbose asm
> + output which the patch fixes. */
> +
> +void s();
> +double b, c, d, e, f, g, h, k, l, m, n, o, p, q, t, u, v;
> +int *j;
> +double *r, *w;
> +long x;
> +void y() {
> + double *a((double *)s);
> + for (;;)
> + for (; j[1];)
> + for (int i = 1; i < j[0]; i++) {
> + k = l;
> + m = n;
> + o = p = q;
> + r[0] = t;
> + a[0] = u;
> + x = g;
> + e = f;
> + v = w[x];
> + b = c;
> + d = h;
> + }
> +}
> +
> +/* { dg-final { scan-assembler-not "%sfp" } } */