Hi all,

This small patch series addresses a shortcoming of the scheduler: it
currently identifies and preserves fusible instruction pairs (as
determined by the TARGET_SCHED_MACRO_FUSION_PAIR_P hook) only when they
are already consecutive in the instruction stream.  It does no
reordering to form new pairs, which leaves some performance on the
table.

The solution is a new RTL pass that uses the RTL-SSA framework to
identify instructions whose result has a single use and that are fusible
with that use (according to the aforementioned hook), then reorders them
to be issued back to back, forming new fused pairs, as long as program
semantics remain unchanged.  The pass also sets the SCHED_GROUP flag on
the second instruction of each new pair, to be consumed by later
passes, including the scheduler.
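
To make the reordering idea concrete, here is a minimal, self-contained
sketch on a toy straight-line block.  It is not the code from
dep-fusion.cc and does not use RTL-SSA: the Insn struct, the helper
names and the caller-supplied fusible predicate (standing in for
TARGET_SCHED_MACRO_FUSION_PAIR_P) are all hypothetical, and the model
ignores memory, volatile accesses and control flow.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for an insn (hypothetical, not GCC's rtx_insn).
struct Insn {
  std::string op;
  int dest;               // register written, -1 if none
  std::vector<int> srcs;  // registers read
  bool sched_group;       // models the SCHED_GROUP flag
};

using FusiblePred = std::function<bool(const Insn &, const Insn &)>;
static const std::size_t npos = static_cast<std::size_t>(-1);

static bool reads(const Insn &i, int reg) {
  return std::find(i.srcs.begin(), i.srcs.end(), reg) != i.srcs.end();
}

// Index of the only insn reading block[d].dest before it is redefined,
// or npos if there is no such insn or more than one.
static std::size_t find_single_use(const std::vector<Insn> &block,
                                   std::size_t d) {
  const int reg = block[d].dest;
  if (reg == -1) return npos;
  std::size_t use = npos;
  for (std::size_t j = d + 1; j < block.size(); ++j) {
    if (reads(block[j], reg)) {
      if (use != npos) return npos;  // second use found
      use = j;
    }
    if (block[j].dest == reg) break;  // value redefined here
  }
  return use;
}

// Can def be moved below mid without changing register dataflow?
static bool can_move_past(const Insn &def, const Insn &mid) {
  if (mid.dest != -1 && reads(def, mid.dest)) return false;  // mid clobbers an input
  if (def.dest != -1 && reads(mid, def.dest)) return false;  // mid needs def's result
  if (def.dest != -1 && mid.dest == def.dest) return false;  // both write same reg
  return true;
}

// Sink each eligible def to sit right before its single use and flag the
// use.  Returns the number of pairs formed.
std::size_t form_fused_pairs(std::vector<Insn> &block,
                             const FusiblePred &fusible) {
  std::size_t formed = 0;
  for (std::size_t d = 0; d < block.size();) {
    const std::size_t use = find_single_use(block, d);
    bool ok = !block[d].sched_group && use != npos &&
              !block[use].sched_group && fusible(block[d], block[use]);
    for (std::size_t m = d + 1; ok && m < use; ++m)
      ok = can_move_past(block[d], block[m]);
    if (!ok) { ++d; continue; }
    const bool moved = (use != d + 1);
    const auto first = block.begin() + static_cast<std::ptrdiff_t>(d);
    // Rotate the def down so that it ends up at index use - 1.
    std::rotate(first, first + 1,
                block.begin() + static_cast<std::ptrdiff_t>(use));
    block[use].sched_group = true;
    ++formed;
    if (!moved) ++d;  // after a move, index d holds a not-yet-visited insn
  }
  return formed;
}
```

With a predicate that accepts an "slli" followed by an "add", the block
{slli r1 <- r2; mul r3 <- r4,r5; add r6 <- r1,r3} becomes {mul; slli;
add} with the flag set on the add.  If the middle instruction overwrote
r2 instead, the slli could not be sunk past it and the block would be
left as is.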

For some of the newly formed instruction pairs, the fused macro-op needs
to be a single-output operation from the HW perspective.  This means
that both instructions of such a pair should be assigned the same hard
register as their output.  To this end, patches 2 and 3 implement
special handling for such pairs in IRA and regrename, respectively, so
that this single-output property is preserved throughout the RTL
pipeline.
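
The single-output property itself can be illustrated with a small toy
model.  This is only a sketch of the property, not of how IRA or
regrename achieve it: the FusedInsn struct and both functions are
hypothetical, and the sketch assumes the two instructions are already
adjacent and that the first one's result has no other users.

```cpp
#include <algorithm>
#include <vector>

// Toy insn of an already adjacent fused pair (hypothetical type).
struct FusedInsn {
  int dest;               // hard register written
  std::vector<int> srcs;  // hard registers read
};

// The pair is single-output when both insns write the same register.
bool is_single_output(const FusedInsn &first, const FusedInsn &second) {
  return first.dest == second.dest;
}

// Try to make the pair single-output by retargeting the first insn's
// result to the second insn's destination.  Returns false, leaving the
// pair untouched, when that would change what the second insn computes.
bool tie_outputs(FusedInsn &first, FusedInsn &second) {
  if (is_single_output(first, second)) return true;
  const int tmp = first.dest;
  const int out = second.dest;
  const auto b = second.srcs.begin();
  const auto e = second.srcs.end();
  // The second insn must actually consume the first insn's result.
  if (std::find(b, e, tmp) == e) return false;
  // If the second insn also reads the old value of `out`, writing `out`
  // early in the first insn would clobber that input.
  if (std::find(b, e, out) != e) return false;
  first.dest = out;
  std::replace(second.srcs.begin(), second.srcs.end(), tmp, out);
  return true;
}
```

For example, the pair {r1 <- f(r2); r6 <- g(r1, r3)} becomes
{r6 <- f(r2); r6 <- g(r6, r3)}, whereas {r1 <- f(r2); r6 <- g(r1, r6)}
is rejected because the second instruction still needs the old r6.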

This was initially conceived for small but fusion-capable RISC-V cores,
but the benefit can also be demonstrated on other architectures; I have
used a Cortex-A53 for performance measurements.  On SPECINT2006
(excluding 429.mcf, as I don't have enough RAM for it), compiled with
-O2, the cycle counts change as follows (a negative delta means fewer
cycles, i.e. an improvement):

+--------------------+----------+----------+--------+
|  benchmark (wl #)  |  before  |  after   | delta  |
+--------------------+----------+----------+--------+
| 400.perlbench (0)  | 1319842M | 1323976M | 0.31%  |
| 400.perlbench (1)  | 476027M  | 472280M  | -0.79% |
| 400.perlbench (2)  | 727535M  | 723317M  | -0.58% |
| 401.bzip2 (0)      | 715395M  | 705343M  | -1.41% |
| 401.bzip2 (1)      | 316476M  | 309248M  | -2.28% |
| 401.bzip2 (2)      | 735357M  | 714107M  | -2.89% |
| 401.bzip2 (3)      | 716004M  | 709522M  | -0.91% |
| 401.bzip2 (4)      | 1052816M | 1033297M | -1.85% |
| 401.bzip2 (5)      | 582177M  | 567665M  | -2.49% |
| 403.gcc (0)        | 205247M  | 195970M  | -4.52% |
| 403.gcc (1)        | 293817M  | 286668M  | -2.43% |
| 403.gcc (2)        | 326197M  | 316132M  | -3.09% |
| 403.gcc (3)        | 209453M  | 204784M  | -2.23% |
| 403.gcc (4)        | 254564M  | 251744M  | -1.11% |
| 403.gcc (5)        | 362228M  | 360525M  | -0.47% |
| 403.gcc (6)        | 480274M  | 480480M  | 0.04%  |
| 403.gcc (7)        | 430524M  | 435792M  | 1.22%  |
| 403.gcc (8)        | 102518M  | 103326M  | 0.79%  |
| 445.gobmk (0)      | 114281M  | 115428M  | 1.00%  |
| 445.gobmk (1)      | 300708M  | 303865M  | 1.05%  |
| 445.gobmk (2)      | 153852M  | 155581M  | 1.12%  |
| 445.gobmk (3)      | 114635M  | 116088M  | 1.27%  |
| 445.gobmk (4)      | 159135M  | 160993M  | 1.17%  |
| 456.hmmer (0)      | 957695M  | 957720M  | 0.00%  |
| 456.hmmer (1)      | 1831164M | 1829819M | -0.07% |
| 458.sjeng (0)      | 3122158M | 3114042M | -0.26% |
| 462.libquantum (0) | 2815365M | 2815359M | -0.00% |
| 464.h264ref (0)    | 492320M  | 491683M  | -0.13% |
| 464.h264ref (1)    | 368700M  | 368652M  | -0.01% |
| 464.h264ref (2)    | 3392154M | 3390470M | -0.05% |
| 471.omnetpp (0)    | 2800115M | 2716493M | -2.99% |
| 473.astar (0)      | 1279345M | 1211472M | -5.31% |
| 473.astar (1)      | 1652526M | 1619623M | -1.99% |
| 483.xalancbmk (0)  | 1969041M | 1989654M | 1.05%  |
+--------------------+----------+----------+--------+

All patches have been bootstrapped and regtested on i386, x86_64, and
aarch64, and additionally regtested on riscv32.

Artemiy Volkov (3):
  gcc: introduce the dep_fusion pass
  ira: tie output allocnos for fused instruction pairs
  regrename: treat writes as reads for fused instruction pairs

 gcc/Makefile.in      |   1 +
 gcc/common.opt       |   4 ++
 gcc/dep-fusion.cc    | 148 +++++++++++++++++++++++++++++++++++++++++++
 gcc/doc/invoke.texi  |  15 ++++-
 gcc/ira-conflicts.cc |  12 +++-
 gcc/opts.cc          |   1 +
 gcc/passes.def       |   2 +
 gcc/regrename.cc     |  10 ++-
 gcc/rtl.h            |   1 +
 gcc/rtlanal.cc       |  20 ++++++
 gcc/tree-pass.h      |   1 +
 11 files changed, 209 insertions(+), 6 deletions(-)
 create mode 100644 gcc/dep-fusion.cc

-- 
2.43.0
