Hi all,

This small patch series addresses a shortcoming of the scheduler. It
currently identifies and preserves fusible instruction pairs (as determined
by the TARGET_SCHED_MACRO_FUSION_PAIR_P hook) only when they are already
consecutive in the instruction stream. It does no reordering to form new
pairs, which leaves some performance on the table.

The solution is a new RTL pass that uses the RTL-SSA framework to identify
single-use instructions that are fusible with their uses (according to the
aforementioned hook), and then reorders them so that they are issued back to
back, forming new fused pairs (as long as program semantics remain
unchanged). The pass also sets the SCHED_GROUP flag on the second
instruction of each new pair, to be consumed by later passes, including the
scheduler.

For some of the newly formed pairs, the fused macro-op needs to be a
single-output operation from the HW perspective. This means that both
instructions of the pair must end up with the same hard register as their
output. To this end, patches 2 and 3 implement special handling for such
pairs in IRA and regrename, respectively, so that this single-output
property is preserved throughout the RTL pipeline.

This was initially conceived for small but fusion-capable RISC-V cores, but
the benefit can also be demonstrated on other architectures. To wit, I have
used a Cortex-A53 for performance measurements.

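To illustrate the reordering idea, here is a minimal sketch on a toy IR. It
is not the code in dep-fusion.cc and does not use RTL-SSA; the Insn struct,
fusible_pair_p (a stand-in for TARGET_SCHED_MACRO_FUSION_PAIR_P that only
knows a hypothetical slli+add pair), and form_fusion_pairs are names made up
for this example. It assumes a single basic block, register-only
instructions with no memory accesses or side effects, and that a value not
redefined in the block is dead after it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction: one destination register and a list of source registers.
struct Insn {
    std::string op;
    int dest;                  // register written, or -1 if none
    std::vector<int> srcs;     // registers read
    bool sched_group = false;  // stand-in for the SCHED_GROUP flag
};

// Hypothetical stand-in for the target hook: only slli followed by add fuses.
static bool fusible_pair_p(const Insn &prev, const Insn &curr) {
    return prev.op == "slli" && curr.op == "add";
}

static bool reads(const Insn &insn, int reg) {
    return std::find(insn.srcs.begin(), insn.srcs.end(), reg) != insn.srcs.end();
}

// Try to sink the definition at index d so that it sits right before its
// only use. Returns true if a new pair was formed.
static bool try_fuse(std::vector<Insn> &bb, std::size_t d) {
    assert(d < bb.size());
    const int reg = bb[d].dest;
    if (reg < 0 || bb[d].sched_group)
        return false;  // no result, or already the second half of a pair

    // Find the readers of the value until it is redefined.
    std::size_t use = bb.size();
    int nuses = 0;
    for (std::size_t i = d + 1; i < bb.size(); ++i) {
        if (reads(bb[i], reg) && ++nuses == 1)
            use = i;
        if (bb[i].dest == reg)
            break;  // value is overwritten here
    }
    // Need exactly one use; already adjacent pairs are left to the scheduler.
    if (nuses != 1 || use == d + 1)
        return false;
    if (bb[use].sched_group || !fusible_pair_p(bb[d], bb[use]))
        return false;

    // Semantics check: nothing in between may overwrite a source of the def.
    for (std::size_t i = d + 1; i < use; ++i)
        if (bb[i].dest >= 0 && reads(bb[d], bb[i].dest))
            return false;

    // Move bb[d] to index use - 1; the instructions in between shift up by one.
    std::rotate(bb.begin() + d, bb.begin() + d + 1, bb.begin() + use);
    bb[use].sched_group = true;
    return true;
}

// Form as many new pairs as possible; returns the number of pairs formed.
static std::size_t form_fusion_pairs(std::vector<Insn> &bb) {
    std::size_t formed = 0;
    std::size_t d = 0;
    while (d < bb.size()) {
        if (try_fuse(bb, d))
            ++formed;  // a different instruction now sits at d: re-examine it
        else
            ++d;
    }
    return formed;
}
```

Given the block "slli r5, r11; addi r12, r12; add r5, r5, r10", this sketch
moves the slli down past the unrelated addi and marks the add with the
group flag. If the addi wrote r11 instead, the move would be rejected.
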
On SPECINT2006 (excluding 429.mcf as I don't have enough RAM for it), when
compiled with -O2, the changes in cycle measurements are as follows:

+--------------------+----------+----------+--------+
| benchmark (wl #)   | before   | after    | delta  |
+--------------------+----------+----------+--------+
| 400.perlbench (0)  | 1319842M | 1323976M |  0.31% |
| 400.perlbench (1)  |  476027M |  472280M | -0.79% |
| 400.perlbench (2)  |  727535M |  723317M | -0.58% |
| 401.bzip2 (0)      |  715395M |  705343M | -1.41% |
| 401.bzip2 (1)      |  316476M |  309248M | -2.28% |
| 401.bzip2 (2)      |  735357M |  714107M | -2.89% |
| 401.bzip2 (3)      |  716004M |  709522M | -0.91% |
| 401.bzip2 (4)      | 1052816M | 1033297M | -1.85% |
| 401.bzip2 (5)      |  582177M |  567665M | -2.49% |
| 403.gcc (0)        |  205247M |  195970M | -4.52% |
| 403.gcc (1)        |  293817M |  286668M | -2.43% |
| 403.gcc (2)        |  326197M |  316132M | -3.09% |
| 403.gcc (3)        |  209453M |  204784M | -2.23% |
| 403.gcc (4)        |  254564M |  251744M | -1.11% |
| 403.gcc (5)        |  362228M |  360525M | -0.47% |
| 403.gcc (6)        |  480274M |  480480M |  0.04% |
| 403.gcc (7)        |  430524M |  435792M |  1.22% |
| 403.gcc (8)        |  102518M |  103326M |  0.79% |
| 445.gobmk (0)      |  114281M |  115428M |  1.00% |
| 445.gobmk (1)      |  300708M |  303865M |  1.05% |
| 445.gobmk (2)      |  153852M |  155581M |  1.12% |
| 445.gobmk (3)      |  114635M |  116088M |  1.27% |
| 445.gobmk (4)      |  159135M |  160993M |  1.17% |
| 456.hmmer (0)      |  957695M |  957720M |  0.00% |
| 456.hmmer (1)      | 1831164M | 1829819M | -0.07% |
| 458.sjeng (0)      | 3122158M | 3114042M | -0.26% |
| 462.libquantum (0) | 2815365M | 2815359M | -0.00% |
| 464.h264ref (0)    |  492320M |  491683M | -0.13% |
| 464.h264ref (1)    |  368700M |  368652M | -0.01% |
| 464.h264ref (2)    | 3392154M | 3390470M | -0.05% |
| 471.omnetpp (0)    | 2800115M | 2716493M | -2.99% |
| 473.astar (0)      | 1279345M | 1211472M | -5.31% |
| 473.astar (1)      | 1652526M | 1619623M | -1.99% |
| 483.xalancbmk (0)  | 1969041M | 1989654M |  1.05% |
+--------------------+----------+----------+--------+

All patches have been bootstrapped and regtested on i386, x86_64, and
aarch64, and additionally regtested on riscv32.

Artemiy Volkov (3):
  gcc: introduce the dep_fusion pass
  ira: tie output allocnos for fused instruction pairs
  regrename: treat writes as reads for fused instruction pairs

 gcc/Makefile.in      |   1 +
 gcc/common.opt       |   4 ++
 gcc/dep-fusion.cc    | 148 +++++++++++++++++++++++++++++++++++++++++++
 gcc/doc/invoke.texi  |  15 ++++-
 gcc/ira-conflicts.cc |  12 +++-
 gcc/opts.cc          |   1 +
 gcc/passes.def       |   2 +
 gcc/regrename.cc     |  10 ++-
 gcc/rtl.h            |   1 +
 gcc/rtlanal.cc       |  20 ++++++
 gcc/tree-pass.h      |   1 +
 11 files changed, 209 insertions(+), 6 deletions(-)
 create mode 100644 gcc/dep-fusion.cc

-- 
2.43.0