On 7/27/25 3:35 AM, Artemiy Volkov wrote:
Hi all,
This small patch series is intended to address a shortcoming of the
scheduler, which currently only identifies and preserves fusible
instruction pairs (according to the value of the
TARGET_SCHED_MACRO_FUSION_PAIR_P hook) that are already consecutive in
the instruction stream, but does not do any reordering to form
new ones, which means leaving some performance on the table.
The solution here is to implement a new RTL pass, which uses the RTL-SSA
framework to identify single-use instructions that are fusible together
with their uses (using the aforementioned hook), then reorders those to
be issued back to back, forming new fused pairs (as long as program
semantics remain unchanged). This pass also sets the SCHED_GROUP flag
for the second instruction of these new pairs, to be consumed by other
passes, including the scheduler.
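
To make the reordering idea concrete, here is a minimal sketch on a toy
straight-line block. It is not the patch's code and does not use RTL-SSA:
the Insn class, the fusible() predicate (a hypothetical stand-in for
TARGET_SCHED_MACRO_FUSION_PAIR_P), and the sched_group field are all
assumptions made for illustration. It also assumes registers are not
live out of the block and that only ALU producers (no memory access)
are ever moved.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)  # identity comparison, so equal-looking insns stay distinct
class Insn:
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    sched_group: bool = False  # toy analogue of the SCHED_GROUP flag


def fusible(first: Insn, second: Insn) -> bool:
    # Hypothetical target predicate: an address computation feeding a load.
    return (first.op in ("add", "shadd") and second.op == "load"
            and first.dest in second.srcs)


def form_fusion_pairs(block):
    insns = list(block)
    for producer in list(insns):
        # Only ALU producers are moved, so memory ordering is never at stake.
        if producer.dest is None or producer.op not in ("add", "shadd"):
            continue
        p = insns.index(producer)
        # Collect reads of the producer's result up to its next redefinition.
        uses = []
        for j in range(p + 1, len(insns)):
            if producer.dest in insns[j].srcs:
                uses.append(j)
            if insns[j].dest == producer.dest:
                break
        if len(uses) != 1:
            continue  # not a single-use definition
        u = uses[0]
        consumer = insns[u]
        if consumer.sched_group or not fusible(producer, consumer):
            continue
        # Sinking the producer is safe only if nothing in between overwrites
        # its inputs or its output.
        between = insns[p + 1:u]
        if any(x.dest in producer.srcs or x.dest == producer.dest
               for x in between):
            continue
        del insns[p]
        insns.insert(u - 1, producer)  # now directly before the consumer
        consumer.sched_group = True    # mark the second insn of the pair
    return insns


block = [
    Insn("shadd", "t0", ("a0", "a1")),
    Insn("add", "t1", ("a2", "a3")),
    Insn("load", "t2", ("t0",)),
]
result = form_fusion_pairs(block)
print([i.op for i in result])
```

In this toy block the shadd is sunk past the unrelated add so that it
sits directly before the load that consumes it, and the load is marked
as the second instruction of the pair.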
For some of the newly formed instruction pairs, the fused macro-op needs
to be a single-output operation from the HW perspective. This means
that we want to make sure that the resulting hardware instructions have
the same hard register as their output. To this end, patches 2 and 3
implement special handling for such pairs in IRA and regrename,
respectively, so that this single-output property is preserved
throughout the RTL pipeline.
This was initially conceived for small but fusion-capable RISC-V cores,
but the benefit can also be demonstrated on other architectures. To
wit, I have used a Cortex-A53 for performance measurements. On
SPECINT2006 (excluding 429.mcf as I don't have enough RAM for it), when
compiled with -O2, the changes in cycle measurements are as follows:
[ ... ]
Just FTR, we played with earlier versions of Artemiy's work at Ventana.
From a static standpoint it was finding notably more fusion
opportunities, particularly in the cases where we can fuse add/shadd
instructions computing addresses for load instructions.
While the runtime behavior went the wrong way, I've always strongly
suspected it was either my misreading of our internal docs on what can
be fused or the hardware not fusing something it was supposed to.
We've recently found a couple of cases of the former, and we've found a
HW design correctness issue in this space as well. I just haven't had
the chance to rerun spec2017 with Artemiy's work now that those issues
are addressed on our side.
Just to be clear, I broadly support Artemiy's work. I think it is quite
promising and hope we can get it reviewed and integrated. Just wanted
to pass along an additional datapoint for anyone looking at this stuff.
jeff