Hi!
On 2024/06/23 1:49, Richard Sandiford wrote:
Takayuki 'January June' Suwa <jjsuwa_sys3...@yahoo.co.jp> writes:
On 2024/06/20 22:34, Richard Sandiford wrote:
This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.
The pass currently has a single objective: remove definitions by
substituting into all uses. The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.
The patch fixes PR106594. It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.
This is just a first step. I'm hoping that the pass could be
used for other combine-related optimisations in future. In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure. If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.
On most targets, the pass is enabled by default at -O2 and above.
However, it has a tendency to undo x86's STV and RPAD passes,
by folding the more complex post-STV/RPAD form back into the
simpler pre-pass form.
Also, running a pass after register allocation means that we can
now match define_insn_and_splits that were previously only matched
before register allocation. This trips things like:
(define_insn_and_split "..."
[...pattern...]
"...cond..."
"#"
"&& 1"
[...pattern...]
{
...unconditional use of gen_reg_rtx ()...;
}
because matching and splitting after RA will call gen_reg_rtx when
pseudos are no longer allowed. rs6000 has several instances of this.
xtensa also has something like that.
xtensa has a variation in which the split condition is:
"&& can_create_pseudo_p ()"
The failure then is that, if we match after RA, we'll never be
able to split the instruction.
To be honest, I'm confusing by the possibility of adding a split pattern
application opportunity that depends on the optimization options after
Rel... ah, LRA and before the existing rtl-split2.
Because I just recently submitted a patch that I expected would reliably
(i.e. regardless of optimization options, etc.) apply the split pattern
first in the rtl-split2 pass after RA, and it was merged.
The patch therefore disables the pass by default on i386, rs6000
and xtensa. Hopefully we can fix those ports later (if their
maintainers want). It seems easier to add the pass first, though,
to make it easier to test any such fixes.
gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
quite a few updates for the late-combine output. That might be
worth doing, but it seems too complex to do as part of this patch.
I tried compiling at least one target per CPU directory and comparing
the assembly output for parts of the GCC testsuite. This is just a way
of getting a flavour of how the pass performs; it obviously isn't a
meaningful benchmark. All targets seemed to improve on average:
Target Tests Good Bad %Good Delta Median
====== ===== ==== === ===== ===== ======
aarch64-linux-gnu 2215 1975 240 89.16% -4159 -1
aarch64_be-linux-gnu 1569 1483 86 94.52% -10117 -1
alpha-linux-gnu 1454 1370 84 94.22% -9502 -1
amdgcn-amdhsa 5122 4671 451 91.19% -35737 -1
arc-elf 2166 1932 234 89.20% -37742 -1
arm-linux-gnueabi 1953 1661 292 85.05% -12415 -1
arm-linux-gnueabihf 1834 1549 285 84.46% -11137 -1
avr-elf 4789 4330 459 90.42% -441276 -4
bfin-elf 2795 2394 401 85.65% -19252 -1
bpf-elf 3122 2928 194 93.79% -8785 -1
c6x-elf 2227 1929 298 86.62% -17339 -1
cris-elf 3464 3270 194 94.40% -23263 -2
csky-elf 2915 2591 324 88.89% -22146 -1
epiphany-elf 2399 2304 95 96.04% -28698 -2
fr30-elf 7712 7299 413 94.64% -99830 -2
frv-linux-gnu 3332 2877 455 86.34% -25108 -1
ft32-elf 2775 2667 108 96.11% -25029 -1
h8300-elf 3176 2862 314 90.11% -29305 -2
hppa64-hp-hpux11.23 4287 4247 40 99.07% -45963 -2
ia64-linux-gnu 2343 1946 397 83.06% -9907 -2
iq2000-elf 9684 9637 47 99.51% -126557 -2
lm32-elf 2681 2608 73 97.28% -59884 -3
loongarch64-linux-gnu 1303 1218 85 93.48% -13375 -2
m32r-elf 1626 1517 109 93.30% -9323 -2
m68k-linux-gnu 3022 2620 402 86.70% -21531 -1
mcore-elf 2315 2085 230 90.06% -24160 -1
microblaze-elf 2782 2585 197 92.92% -16530 -1
mipsel-linux-gnu 1958 1827 131 93.31% -15462 -1
mipsisa64-linux-gnu 1655 1488 167 89.91% -16592 -2
mmix 4914 4814 100 97.96% -63021 -1
mn10300-elf 3639 3320 319 91.23% -34752 -2
moxie-rtems 3497 3252 245 92.99% -87305 -3
msp430-elf 4353 3876 477 89.04% -23780 -1
nds32le-elf 3042 2780 262 91.39% -27320 -1
nios2-linux-gnu 1683 1355 328 80.51% -8065 -1
nvptx-none 2114 1781 333 84.25% -12589 -2
or1k-elf 3045 2699 346 88.64% -14328 -2
pdp11 4515 4146 369 91.83% -26047 -2
pru-elf 1585 1245 340 78.55% -5225 -1
riscv32-elf 2122 2000 122 94.25% -101162 -2
riscv64-elf 1841 1726 115 93.75% -49997 -2
rl78-elf 2823 2530 293 89.62% -40742 -4
rx-elf 2614 2480 134 94.87% -18863 -1
s390-linux-gnu 1591 1393 198 87.55% -16696 -1
s390x-linux-gnu 2015 1879 136 93.25% -21134 -1
sh-linux-gnu 1870 1507 363 80.59% -9491 -1
sparc-linux-gnu 1123 1075 48 95.73% -14503 -1
sparc-wrs-vxworks 1121 1073 48 95.72% -14578 -1
sparc64-linux-gnu 1096 1021 75 93.16% -15003 -1
v850-elf 1897 1728 169 91.09% -11078 -1
vax-netbsdelf 3035 2995 40 98.68% -27642 -1
visium-elf 1392 1106 286 79.45% -7984 -2
xstormy16-elf 2577 2071 506 80.36% -13061 -1
** snip **
To be more frank, once a split pattern is defined, it is applied by
the existing five split paths and possibly by combiners. In most cases,
it is enough to apply it in one of these places, or that is what the
pattern creator intended.
Wouldn't applying the split pattern indiscriminately in various places
be a waste of execution resources and bring about unexpected and undesired
results?
I think we need some way to properly control the application of the split
pattern, perhaps some predicate function.
The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part. The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2. (If the split condition allows it, of course.)
But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload. And that's what causes the
issue. If:
(define_insn_and_split "..."
[...pattern...]
"...cond..."
"#"
"&& 1"
[...pattern...]
{
...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.
Similarly, if:
(define_insn_and_split "..."
[...pattern...]
"...cond..."
"#"
"&& can_create_pseudo_p ()"
[...pattern...]
{
...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes. So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".
The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using match_scratches.
rs6000 has many examples of this. E.g.:
(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
[(set (match_operand:IEEE128 0 "register_operand" "=wa")
(neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
(clobber (match_scratch:V16QI 2 "=v"))]
"TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
"#"
"&& 1"
[(parallel [(set (match_dup 0)
(neg:IEEE128 (match_dup 1)))
(use (match_dup 2))])]
{
if (GET_CODE (operands[2]) == SCRATCH)
operands[2] = gen_reg_rtx (V16QImode);
emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
[(set_attr "length" "8")
(set_attr "type" "vecsimple")])
Before RA, this is just:
(set ...)
(clobber (scratch:V16QI))
and the split creates a new register. After RA, operand 2 provides
the required temporary register:
(set ...)
(clobber (reg:V16QI TMP))
Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition). But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.
Thanks,
Richard
Ah, I see, I understand the standard idiom for synchronizing RA between
define_insn and split conditions(, I guess). However, as I wrote before,
it is true that split paths are useful for other purposes as well.
If this is unacceptable practice, then an alternative solution would be
to add a target-specific path (in my case, after postreload and before
rtl-late_combine2).
That is obviously a very involved process.