On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Takayuki 'January June' Suwa <jjsuwa_sys3...@yahoo.co.jp> writes:
> > On 2024/06/20 22:34, Richard Sandiford wrote:
> >> This patch adds a combine pass that runs late in the pipeline.
> >> There are two instances: one between combine and split1, and one
> >> after postreload.
> >>
> >> The pass currently has a single objective: remove definitions by
> >> substituting into all uses.  The pre-RA version tries to restrict
> >> itself to cases that are likely to have a neutral or beneficial
> >> effect on register pressure.
> >>
> >> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> >> in the aarch64 test results, mostly due to making proper use of
> >> MOVPRFX in cases where we didn't previously.
> >>
> >> This is just a first step.  I'm hoping that the pass could be
> >> used for other combine-related optimisations in future.  In particular,
> >> the post-RA version doesn't need to restrict itself to cases where all
> >> uses are substitutable, since it doesn't have to worry about register
> >> pressure.  If we did that, and if we extended it to handle multi-register
> >> REGs, the pass might be a viable replacement for regcprop, which in
> >> turn might reduce the cost of having a post-RA instance of the new pass.
> >>
> >> On most targets, the pass is enabled by default at -O2 and above.
> >> However, it has a tendency to undo x86's STV and RPAD passes,
> >> by folding the more complex post-STV/RPAD form back into the
> >> simpler pre-pass form.
> >>
> >> Also, running a pass after register allocation means that we can
> >> now match define_insn_and_splits that were previously only matched
> >> before register allocation.  This trips things like:
> >>
> >>   (define_insn_and_split "..."
> >>     [...pattern...]
> >>     "...cond..."
> >>     "#"
> >>     "&& 1"
> >>     [...pattern...]
> >>     {
> >>       ...unconditional use of gen_reg_rtx ()...;
> >>     }
> >>
> >> because matching and splitting after RA will call gen_reg_rtx when
> >> pseudos are no longer allowed.  rs6000 has several instances of this.
> >
> > xtensa also has something like that.
> >
> >> xtensa has a variation in which the split condition is:
> >>
> >>   "&& can_create_pseudo_p ()"
> >>
> >> The failure then is that, if we match after RA, we'll never be
> >> able to split the instruction.
> >
> > To be honest, I'm confused by the possibility of adding a split pattern
> > application opportunity that depends on the optimization options after
> > Rel... ah, LRA and before the existing rtl-split2.
> >
> > Because I just recently submitted a patch that I expected would reliably
> > (i.e. regardless of optimization options, etc.) apply the split pattern
> > first in the rtl-split2 pass after RA, and it was merged.
> >
> >> The patch therefore disables the pass by default on i386, rs6000
> >> and xtensa.  Hopefully we can fix those ports later (if their
> >> maintainers want).  It seems easier to add the pass first, though,
> >> to make it easier to test any such fixes.
> >>
> >> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> >> quite a few updates for the late-combine output.  That might be
> >> worth doing, but it seems too complex to do as part of this patch.
> >>
> >> I tried compiling at least one target per CPU directory and comparing
> >> the assembly output for parts of the GCC testsuite.  This is just a way
> >> of getting a flavour of how the pass performs; it obviously isn't a
> >> meaningful benchmark.
> >> All targets seemed to improve on average:
> >>
> >>   Target                 Tests  Good   Bad  %Good     Delta  Median
> >>   ======                 =====  ====   ===  =====     =====  ======
> >>   aarch64-linux-gnu       2215  1975   240  89.16%    -4159      -1
> >>   aarch64_be-linux-gnu    1569  1483    86  94.52%   -10117      -1
> >>   alpha-linux-gnu         1454  1370    84  94.22%    -9502      -1
> >>   amdgcn-amdhsa           5122  4671   451  91.19%   -35737      -1
> >>   arc-elf                 2166  1932   234  89.20%   -37742      -1
> >>   arm-linux-gnueabi       1953  1661   292  85.05%   -12415      -1
> >>   arm-linux-gnueabihf     1834  1549   285  84.46%   -11137      -1
> >>   avr-elf                 4789  4330   459  90.42%  -441276      -4
> >>   bfin-elf                2795  2394   401  85.65%   -19252      -1
> >>   bpf-elf                 3122  2928   194  93.79%    -8785      -1
> >>   c6x-elf                 2227  1929   298  86.62%   -17339      -1
> >>   cris-elf                3464  3270   194  94.40%   -23263      -2
> >>   csky-elf                2915  2591   324  88.89%   -22146      -1
> >>   epiphany-elf            2399  2304    95  96.04%   -28698      -2
> >>   fr30-elf                7712  7299   413  94.64%   -99830      -2
> >>   frv-linux-gnu           3332  2877   455  86.34%   -25108      -1
> >>   ft32-elf                2775  2667   108  96.11%   -25029      -1
> >>   h8300-elf               3176  2862   314  90.11%   -29305      -2
> >>   hppa64-hp-hpux11.23     4287  4247    40  99.07%   -45963      -2
> >>   ia64-linux-gnu          2343  1946   397  83.06%    -9907      -2
> >>   iq2000-elf              9684  9637    47  99.51%  -126557      -2
> >>   lm32-elf                2681  2608    73  97.28%   -59884      -3
> >>   loongarch64-linux-gnu   1303  1218    85  93.48%   -13375      -2
> >>   m32r-elf                1626  1517   109  93.30%    -9323      -2
> >>   m68k-linux-gnu          3022  2620   402  86.70%   -21531      -1
> >>   mcore-elf               2315  2085   230  90.06%   -24160      -1
> >>   microblaze-elf          2782  2585   197  92.92%   -16530      -1
> >>   mipsel-linux-gnu        1958  1827   131  93.31%   -15462      -1
> >>   mipsisa64-linux-gnu     1655  1488   167  89.91%   -16592      -2
> >>   mmix                    4914  4814   100  97.96%   -63021      -1
> >>   mn10300-elf             3639  3320   319  91.23%   -34752      -2
> >>   moxie-rtems             3497  3252   245  92.99%   -87305      -3
> >>   msp430-elf              4353  3876   477  89.04%   -23780      -1
> >>   nds32le-elf             3042  2780   262  91.39%   -27320      -1
> >>   nios2-linux-gnu         1683  1355   328  80.51%    -8065      -1
> >>   nvptx-none              2114  1781   333  84.25%   -12589      -2
> >>   or1k-elf                3045  2699   346  88.64%   -14328      -2
> >>   pdp11                   4515  4146   369  91.83%   -26047      -2
> >>   pru-elf                 1585  1245   340  78.55%    -5225      -1
> >>   riscv32-elf             2122  2000   122  94.25%  -101162      -2
> >>   riscv64-elf             1841  1726   115  93.75%   -49997      -2
> >>   rl78-elf                2823  2530   293  89.62%   -40742      -4
> >>   rx-elf                  2614  2480   134  94.87%   -18863      -1
> >>   s390-linux-gnu          1591  1393   198  87.55%   -16696      -1
> >>   s390x-linux-gnu         2015  1879   136  93.25%   -21134      -1
> >>   sh-linux-gnu            1870  1507   363  80.59%    -9491      -1
> >>   sparc-linux-gnu         1123  1075    48  95.73%   -14503      -1
> >>   sparc-wrs-vxworks       1121  1073    48  95.72%   -14578      -1
> >>   sparc64-linux-gnu       1096  1021    75  93.16%   -15003      -1
> >>   v850-elf                1897  1728   169  91.09%   -11078      -1
> >>   vax-netbsdelf           3035  2995    40  98.68%   -27642      -1
> >>   visium-elf              1392  1106   286  79.45%    -7984      -2
> >>   xstormy16-elf           2577  2071   506  80.36%   -13061      -1
> >>
> >> ** snip **
> >
> > To be more frank, once a split pattern is defined, it is applied by
> > the existing five split passes and possibly by combiners.  In most cases,
> > it is enough to apply it in one of these places, or that is what the
> > pattern creator intended.
> >
> > Wouldn't applying the split pattern indiscriminately in various places
> > be a waste of execution resources and bring about unexpected and undesired
> > results?
> >
> > I think we need some way to properly control the application of the split
> > pattern, perhaps some predicate function.
>
> The problem is more the define_insn part of the define_insn_and_split,
> rather than the define_split part.  The number and location of the split
> passes are the same: anything matched by rtl-late_combine1 will be split by
> rtl-split1, and anything matched by rtl-late_combine2 will be split by
> rtl-split2.  (If the split condition allows it, of course.)
>
> But more things can be matched by rtl-late_combine2 than are matched by
> other post-RA passes like rtl-postreload.  And that's what causes the
> issue.  If:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> is matched by rtl-late_combine2, the split will be done by rtl-split2.
> But the split will ICE, because it isn't valid to call gen_reg_rtx after
> register allocation.
>
> Similarly, if:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& can_create_pseudo_p ()"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> is matched by rtl-late_combine2, the can_create_pseudo_p condition will
> be false in rtl-split2, and in all subsequent split passes.  So we'll
> still have the unsplit instruction during final, which will ICE because
> it doesn't have a valid means of implementing the "#".
>
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using match_scratches.
> rs6000 has many examples of this.  E.g.:
>
>   (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>     [(set (match_operand:IEEE128 0 "register_operand" "=wa")
>           (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>      (clobber (match_scratch:V16QI 2 "=v"))]
>     "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>     "#"
>     "&& 1"
>     [(parallel [(set (match_dup 0)
>                      (neg:IEEE128 (match_dup 1)))
>                 (use (match_dup 2))])]
>   {
>     if (GET_CODE (operands[2]) == SCRATCH)
>       operands[2] = gen_reg_rtx (V16QImode);
>
>     emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
>   }
>     [(set_attr "length" "8")
>      (set_attr "type" "vecsimple")])
>
> Before RA, this is just:
>
>   (set ...)
>   (clobber (scratch:V16QI))
>
> and the split creates a new register.  After RA, operand 2 provides
> the required temporary register:
>
>   (set ...)
>   (clobber (reg:V16QI TMP))
>
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.
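Following that rs6000 approach, I suppose the xtensa-style "&& can_create_pseudo_p ()"
patterns could be reworked to reserve their temporary with a match_scratch, so
the split condition can stay "&& 1" and the split works both before and after
RA.  A rough, untested sketch (the SImode scratch, the "=&a" constraint and the
elided parts of the pattern are invented here for illustration):

  (define_insn_and_split "..."
    [...pattern...
     (clobber (match_scratch:SI 2 "=&a"))]  ;; reserve the temporary up front
    "...cond..."
    "#"
    "&& 1"
    [...pattern, using (match_dup 2) as the temporary...]
  {
    /* Before RA, operand 2 is still (scratch:SI), so make a fresh pseudo.
       After RA, it is the hard register that RA chose for "=&a".  */
    if (GET_CODE (operands[2]) == SCRATCH)
      operands[2] = gen_reg_rtx (SImode);
  })

That way the same split path would be valid in rtl-split1 and in rtl-split2,
including after rtl-late_combine2 matches the insn.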
What about splitting immediately in late-combine?  Wouldn't that possibly
allow more combinations to happen immediately?

Richard.

> Thanks,
> Richard
>