On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Takayuki 'January June' Suwa <jjsuwa_sys3...@yahoo.co.jp> writes:
> > On 2024/06/20 22:34, Richard Sandiford wrote:
> >> This patch adds a combine pass that runs late in the pipeline.
> >> There are two instances: one between combine and split1, and one
> >> after postreload.
> >>
> >> The pass currently has a single objective: remove definitions by
> >> substituting into all uses.  The pre-RA version tries to restrict
> >> itself to cases that are likely to have a neutral or beneficial
> >> effect on register pressure.
> >>
> >> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> >> in the aarch64 test results, mostly due to making proper use of
> >> MOVPRFX in cases where we didn't previously.
> >>
> >> This is just a first step.  I'm hoping that the pass could be
> >> used for other combine-related optimisations in future.  In particular,
> >> the post-RA version doesn't need to restrict itself to cases where all
> >> uses are substitutable, since it doesn't have to worry about register
> >> pressure.  If we did that, and if we extended it to handle multi-register
> >> REGs, the pass might be a viable replacement for regcprop, which in
> >> turn might reduce the cost of having a post-RA instance of the new pass.
> >>
> >> On most targets, the pass is enabled by default at -O2 and above.
> >> However, it has a tendency to undo x86's STV and RPAD passes,
> >> by folding the more complex post-STV/RPAD form back into the
> >> simpler pre-pass form.
> >>
> >> Also, running a pass after register allocation means that we can
> >> now match define_insn_and_splits that were previously only matched
> >> before register allocation.  This trips things like:
> >>
> >>   (define_insn_and_split "..."
> >>     [...pattern...]
> >>     "...cond..."
> >>     "#"
> >>     "&& 1"
> >>     [...pattern...]
> >>     {
> >>       ...unconditional use of gen_reg_rtx ()...;
> >>     }
> >>
> >> because matching and splitting after RA will call gen_reg_rtx when
> >> pseudos are no longer allowed.  rs6000 has several instances of this.
> >
> > xtensa also has something like that.
> >
> >> xtensa has a variation in which the split condition is:
> >>
> >>   "&& can_create_pseudo_p ()"
> >>
> >> The failure then is that, if we match after RA, we'll never be
> >> able to split the instruction.
> >
> > To be honest, I'm confused by the possibility of adding a split pattern
> > application opportunity that depends on the optimization options after
> > Rel... ah, LRA and before the existing rtl-split2.
> >
> > Because I just recently submitted a patch that I expected would reliably
> > (i.e. regardless of optimization options, etc.) apply the split pattern
> > first in the rtl-split2 pass after RA, and it was merged.
> >
> >> The patch therefore disables the pass by default on i386, rs6000
> >> and xtensa.  Hopefully we can fix those ports later (if their
> >> maintainers want).  It seems easier to add the pass first, though,
> >> to make it easier to test any such fixes.
> >>
> >> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> >> quite a few updates for the late-combine output.  That might be
> >> worth doing, but it seems too complex to do as part of this patch.
> >>
> >> I tried compiling at least one target per CPU directory and comparing
> >> the assembly output for parts of the GCC testsuite.  This is just a way
> >> of getting a flavour of how the pass performs; it obviously isn't a
> >> meaningful benchmark.
> >> All targets seemed to improve on average:
> >>
> >>   Target                 Tests  Good   Bad  %Good     Delta  Median
> >>   ======                 =====  ====   ===  =====     =====  ======
> >>   aarch64-linux-gnu       2215  1975   240  89.16%    -4159      -1
> >>   aarch64_be-linux-gnu    1569  1483    86  94.52%   -10117      -1
> >>   alpha-linux-gnu         1454  1370    84  94.22%    -9502      -1
> >>   amdgcn-amdhsa           5122  4671   451  91.19%   -35737      -1
> >>   arc-elf                 2166  1932   234  89.20%   -37742      -1
> >>   arm-linux-gnueabi       1953  1661   292  85.05%   -12415      -1
> >>   arm-linux-gnueabihf     1834  1549   285  84.46%   -11137      -1
> >>   avr-elf                 4789  4330   459  90.42%  -441276      -4
> >>   bfin-elf                2795  2394   401  85.65%   -19252      -1
> >>   bpf-elf                 3122  2928   194  93.79%    -8785      -1
> >>   c6x-elf                 2227  1929   298  86.62%   -17339      -1
> >>   cris-elf                3464  3270   194  94.40%   -23263      -2
> >>   csky-elf                2915  2591   324  88.89%   -22146      -1
> >>   epiphany-elf            2399  2304    95  96.04%   -28698      -2
> >>   fr30-elf                7712  7299   413  94.64%   -99830      -2
> >>   frv-linux-gnu           3332  2877   455  86.34%   -25108      -1
> >>   ft32-elf                2775  2667   108  96.11%   -25029      -1
> >>   h8300-elf               3176  2862   314  90.11%   -29305      -2
> >>   hppa64-hp-hpux11.23     4287  4247    40  99.07%   -45963      -2
> >>   ia64-linux-gnu          2343  1946   397  83.06%    -9907      -2
> >>   iq2000-elf              9684  9637    47  99.51%  -126557      -2
> >>   lm32-elf                2681  2608    73  97.28%   -59884      -3
> >>   loongarch64-linux-gnu   1303  1218    85  93.48%   -13375      -2
> >>   m32r-elf                1626  1517   109  93.30%    -9323      -2
> >>   m68k-linux-gnu          3022  2620   402  86.70%   -21531      -1
> >>   mcore-elf               2315  2085   230  90.06%   -24160      -1
> >>   microblaze-elf          2782  2585   197  92.92%   -16530      -1
> >>   mipsel-linux-gnu        1958  1827   131  93.31%   -15462      -1
> >>   mipsisa64-linux-gnu     1655  1488   167  89.91%   -16592      -2
> >>   mmix                    4914  4814   100  97.96%   -63021      -1
> >>   mn10300-elf             3639  3320   319  91.23%   -34752      -2
> >>   moxie-rtems             3497  3252   245  92.99%   -87305      -3
> >>   msp430-elf              4353  3876   477  89.04%   -23780      -1
> >>   nds32le-elf             3042  2780   262  91.39%   -27320      -1
> >>   nios2-linux-gnu         1683  1355   328  80.51%    -8065      -1
> >>   nvptx-none              2114  1781   333  84.25%   -12589      -2
> >>   or1k-elf                3045  2699   346  88.64%   -14328      -2
> >>   pdp11                   4515  4146   369  91.83%   -26047      -2
> >>   pru-elf                 1585  1245   340  78.55%    -5225      -1
> >>   riscv32-elf             2122  2000   122  94.25%  -101162      -2
> >>   riscv64-elf             1841  1726   115  93.75%   -49997      -2
> >>   rl78-elf                2823  2530   293  89.62%   -40742      -4
> >>   rx-elf                  2614  2480   134  94.87%   -18863      -1
> >>   s390-linux-gnu          1591  1393   198  87.55%   -16696      -1
> >>   s390x-linux-gnu         2015  1879   136  93.25%   -21134      -1
> >>   sh-linux-gnu            1870  1507   363  80.59%    -9491      -1
> >>   sparc-linux-gnu         1123  1075    48  95.73%   -14503      -1
> >>   sparc-wrs-vxworks       1121  1073    48  95.72%   -14578      -1
> >>   sparc64-linux-gnu       1096  1021    75  93.16%   -15003      -1
> >>   v850-elf                1897  1728   169  91.09%   -11078      -1
> >>   vax-netbsdelf           3035  2995    40  98.68%   -27642      -1
> >>   visium-elf              1392  1106   286  79.45%    -7984      -2
> >>   xstormy16-elf           2577  2071   506  80.36%   -13061      -1
> >>
> >> ** snip **
> >
> > To be more frank, once a split pattern is defined, it is applied by
> > the existing five split passes and possibly by combiners.  In most cases,
> > it is enough to apply it in one of these places, or that is what the
> > pattern creator intended.
> >
> > Wouldn't applying the split pattern indiscriminately in various places
> > be a waste of execution resources and bring about unexpected and undesired
> > results?
> >
> > I think we need some way to properly control the application of the split
> > pattern, perhaps some predicate function.
>
> The problem is more the define_insn part of the define_insn_and_split,
> rather than the define_split part.  The number and location of the split
> passes are the same: anything matched by rtl-late_combine1 will be split by
> rtl-split1, and anything matched by rtl-late_combine2 will be split by
> rtl-split2.  (If the split condition allows it, of course.)
>
> But more things can be matched by rtl-late_combine2 than are matched by
> other post-RA passes like rtl-postreload.  And that's what causes the
> issue.  If:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> is matched by rtl-late_combine2, the split will be done by rtl-split2.
> But the split will ICE, because it isn't valid to call gen_reg_rtx after
> register allocation.
>
> Similarly, if:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& can_create_pseudo_p ()"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> is matched by rtl-late_combine2, the can_create_pseudo_p condition will
> be false in rtl-split2, and in all subsequent split passes.  So we'll
> still have the unsplit instruction during final, which will ICE because
> it doesn't have a valid means of implementing the "#".
>
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using match_scratches.
> rs6000 has many examples of this.  E.g.:
>
>   (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
>     [(set (match_operand:IEEE128 0 "register_operand" "=wa")
>           (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
>      (clobber (match_scratch:V16QI 2 "=v"))]
>     "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
>     "#"
>     "&& 1"
>     [(parallel [(set (match_dup 0)
>                      (neg:IEEE128 (match_dup 1)))
>                 (use (match_dup 2))])]
>   {
>     if (GET_CODE (operands[2]) == SCRATCH)
>       operands[2] = gen_reg_rtx (V16QImode);
>
>     emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
>   }
>     [(set_attr "length" "8")
>      (set_attr "type" "vecsimple")])
>
> Before RA, this is just:
>
>   (set ...)
>   (clobber (scratch:V16QI))
>
> and the split creates a new register.  After RA, operand 2 provides
> the required temporary register:
>
>   (set ...)
>   (clobber (reg:V16QI TMP))
>
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.
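Following that rs6000 approach, I suppose the xtensa-style "&& can_create_pseudo_p ()"
patterns could be reworked to reserve their temporary with a match_scratch, so
the split condition can stay "&& 1" and the split works both before and after
RA.  A rough, untested sketch (the SImode scratch, the "=&a" constraint and the
elided parts of the pattern are invented here for illustration):

  (define_insn_and_split "..."
    [...pattern...
     (clobber (match_scratch:SI 2 "=&a"))]  ;; reserve the temporary up front
    "...cond..."
    "#"
    "&& 1"
    [...pattern, using (match_dup 2) as the temporary...]
  {
    /* Before RA, operand 2 is still (scratch:SI), so make a fresh pseudo.
       After RA, it is the hard register that RA chose for "=&a".  */
    if (GET_CODE (operands[2]) == SCRATCH)
      operands[2] = gen_reg_rtx (SImode);
  })

That way the same split path would be valid in rtl-split1 and in rtl-split2,
including after rtl-late_combine2 matches the insn.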
What about splitting immediately in late-combine?  Wouldn't that possibly
allow more combinations to happen immediately?

Richard.

> Thanks,
> Richard
>