Re: [PATCH 6/6] Add a late-combine pass [PR106594]

Takayuki 'January June' Suwa Sat, 22 Jun 2024 21:40:39 -0700

Hi!

On 2024/06/23 1:49, Richard Sandiford wrote:

Takayuki 'January June' Suwa <jjsuwa_sys3...@yahoo.co.jp> writes:

On 2024/06/20 22:34, Richard Sandiford wrote:

This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.


The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.

This is just a first step.  I'm hoping that the pass could be
used for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure.  If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.

On most targets, the pass is enabled by default at -O2 and above.
However, it has a tendency to undo x86's STV and RPAD passes,
by folding the more complex post-STV/RPAD form back into the
simpler pre-pass form.

Also, running a pass after register allocation means that we can
now match define_insn_and_splits that were previously only matched
before register allocation.  This trips things like:

    (define_insn_and_split "..."
      [...pattern...]
      "...cond..."
      "#"
      "&& 1"
      [...pattern...]
      {
        ...unconditional use of gen_reg_rtx ()...;
      }

because matching and splitting after RA will call gen_reg_rtx when
pseudos are no longer allowed.  rs6000 has several instances of this.


xtensa also has something like that.

xtensa has a variation in which the split condition is:

      "&& can_create_pseudo_p ()"

The failure then is that, if we match after RA, we'll never be
able to split the instruction.


To be honest, I'm confusing by the possibility of adding a split pattern
application opportunity that depends on the optimization options after
Rel... ah, LRA and before the existing rtl-split2.

Because I just recently submitted a patch that I expected would reliably
(i.e. regardless of optimization options, etc.) apply the split pattern
first in the rtl-split2 pass after RA, and it was merged.


The patch therefore disables the pass by default on i386, rs6000
and xtensa.  Hopefully we can fix those ports later (if their
maintainers want).  It seems easier to add the pass first, though,
to make it easier to test any such fixes.

gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
quite a few updates for the late-combine output.  That might be
worth doing, but it seems too complex to do as part of this patch.

I tried compiling at least one target per CPU directory and comparing
the assembly output for parts of the GCC testsuite.  This is just a way
of getting a flavour of how the pass performs; it obviously isn't a
meaningful benchmark.  All targets seemed to improve on average:

Target                 Tests   Good    Bad   %Good   Delta  Median
======                 =====   ====    ===   =====   =====  ======
aarch64-linux-gnu       2215   1975    240  89.16%   -4159      -1
aarch64_be-linux-gnu    1569   1483     86  94.52%  -10117      -1
alpha-linux-gnu         1454   1370     84  94.22%   -9502      -1
amdgcn-amdhsa           5122   4671    451  91.19%  -35737      -1
arc-elf                 2166   1932    234  89.20%  -37742      -1
arm-linux-gnueabi       1953   1661    292  85.05%  -12415      -1
arm-linux-gnueabihf     1834   1549    285  84.46%  -11137      -1
avr-elf                 4789   4330    459  90.42% -441276      -4
bfin-elf                2795   2394    401  85.65%  -19252      -1
bpf-elf                 3122   2928    194  93.79%   -8785      -1
c6x-elf                 2227   1929    298  86.62%  -17339      -1
cris-elf                3464   3270    194  94.40%  -23263      -2
csky-elf                2915   2591    324  88.89%  -22146      -1
epiphany-elf            2399   2304     95  96.04%  -28698      -2
fr30-elf                7712   7299    413  94.64%  -99830      -2
frv-linux-gnu           3332   2877    455  86.34%  -25108      -1
ft32-elf                2775   2667    108  96.11%  -25029      -1
h8300-elf               3176   2862    314  90.11%  -29305      -2
hppa64-hp-hpux11.23     4287   4247     40  99.07%  -45963      -2
ia64-linux-gnu          2343   1946    397  83.06%   -9907      -2
iq2000-elf              9684   9637     47  99.51% -126557      -2
lm32-elf                2681   2608     73  97.28%  -59884      -3
loongarch64-linux-gnu   1303   1218     85  93.48%  -13375      -2
m32r-elf                1626   1517    109  93.30%   -9323      -2
m68k-linux-gnu          3022   2620    402  86.70%  -21531      -1
mcore-elf               2315   2085    230  90.06%  -24160      -1
microblaze-elf          2782   2585    197  92.92%  -16530      -1
mipsel-linux-gnu        1958   1827    131  93.31%  -15462      -1
mipsisa64-linux-gnu     1655   1488    167  89.91%  -16592      -2
mmix                    4914   4814    100  97.96%  -63021      -1
mn10300-elf             3639   3320    319  91.23%  -34752      -2
moxie-rtems             3497   3252    245  92.99%  -87305      -3
msp430-elf              4353   3876    477  89.04%  -23780      -1
nds32le-elf             3042   2780    262  91.39%  -27320      -1
nios2-linux-gnu         1683   1355    328  80.51%   -8065      -1
nvptx-none              2114   1781    333  84.25%  -12589      -2
or1k-elf                3045   2699    346  88.64%  -14328      -2
pdp11                   4515   4146    369  91.83%  -26047      -2
pru-elf                 1585   1245    340  78.55%   -5225      -1
riscv32-elf             2122   2000    122  94.25% -101162      -2
riscv64-elf             1841   1726    115  93.75%  -49997      -2
rl78-elf                2823   2530    293  89.62%  -40742      -4
rx-elf                  2614   2480    134  94.87%  -18863      -1
s390-linux-gnu          1591   1393    198  87.55%  -16696      -1
s390x-linux-gnu         2015   1879    136  93.25%  -21134      -1
sh-linux-gnu            1870   1507    363  80.59%   -9491      -1
sparc-linux-gnu         1123   1075     48  95.73%  -14503      -1
sparc-wrs-vxworks       1121   1073     48  95.72%  -14578      -1
sparc64-linux-gnu       1096   1021     75  93.16%  -15003      -1
v850-elf                1897   1728    169  91.09%  -11078      -1
vax-netbsdelf           3035   2995     40  98.68%  -27642      -1
visium-elf              1392   1106    286  79.45%   -7984      -2
xstormy16-elf           2577   2071    506  80.36%  -13061      -1

** snip **


To be more frank, once a split pattern is defined, it is applied by
the existing five split paths and possibly by combiners. In most cases,
it is enough to apply it in one of these places, or that is what the
pattern creator intended.

Wouldn't applying the split pattern indiscriminately in various places
be a waste of execution resources and bring about unexpected and undesired
results?

I think we need some way to properly control the application of the split
pattern, perhaps some predicate function.


The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part.  The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2.  (If the split condition allows it, of course.)

But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload.  And that's what causes the
issue.  If:

     (define_insn_and_split "..."
       [...pattern...]
       "...cond..."
       "#"
       "&& 1"
       [...pattern...]
       {
         ...unconditional use of gen_reg_rtx ()...;
       }

is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.

Similarly, if:

     (define_insn_and_split "..."
       [...pattern...]
       "...cond..."
       "#"
       "&& can_create_pseudo_p ()"
       [...pattern...]
       {
         ...unconditional use of gen_reg_rtx ()...;
       }

is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes.  So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".

The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using match_scratches.
rs6000 has many examples of this.  E.g.:

(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
   [(set (match_operand:IEEE128 0 "register_operand" "=wa")
        (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
    (clobber (match_scratch:V16QI 2 "=v"))]
   "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
   "#"
   "&& 1"
   [(parallel [(set (match_dup 0)
                   (neg:IEEE128 (match_dup 1)))
              (use (match_dup 2))])]
{
   if (GET_CODE (operands[2]) == SCRATCH)
     operands[2] = gen_reg_rtx (V16QImode);

   emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
   [(set_attr "length" "8")
    (set_attr "type" "vecsimple")])

Before RA, this is just:

   (set ...)
   (clobber (scratch:V16QI))

and the split creates a new register.  After RA, operand 2 provides
the required temporary register:

   (set ...)
   (clobber (reg:V16QI TMP))

Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition).  But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.

Thanks,
Richard


Ah, I see, I understand the standard idiom for synchronizing RA between
define_insn and split conditions(, I guess).  However, as I wrote before,
it is true that split paths are useful for other purposes as well.

If this is unacceptable practice, then an alternative solution would be
to add a target-specific path (in my case, after postreload and before
rtl-late_combine2).

That is obviously a very involved process.

Re: [PATCH 6/6] Add a late-combine pass [PR106594]

Reply via email to