Hi!

On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).

Do you have an example of that?

> Combine currently requires the combined instruction to live at the same
> location as i3.

Or i2 and i3.

> I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.

s/difficult/impossible/, yes.

A long time ago we had to move insns only forward even for correctness,
but that should no longer be required; combine is kept finite by other
means now.

> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.

In my dreams tracking nonzero bits would be a dataflow problem.

> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
> 
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.

And combine does any of {2,3,4}->{1,2} combinations, and it also can
modify a third insn ("other_insn").  For the bigger ->1 combos, if one
*can* be decomposed into a bunch of 2->1 combinations, then those
intermediate steps result in insns that have greater cost than those we
started with (or else those combinations *would* have been done).  For
the ->2 combinations, there are many ways those two insns can be
formed: they can be the two arms of a parallel, or combine can break a
non-matching insn into two at what looks like a good spot for that, or
it can use a define_split for it.

All those things lead to many more successful combinations :-)
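To make the cost argument above concrete, here is a toy sketch (mine,
not GCC code; the mnemonics and costs are made up) of why a profitable
3->1 combination need not decompose into profitable 2->1 steps: the
intermediate 2->1 result can be more expensive than the pair it
replaces, so a purely pairwise combiner rejects the first step and
never reaches the cheap 3->1 form.

```python
# Hypothetical per-insn costs, chosen only to illustrate the argument.
COST = {
    "mul": 4, "add": 1,
    "fused3": 4,    # single insn replacing mul+add+add (3->1)
    "partial2": 6,  # intermediate insn replacing mul+add (2->1)
}

def cost(seq):
    """Total cost of a sequence of insns."""
    return sum(COST[i] for i in seq)

before = ["mul", "add", "add"]   # cost 6
via_pair = ["partial2", "add"]   # first 2->1 step: cost 7, a regression
direct_3to1 = ["fused3"]         # the 3->1 combination: cost 4, a win

# The pairwise step alone increases cost, so a combiner that only ever
# does 2->1 steps rejects it -- even though the full 3->1 combo pays off.
assert cost(via_pair) > cost(before)
assert cost(direct_3to1) < cost(before)
```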

> On a more positive note, the pass handles things that the current
> combine pass doesn't:
> 
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).

How are dependencies represented in your new pass?  If it just walks
over the insn stream for everything, you get quadratic complexity when
you move insns backwards.  We have that in combine already, mostly from
modified_between_p, but there it is limited by how LOG_LINKS work; we
have been doing it that way for so long, without problems turning up,
that it must work fine in practice.  But I am worried about it when
moving insns back an unlimited distance.
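To spell out the worry, a sketch (mine, not from either pass): a
modified_between_p-style check has to look at every insn between an
insn's old and new locations, so if each of n combinations can move its
insn back a distance proportional to n, the total scanning work grows
quadratically.

```python
# Sketch only: count the insns a modified_between_p-style walk would
# visit if insn i is moved from position i all the way back to
# position 0, for each of n insns.
def scan_cost_for_backward_moves(n):
    # moving insn i back past i earlier insns scans i insns,
    # so the total is 0 + 1 + ... + (n-1)
    return sum(range(n))

# The total grows like n^2 / 2 -- fine for short distances, not for
# unlimited backward motion.
assert scan_cost_for_backward_moves(10) == 45
assert scan_cost_for_backward_moves(100) == 4950
```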

If combine results in two insns it puts them at i2 and i3, and it can
actually move a SET to i2 that was at i3 before the combination.

> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.

That is what combine does, too.

> - it tries using REG_EQUAL notes for the final instruction.

And that.

> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.

That happens only if somehow there is a link between the two (so
essentially never).  The only combinations tried by combine are those
via LOG_LINKS, which connect a SET to the first corresponding use.
This is a key factor that makes combine kind of linear (instead of
exponential) in complexity.
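A simplified sketch of that link structure (my illustration, not GCC's
actual LOG_LINKS code): each SET gets at most one link, to its first
later use, so the number of links -- and hence of combination attempts
-- is bounded by the number of insns.

```python
# Sketch: insns are (dest_reg, [used_regs]) pairs, in program order.
# Returns (use_index, def_index) pairs: each def links only to its
# FIRST corresponding use, as with combine's LOG_LINKS.
def build_log_links(insns):
    links, last_def, linked = [], {}, set()
    for i, (dest, uses) in enumerate(insns):
        for r in uses:
            d = last_def.get(r)
            if d is not None and (r, d) not in linked:
                links.append((i, d))    # first use of this def
                linked.add((r, d))      # later uses get no link
        if dest is not None:
            last_def[dest] = i
    return links

insns = [("r1", []), ("r2", ["r1"]), ("r3", ["r1", "r2"])]
# insn 1 is the first use of r1's def, insn 2 the first use of r2's;
# the second use of r1 (in insn 2) gets no link.
assert build_log_links(insns) == [(1, 0), (2, 1)]
```

With at most one link per def, the number of pairs ever tried stays
linear in the insn count.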

> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.

But how many other insns does it look at, say by modified_between_p or
the like?

> The patch adds two instances of the new pass: one before combine and
> one after it.

One thing I want to do is some mini-combine after every split, probably
only with the insns new from the split.  But we are not in cfglayout
mode anymore by then, and we have only hard regs (except in the first
split pass, which is just a little later than your new pass).

> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
> 
> run-combine=2 (normal combine):  100.0% (baseline)
> run-combine=4 (new pass only)     98.0%
> run-combine=6 (both passes)      100.3%
> 
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.

And amount of garbage produced?

> To get a feel for the effect on multiple targets, I did my usual
> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> and g++.dg, this time comparing run-combine=2 and run-combine=6
> using -O2 -ftree-vectorize:

One problem with this is that these are very short functions on average.

What is the kind of changes you see for other targets?

Wait, does this combine sets with a hard reg source as well?  It
shouldn't do that, that is RA's job; doing this in a greedy way is a
bad idea.  (I haven't yet verified whether you do this, fwiw.)

> Inevitably there was some scan-assembler fallout for other tests.
> E.g. in gcc.target/aarch64/vmov_n_1.c:
> 
> #define INHIB_OPTIMIZATION asm volatile ("" : : : "memory")
>   ...
>   INHIB_OPTIMIZATION;                                                 \
>   (a) = TEST (test, data_len);                                       \
>   INHIB_OPTIMIZATION;                                                 \
>   (b) = VMOV_OBSCURE_INST (reg_len, data_len, data_type) (&(a));      \
> 
> is no longer effective for preventing move (a) from being merged
> into (b), because the pass can merge at the point of (a).

It never was effective for that.  Unless (b) lives in memory, in which
case your new pass has a bug here.

> I think
> this is a valid thing to do -- the asm semantics are still satisfied,
> and asm volatile ("" : : : "memory") never acted as a register barrier.
> But perhaps we should deal with this as a special case?

I don't think we should, no.  What does "register barrier" even mean,
*exactly*?

(I'll look at the new version of your patch, but not today).


Segher
