Re: Add a new combine pass

2019-12-06 Thread Oleg Endo
On Fri, 2019-12-06 at 16:51 -0600, Segher Boessenkool wrote:
> On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote:
> > On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> > > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > > > explicitly, but it's got a R0_REGS class with only one said reg in it. 
> > > > So the default impl of class_likely_spilled should do its thing.
> > > 
> > > Yes, good point.  So what happened here?
> > 
> > "Something, somewhere, went terribly wrong"...
> > 
> > insn 18 wants to do
> > 
> > mov.l @(r4,r6),r0
> > 
> > But it can't because the reg+reg address mode has a R0 constraint
> > itself.  So it needs to be changed to
> > 
> > mov   r4,r0
> > mov.l @(r0,r6),r0
> > 
> > And it can't handle that.  Or only sometimes?  Don't remember.
> > 
> > >   Is it just RA messing things
> > > up, unrelated to the new pass?
> > 
> > Yep, I think so.  The additional pass seems to create "tougher" code so
> > reload passes out earlier than usual.  We've had the same issue when
> > trying address mode selection optimization.  In fact that was one huge
> > showstopper.
> 
> So maybe you should have a define_insn_and_split that allows any two
> regs and replaces one by r0 if neither is (and a move to r0 before the
> load)?  Split after reload of course.
> 
> It may be admitting defeat, but it may even result in better code as
> well ;-)
> 

AFAIR I've tried that already and it was just like running in circles,
meaning it didn't help.  Perhaps if R0_REGS was hidden from RA altogether
it might work.  But that sounds like opening a whole other can of
worms.  Another idea I was entertaining was to do a custom RTL pass to
pre-allocate all R0 constraints before the real full RA.  But then the
whole reload stuff would still have the same issue as above, so all
that wallpapering is moot.  A proper fix of the actual problem would
be more appropriate.

Cheers,
Oleg



Re: Add a new combine pass

2019-12-06 Thread Segher Boessenkool
On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote:
> On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > > explicitly, but it's got a R0_REGS class with only one said reg in it. 
> > > So the default impl of class_likely_spilled should do its thing.
> > 
> > Yes, good point.  So what happened here?
> 
> "Something, somewhere, went terribly wrong"...
> 
> insn 18 wants to do
> 
> mov.l @(r4,r6),r0
> 
> But it can't because the reg+reg address mode has a R0 constraint
> itself.  So it needs to be changed to
> 
> mov   r4,r0
> mov.l @(r0,r6),r0
> 
> And it can't handle that.  Or only sometimes?  Don't remember.
> 
> >   Is it just RA messing things
> > up, unrelated to the new pass?
> 
> Yep, I think so.  The additional pass seems to create "tougher" code so
> reload passes out earlier than usual.  We've had the same issue when
> trying address mode selection optimization.  In fact that was one huge
> showstopper.

So maybe you should have a define_insn_and_split that allows any two
regs and replaces one by r0 if neither is (and a move to r0 before the
load)?  Split after reload of course.
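
For concreteness, a rough sketch of the kind of pattern meant here
(untested; the name, predicates and constraints below are placeholders
rather than anything taken from the real sh.md):

(define_insn_and_split "*movsi_index_via_r0"
  [(set (match_operand:SI 0 "register_operand" "=z")
        (mem:SI (plus:SI (match_operand:SI 1 "register_operand" "r")
                         (match_operand:SI 2 "register_operand" "r"))))]
  "TARGET_SH1"
  "#"
  "&& reload_completed"
  [(set (match_dup 0) (match_dup 1))
   (set (match_dup 0)
        (mem:SI (plus:SI (match_dup 0) (match_dup 2))))]
{
  /* After reload operand 0 is r0 (the "z" constraint).  If operand 2 is
     already r0, use operand 1 as the addend instead so the address stays
     valid; a real pattern would also drop the leading move when one of
     the index registers is r0 to begin with.  */
  if (REGNO (operands[2]) == REGNO (operands[0]))
    std::swap (operands[1], operands[2]);
})

The split just materialises the copy to r0 that reload could not.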

It may be admitting defeat, but it may even result in better code as
well ;-)


Segher


Re: Add a new combine pass

2019-12-05 Thread Richard Sandiford
Here's a revised version based on the feedback so far.  Changes in v2:
- Don't move instructions that set or use allocatable hard registers.
- Check legitimate_combined_insn.
- Check cannot_copy_insn_p when keeping the original insn in parallel.
- Disable the pass if HAVE_cc0.

I compared v1 and v2 in the same way as before and the new restrictions
didn't make much difference (as hoped).  Also bootstrapped & regression-
tested on aarch64-linux-gnu and x86_64-linux-gnu with run-combine
defaulting to 6, i.e. the main combine pass plus the later combine2
variant (unlike in the patch, where the new pass is disabled by default).

Thanks,
Richard


2019-12-05  Richard Sandiford  

gcc/
* Makefile.in (OBJS): Add combine2.o.
* params.opt (--param=run-combine): New option.
* doc/invoke.texi: Document it.
* tree-pass.h (make_pass_combine2_before): Declare.
(make_pass_combine2_after): Likewise.
* passes.def: Add them.
* timevar.def (TV_COMBINE2): New timevar.
* cfgrtl.h (update_cfg_for_uncondjump): Declare.
* combine.c (update_cfg_for_uncondjump): Move to...
* cfgrtl.c (update_cfg_for_uncondjump): ...here.
* simplify-rtx.c (simplify_truncation): Handle comparisons.
* recog.h (validate_simplify_replace_rtx): Declare.
* recog.c (validate_simplify_replace_rtx_1): New function.
(validate_simplify_replace_rtx_uses): Likewise.
(validate_simplify_replace_rtx): Likewise.
* combine2.c: New file.

Index: gcc/Makefile.in
===
--- gcc/Makefile.in 2019-12-03 18:06:09.885650522 +
+++ gcc/Makefile.in 2019-12-05 10:11:50.637631870 +
@@ -1261,6 +1261,7 @@ OBJS = \
cgraphunit.o \
cgraphclones.o \
combine.o \
+   combine2.o \
combine-stack-adj.o \
compare-elim.o \
context.o \
Index: gcc/params.opt
===
--- gcc/params.opt  2019-12-02 17:38:20.072423250 +
+++ gcc/params.opt  2019-12-05 10:11:50.653631761 +
@@ -760,6 +760,10 @@ Use internal function id in profile look
 Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
 Maximum depth of a loop nest to fully value-number optimistically.
 
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
 -param=sccvn-max-alias-queries-per-access=
 Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
 Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi 2019-12-02 17:38:18.364434903 +
+++ gcc/doc/invoke.texi 2019-12-05 10:11:50.653631761 +
@@ -11797,6 +11797,11 @@ in combiner for a pseudo register as las
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
Index: gcc/tree-pass.h
===
--- gcc/tree-pass.h 2019-11-19 16:25:28.0 +
+++ gcc/tree-pass.h 2019-12-05 10:11:50.657631731 +
@@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
 extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===
--- gcc/passes.def  2019-11-19 16:25:28.0 +
+++ gcc/passes.def  2019-12-05 10:11:50.653631761 +
@@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
   NEXT_PASS (pass_inc_dec);
   NEXT_PASS (pass_initialize_regs);
   NEXT_PASS (pass_ud_rtl_dce);
+  NEXT_PASS (pass_combine2_before);
   NEXT_PASS (pass_combine);
+  NEXT_PASS (pass_combine2_after);
   NEXT_PASS (pass_if_after_combine);
   NEXT_PASS 

Re: Add a new combine pass

2019-12-04 Thread Oleg Endo
On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> > On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > > 
> > > > > - sh (that's sh4-linux):
> > > > > 
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 
> > > > > 'snmp_get_cpu_field':
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to 
> > > > > find a register to spill in class 'R0_REGS'
> > > > >  1638 | }
> > > > >   | ^
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the 
> > > > > insn:
> > > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > > > (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) 
> > > > > "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > > >  (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > > > (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > > > (nil
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier 
> > > > > errors, bailing out
> > > > 
> > > > Would have to look more at this one.  Seems odd that it can't allocate
> > > > R0 when it's already the destination and when R0 can't be live before
> > > > the insn.  But there again, this is reload, so my enthusiasm for 
> > > > looking
> > > > is a bit limited :-)
> > > 
> > > It wants to use r0 in some other insn, so it needs to spill it here, but
> > > cannot.  This is what class_likely_spilled is for.
> > 
> > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > explicitly, but it's got a R0_REGS class with only one said reg in it. 
> > So the default impl of class_likely_spilled should do its thing.
> 
> Yes, good point.  So what happened here?

"Something, somewhere, went terribly wrong"...

insn 18 wants to do

mov.l @(r4,r6),r0

But it can't because the reg+reg address mode has a R0 constraint
itself.  So it needs to be changed to

mov   r4,r0
mov.l @(r0,r6),r0

And it can't handle that.  Or only sometimes?  Don't remember.


>   Is it just RA messing things
> up, unrelated to the new pass?
> 

Yep, I think so.  The additional pass seems to create "tougher" code so
reload passes out earlier than usual.  We've had the same issue when
trying address mode selection optimization.  In fact that was one huge
showstopper.

Cheers,
Oleg



Re: Add a new combine pass

2019-12-03 Thread Segher Boessenkool
On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > 
> > > > - sh (that's sh4-linux):
> > > > 
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 
> > > > 'snmp_get_cpu_field':
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to 
> > > > find a register to spill in class 'R0_REGS'
> > > >  1638 | }
> > > >   | ^
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the 
> > > > insn:
> > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > > (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) 
> > > > "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > >  (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > > (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > > (nil
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier 
> > > > errors, bailing out
> > > 
> > > Would have to look more at this one.  Seems odd that it can't allocate
> > > R0 when it's already the destination and when R0 can't be live before
> > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > > is a bit limited :-)
> > 
> > It wants to use r0 in some other insn, so it needs to spill it here, but
> > cannot.  This is what class_likely_spilled is for.
> 
> Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> explicitly, but it's got a R0_REGS class with only one said reg in it. 
> So the default impl of class_likely_spilled should do its thing.

Yes, good point.  So what happened here?  Is it just RA messing things
up, unrelated to the new pass?


Segher


Re: Add a new combine pass

2019-12-03 Thread Oleg Endo
On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> 
> > > - sh (that's sh4-linux):
> > > 
> > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 
> > > 'snmp_get_cpu_field':
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find 
> > > a register to spill in class 'R0_REGS'
> > >  1638 | }
> > >   | ^
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the 
> > > insn:
> > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) 
> > > "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > >  (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > (nil
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier 
> > > errors, bailing out
> > 
> > Would have to look more at this one.  Seems odd that it can't allocate
> > R0 when it's already the destination and when R0 can't be live before
> > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > is a bit limited :-)
> 
> It wants to use r0 in some other insn, so it needs to spill it here, but
> cannot.  This is what class_likely_spilled is for.
> 

Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
explicitly, but it's got a R0_REGS class with only one said reg in it. 
So the default impl of class_likely_spilled should do its thing.

LRA is available on SH and often fixes the R0 problems -- but not
always.  Maybe it got better over time, haven't checked.

Could you re-run the SH build tests with -mlra, please?

Cheers,
Oleg




Re: Add a new combine pass

2019-11-27 Thread Segher Boessenkool
On Wed, Nov 27, 2019 at 09:29:27AM +0100, Richard Biener wrote:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
>  wrote:
> > combine actually *calculates* DU chains almost completely, it just throws
> > away most of that information (it wants to have LOG_LINKS, as it did ages
> > ago).  The only thing stopping us from doing that right now is that not
> > all uses are counted (some are skipped).
> >
> > Since combine works only within BBs, DU chains are linear to compute, and
> > UD chains are trivial (and just linear to compute).
> 
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses which gives you a chain of size N * M.

And both N and M are constants here (bounded by a constant).  The only
dimensions we care about are those the user can grow unlimited: number
of registers, number of instructions, number of functions, that kind of
thing.

The control flow graph in a basic block is a DAG, making most of this
linear to compute.  Only updating it after every separate change is not
easily linear in total.

> Updating is linear as well if you can disregard partial defs.  Updating cannot
> be quadratic if compute is linear ;)

Sure it can.  Updating has to be O(1) (amortized) per change for the whole
pass to be O(n).  If it is O(n) per change you are likely O(n^2) in total.

I don't see how to make combine itself O(1) per change, but yeah I can
see how that can work (or almost work) for something simpler (and less
weighed down by history :-) ).


Segher


Re: Add a new combine pass

2019-11-27 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
>  wrote:
>>
>> On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
>> > Segher Boessenkool  writes:
>> > > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
>> > >> Segher Boessenkool  writes:
>> > >> > I am wondering the other way around :-)  Is what you do for combine2
>> > >> > something that would be more generally applicable/useful?  That's what
>> > >> > I'm trying to find out :-)
>> > >> >
>> > >> > What combine does could use some improvement, if you want to hear a
>> > >> > more direct motivations.  LOG_LINKS just skip references we cannot
>> > >> > handle (and some more), so we always have to do modified_between etc.,
>> > >> > which hurts.
>> > >>
>> > >> The trade-offs behind the choice of representation are very specific
>> > >> to the pass.
>> > >
>> > > Yes, but hopefully not so specific that every pass needs a completely
>> > > different representation ;-)
>> >
>> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
>> > But since DU/UD-chains are naturally quadratic in the general case,
> and are expensive to keep up to date, each DU/UD pass is going to have to
>> > make some compromises.  It doesn't seem too bad that passes make
>> > different compromises based on what they're trying to do.  (combine:
>> > single use per definition; fwprop.c: track all uses, but for dominating
>> > definitions only; sched: fudged via a param; regrename: single
> definition/multiple use chains optimised for renaming; combine2: full
>> > live range information, but limited use list; etc.)
>>
>> combine actually *calculates* DU chains almost completely, it just throws
>> away most of that information (it wants to have LOG_LINKS, as it did ages
>> ago).  The only thing stopping us from doing that right now is that not
>> all uses are counted (some are skipped).
>>
>> Since combine works only within BBs, DU chains are linear to compute, and
>> UD chains are trivial (and just linear to compute).
>
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses which gives you a chain of size N * M.
>
> Now - for combine you don't want partial defs, so for simplicity we could
> choose to _not_ record DU/UD chains whenever we see a partial def for
> a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
> handle DU/UD chains for regions where there is no partial definition
> and add a "fake" D denoting (there are [multiple] defs beyond that
> might be partial).  Depending on the use-case that should suffice and
> make the problem linear.
>
> I think you want to ask sth like "is REG changed [partially] between
> its use in insn A and the def in insn B" and you want to answer that by using
> REGs UD chain for that.  If you only ever reached the def in insn B via the
> "pruned" chain then this would work, likewise for the chain we do not compute
> any UD chain for REG.

(Passing over this as I think it's about what current combine wants.)

>> Updating is quadratic in general, sure.  Luckily in most realistic cases
>> it is cheap (most, sigh) (insns aren't combined to very far away).
>
> Updating is linear as well if you can disregard partial defs.
> Updating cannot be quadratic if compute is linear ;)

This was based on the assumption that we'd do an update after each
combination, so that the pass still sees correct info.  That then makes
the updates across one run of the pass quadratic, since the number of
successful combinations is O(ninsns).
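
Written out as a bound (this is just the argument above in symbols, not a
measured figure):

  \text{total update cost}
    = O(\text{successful combinations}) \times O(n_{\text{insns}})
    = O(n_{\text{insns}}) \times O(n_{\text{insns}})
    = O(n_{\text{insns}}^2)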

As far as the new pass goes: the pass would be quadratic if we tried
to combine each use in a single-def DU chain with its definition.  It would
also be quadratic if we tried to parallelise each pair of uses in a DU chain.
So if we did have full DU chains in the new pass, we'd also need some
limit N on the number of uses we try to combine with.

And if we're only going to try combining with N uses, then it seemed
better to track only N uses "by name", rather than pay the cost of
tracking all uses by name but ignoring the information for some of them.
All we care about for other uses is whether they would prevent a move.
We can track that using a simple point-based live range, where points
are LUIDs with gaps in between for new insns.

So the new pass uses a list of N specific uses and a single live range.
Querying whether a particular definition is live at a particular point
is then a constant-time operation.  So is updating the info after a
successful combination (potentially including a move).
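
As a rough illustration only (the names below are invented, not the ones
actually used in combine2.c), the per-definition record is along these lines:

#include <stdbool.h>

/* Sketch of the representation described above: each definition tracks up
   to N uses "by name" and summarises everything else with a point-based
   live range, where points are LUIDs with gaps left for new insns.  */
#define MAX_TRACKED_USES 4          /* hypothetical limit N */

struct def_info
{
  int def_point;                    /* point (LUID) of the definition */
  int use_points[MAX_TRACKED_USES]; /* the uses tracked individually */
  unsigned int num_uses;
  int live_start, live_end;         /* single range covering all uses */
};

/* Constant-time liveness query; updating live_start/live_end after a
   successful combination (possibly including a move) is also constant
   time.  */
static inline bool
live_at_p (const struct def_info *d, int point)
{
  return d->live_start <= point && point <= d->live_end;
}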

That still seems like a reasonable way of representing this, given what
the pass wants to do.  Moving to full DU chains would IMO just make the
pass more expensive with no obvious benefit.

Thanks,
Richard


Re: Add a new combine pass

2019-11-27 Thread Richard Biener
On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
 wrote:
>
> On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
> > Segher Boessenkool  writes:
> > > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
> > >> Segher Boessenkool  writes:
> > >> > I am wondering the other way around :-)  Is what you do for combine2
> > >> > something that would be more generally applicable/useful?  That's what
> > >> > I'm trying to find out :-)
> > >> >
> > >> > What combine does could use some improvement, if you want to hear a
> > >> > more direct motivation.  LOG_LINKS just skip references we cannot
> > >> > handle (and some more), so we always have to do modified_between etc.,
> > >> > which hurts.
> > >>
> > >> The trade-offs behind the choice of representation are very specific
> > >> to the pass.
> > >
> > > Yes, but hopefully not so specific that every pass needs a completely
> > > different representation ;-)
> >
> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
> > But since DU/UD-chains are naturally quadratic in the general case,
> > and are expensive to keep up to date, each DU/UD pass is going to have to
> > make some compromises.  It doesn't seem too bad that passes make
> > different compromises based on what they're trying to do.  (combine:
> > single use per definition; fwprop.c: track all uses, but for dominating
> > definitions only; sched: fudged via a param; regrename: single
> > definition/multiple use chains optimised for renaming; combine2: full
> > live range information, but limited use list; etc.)
>
> combine actually *calculates* DU chains almost completely, it just throws
> away most of that information (it wants to have LOG_LINKS, as it did ages
> ago).  The only thing stopping us from doing that right now is that not
> all uses are counted (some are skipped).
>
> Since combine works only within BBs, DU chains are linear to compute, and
> UD chains are trivial (and just linear to compute).

quadraticness appears for RTL DU/UD chains because of partial definitions,
that doesn't change for BBs so even there computing them is quadratic
(because recording them is).  The situation is simply having N partial
defs all reaching M uses which gives you a chain of size N * M.

Now - for combine you don't want partial defs, so for simplicity we could
choose to _not_ record DU/UD chains whenever we see a partial def for
a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
handle DU/UD chains for regions where there is no partial definition
and add a "fake" D denoting (there are [multiple] defs beyond that
might be partial).  Depending on the use-case that should suffice and
make the problem linear.

I think you want to ask sth like "is REG changed [partially] between
its use in insn A and the def in insn B" and you want to answer that by using
REGs UD chain for that.  If you only ever reached the def in insn B via the
"pruned" chain then this would work, likewise for the chain we do not compute
any UD chain for REG.

> Updating is quadratic in general, sure.  Luckily in most realistic cases
> it is cheap (most, sigh) (insns aren't combined to very far away).

Updating is linear as well if you can disregard partial defs.  Updating cannot
be quadratic if compute is linear ;)

> > So yeah, if passes want to make roughly the same compromises, it would
> > obviously be good if they shared a representation.  But since each pass
> > does something different, I don't think it's a bad sign that they make
> > different compromises and use different representations.
> >
> > So I don't think a new pass with a new representation is in itself a
> > sign of failure.
>
> Oh, I don't think so either.  I just wonder if it would be useful more
> generically :-)
>
> > >> >> >> Target       Tests     Delta    Best   Worst  Median
> > >> >> >> avr-elf       1341   -111401  -13824     680     -10
> > >> >> >
> > >> >> > Things like this are kind of suspicious :-)
> > >> >>
> > >> >> Yeah.  This mostly seems to come from mopping up the extra moves 
> > >> >> created
> > >> >> by make_more_copies.  So we have combinations like:
> > >> >>
> > >> >>58: r70:SF=r94:SF
> > >> >>   REG_DEAD r94:SF
> > >> >>60: r22:SF=r70:SF
> > >> >>   REG_DEAD r70:SF
> > >> >
> > >> > Why didn't combine do this?  A target problem?
> > >>
> > >> Seems to be because combine rejects hard-reg destinations whose classes
> > >> are likely spilled (cant_combine_insn_p).
> > >
> > > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > > then -- ICEs in RA.
> >
> > Not in this case though.  The final instruction is a hardreg<-pseudo move
> > whatever happens.  There's nothing special about r70 compared to r94.
>
> So the target hook could be improved?  Or, this doesn't matter anyway,
> the extra register move does not prevent any combinations, and RA should
> get rid of it when that is beneficial.
>
> But you see smaller code in the end, 

Re: Add a new combine pass

2019-11-25 Thread Segher Boessenkool
On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
> >> Segher Boessenkool  writes:
> >> > I am wondering the other way around :-)  Is what you do for combine2
> >> > something that would be more generally applicable/useful?  That's what
> >> > I'm trying to find out :-)
> >> >
> >> > What combine does could use some improvement, if you want to hear a
>> > more direct motivation.  LOG_LINKS just skip references we cannot
> >> > handle (and some more), so we always have to do modified_between etc.,
> >> > which hurts.
> >> 
> >> The trade-offs behind the choice of representation are very specific
> >> to the pass.
> >
> > Yes, but hopefully not so specific that every pass needs a completely
> > different representation ;-)
> 
> Well, it depends.  Most passes make do with df (without DU/UD-chains).
> But since DU/UD-chains are naturally quadratic in the general case,
> and are expensive to keep up to date, each DU/UD pass is going to have to
> make some compromises.  It doesn't seem too bad that passes make
> different compromises based on what they're trying to do.  (combine:
> single use per definition; fwprop.c: track all uses, but for dominating
> definitions only; sched: fudged via a param; regrename: single
> definition/multiple use chains optimised for renaming; combine2: full
> live range information, but limited use list; etc.)

combine actually *calculates* DU chains almost completely, it just throws
away most of that information (it wants to have LOG_LINKS, as it did ages
ago).  The only thing stopping us from doing that right now is that not
all uses are counted (some are skipped).

Since combine works only within BBs, DU chains are linear to compute, and
UD chains are trivial (and just linear to compute).

Updating is quadratic in general, sure.  Luckily in most realistic cases
it is cheap (most, sigh) (insns aren't combined to very far away).

> So yeah, if passes want to make roughly the same compromises, it would
> obviously be good if they shared a representation.  But since each pass
> does something different, I don't think it's a bad sign that they make
> different compromises and use different representations.
> 
> So I don't think a new pass with a new representation is in itself a
> sign of failure.

Oh, I don't think so either.  I just wonder if it would be useful more
generically :-)

> >> >> >> Target       Tests     Delta    Best   Worst  Median
> >> >> >> avr-elf       1341   -111401  -13824     680     -10
> >> >> >
> >> >> > Things like this are kind of suspicious :-)
> >> >> 
> >> >> Yeah.  This mostly seems to come from mopping up the extra moves created
> >> >> by make_more_copies.  So we have combinations like:
> >> >> 
> >> >>58: r70:SF=r94:SF
> >> >>   REG_DEAD r94:SF
> >> >>60: r22:SF=r70:SF
> >> >>   REG_DEAD r70:SF
> >> >
> >> > Why didn't combine do this?  A target problem?
> >> 
> >> Seems to be because combine rejects hard-reg destinations whose classes
> >> are likely spilled (cant_combine_insn_p).
> >
> > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > then -- ICEs in RA.
> 
> Not in this case though.  The final instruction is a hardreg<-pseudo move
> whatever happens.  There's nothing special about r70 compared to r94.

So the target hook could be improved?  Or, this doesn't matter anyway,
the extra register move does not prevent any combinations, and RA should
get rid of it when that is beneficial.

But you see smaller code in the end, hrm.


Segher


Re: Add a new combine pass

2019-11-25 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
>> Segher Boessenkool  writes:
>> > I am wondering the other way around :-)  Is what you do for combine2
>> > something that would be more generally applicable/useful?  That's what
>> > I'm trying to find out :-)
>> >
>> > What combine does could use some improvement, if you want to hear a
>> > more direct motivation.  LOG_LINKS just skip references we cannot
>> > handle (and some more), so we always have to do modified_between etc.,
>> > which hurts.
>> 
>> The trade-offs behind the choice of representation are very specific
>> to the pass.
>
> Yes, but hopefully not so specific that every pass needs a completely
> different representation ;-)

Well, it depends.  Most passes make do with df (without DU/UD-chains).
But since DU/UD-chains are naturally quadratic in the general case,
and are expensive to keep up to date, each DU/UD pass is going to have to
make some compromises.  It doesn't seem too bad that passes make
different compromises based on what they're trying to do.  (combine:
single use per definition; fwprop.c: track all uses, but for dominating
definitions only; sched: fudged via a param; regrename: single
definition/multiple use chains optimised for renaming; combine2: full
live range information, but limited use list; etc.)

So yeah, if passes want to make roughly the same compromises, it would
obviously be good if they shared a representation.  But since each pass
does something different, I don't think it's a bad sign that they make
different compromises and use different representations.

So I don't think a new pass with a new representation is in itself a
sign of failure.

>> >> >> Target       Tests     Delta    Best   Worst  Median
>> >> >> avr-elf       1341   -111401  -13824     680     -10
>> >> >
>> >> > Things like this are kind of suspicious :-)
>> >> 
>> >> Yeah.  This mostly seems to come from mopping up the extra moves created
>> >> by make_more_copies.  So we have combinations like:
>> >> 
>> >>58: r70:SF=r94:SF
>> >>   REG_DEAD r94:SF
>> >>60: r22:SF=r70:SF
>> >>   REG_DEAD r70:SF
>> >
>> > Why didn't combine do this?  A target problem?
>> 
>> Seems to be because combine rejects hard-reg destinations whose classes
>> are likely spilled (cant_combine_insn_p).
>
> Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> then -- ICEs in RA.

Not in this case though.  The final instruction is a hardreg<-pseudo move
whatever happens.  There's nothing special about r70 compared to r94.

> There should be a better way to do this.

ISTM we should be checking for whichever cases actually cause the
RA failures.  E.g. to take one extreme example, if all the following
are true:

- an insn has a single alternative
- an insn has a single non-earlyclobber output
- an insn has no parallel clobbers
- an insn has no auto-inc/decs
- an insn has a hard register destination that satisfies its constraints
- the hard register is defined in its original location

then there should be no problem.  The insn shouldn't need any output
reloads that would conflict with the hard register.  It also doesn't
extend the live range of the output.
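
A sketch of what such a check could look like (every name here is
hypothetical; GCC has no insn_summary type, the fields just mirror the
conditions listed above):

#include <stdbool.h>

struct insn_summary
{
  int n_alternatives;
  int n_outputs;
  bool output_is_earlyclobber;
  bool has_parallel_clobbers;
  bool has_autoinc;
  bool hard_reg_dest_satisfies_constraints;
  bool dest_in_original_location;
};

/* Allow a hard-register destination only when none of the known problem
   cases can apply.  */
static bool
hard_reg_dest_safe_p (const struct insn_summary *s)
{
  return (s->n_alternatives == 1
          && s->n_outputs == 1
          && !s->output_is_earlyclobber
          && !s->has_parallel_clobbers
          && !s->has_autoinc
          && s->hard_reg_dest_satisfies_constraints
          && s->dest_in_original_location);
}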

Obviously that's a lot of conditions :-)  And IMO they should be built
up the other way around: reject specific cases that are known to cause
problems, based on information about the matched insn.  But I think the
avr example shows that there's a real problem with using REGNO_REG_CLASS
for this too.  REGNO_REG_CLASS gives the smallest enclosing class, which
might not be the most relevant one in context.  (It isn't here, since
we're just passing arguments to functions.)

>> This SF argument register
>> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
>> the combination based on POINTER_X_REGS being likely spilled.
>
> static bool
> avr_class_likely_spilled_p (reg_class_t c)
> {
>   return (c != ALL_REGS &&
>           (AVR_TINY ? 1 : c != ADDW_REGS));
> }
>
> So this target severely shackles combine.  Does it have to?  If so, why
> not with combine2?

As far as the above example goes, I think returning true for
POINTER_X_REGS is the right thing to do.  It only has two 8-bit
registers, and they act as a pair when used as a pointer.

Thanks,
Richard


Re: Add a new combine pass

2019-11-25 Thread Segher Boessenkool
On Mon, Nov 25, 2019 at 09:40:36PM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > - i386 goes into an infinite loop compiling, or at least an hour or so...
> > Erm I forgot too record what it was compiling.  I did attach a GDB...  It
> > is something from lra_create_live_ranges.
> 
> Hmm.

This one is actually worrying me -- it's not obviously a simple problem,
or a target problem, or a pre-existing problem.

> Ah, this'll be while m68k was still a cc0 target.  Yeah, I should probably
> just skip the whole pass for cc0.

Yes, tree of last friday or saturday or so.

And yup if you don't handle cc0 yet, yes you want to skip it completely :-)

> > - sh (that's sh4-linux):
> >
> > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 
> > 'snmp_get_cpu_field':
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a 
> > register to spill in class 'R0_REGS'
> >  1638 | }
> >   | ^
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32])) 
> > "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> >  (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > (nil
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier 
> > errors, bailing out
> 
> Would have to look more at this one.  Seems odd that it can't allocate
> R0 when it's already the destination and when R0 can't be live before
> the insn.  But there again, this is reload, so my enthusiasm for looking
> is a bit limited :-)

It wants to use r0 in some other insn, so it needs to spill it here, but
cannot.  This is what class_likely_spilled is for.

> > Looking at just binary size, which is a good stand-in for how many insns
> > it combined:
> >
> > R2
> >arm64   99.709%
> > ia64   99.651%
> > s390   99.734%
> >sparc   99.877%
> >  sparc64  100.011%
> >
> > (These are those that are not between 99.9% and 100.0%).
> >
> > So only sparc64 regressed, and just a tiny bit (I can look at what that
> > is, if there is interest).  But 32-bit sparc improved, and s390, arm64,
> > and ia64 got actual benefit.
> >
> > Again this is just code size, not analysing the actually changed code.
> 
> OK.  Certainly not an earth-shattering improvement then, but not
> entirely worthless either.

I usually takes 0.2% as "definitely useful" for combine improvements, so
there are a few targets that have that.  There can be improvements that
are important for a target even if they do not improve code size much,
of course, and it can identify weaknesses in the backend code, so you
always need to look at what really changes.

> I see combine also tests cannot_copy_insn_p.  I'm not sure whether that's
> appropriate for the new pass or not.  Arguably it's not copying the
> instruction, it's just moving it to be in parallel with something else.
> (But then that's largely true of the combine case too.)

combine tests this only for the cases where it *does* have to copy an
insn: when the dest of i0, i1, or i2 doesn't die, it is added as another
arm to the (parallel) result.
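
Schematically (not a real dump, just the shape of the result):

(parallel [(set (reg i2dest) <original i2 source>)
           (set (reg dest)   <combined expression>)])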


Segher


Re: Add a new combine pass

2019-11-25 Thread Segher Boessenkool
Hi!

On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > I am wondering the other way around :-)  Is what you do for combine2
> > something that would be more generally applicable/useful?  That's what
> > I'm trying to find out :-)
> >
> > What combine does could use some improvement, if you want to hear a
> > more direct motivation.  LOG_LINKS just skip references we cannot
> > handle (and some more), so we always have to do modified_between etc.,
> > which hurts.
> 
> The trade-offs behind the choice of representation are very specific
> to the pass.

Yes, but hopefully not so specific that every pass needs a completely
different representation ;-)

> >> >> Target       Tests     Delta    Best   Worst  Median
> >> >> avr-elf       1341   -111401  -13824     680     -10
> >> >
> >> > Things like this are kind of suspicious :-)
> >> 
> >> Yeah.  This mostly seems to come from mopping up the extra moves created
> >> by make_more_copies.  So we have combinations like:
> >> 
> >>58: r70:SF=r94:SF
> >>   REG_DEAD r94:SF
> >>60: r22:SF=r70:SF
> >>   REG_DEAD r70:SF
> >
> > Why didn't combine do this?  A target problem?
> 
> Seems to be because combine rejects hard-reg destinations whose classes
> are likely spilled (cant_combine_insn_p).

Ah, okay.  And that is required to prevent ICEs, in combine2 as well
then -- ICEs in RA.

There should be a better way to do this.

> This SF argument register
> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
> the combination based on POINTER_X_REGS being likely spilled.

static bool
avr_class_likely_spilled_p (reg_class_t c)
{
  return (c != ALL_REGS &&
          (AVR_TINY ? 1 : c != ADDW_REGS));
}

So this target severely shackles combine.  Does it have to?  If so, why
not with combine2?

> >> So there's only one case in which it isn't a win, but the number of
> >> tests is tiny.  So I agree there's no justification for trying this in
> >> combine proper as things stand (and I wasn't arguing otherwise FWIW).
> >> I'd still like to keep it in the new pass because it does help
> >> *sometimes* and there's no sign yet that it has a noticeable
> >> compile-time cost.
> >
> > So when does it help?  I can only think of cases where there are
> > problems elsewhere.
> 
> The full list of affected tests (all at -O2 -ftree-vectorize) is:

I'll have to look at this closer later, sorry.


Segher


Re: Add a new combine pass

2019-11-25 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Mon, Nov 18, 2019 at 05:55:13PM +, Richard Sandiford wrote:
>> Richard Sandiford  writes:
>> > (It's 23:35 local time, so it's still just about stage 1. :-))
>> 
>> Or actually, just under 1 day after end of stage 1.  Oops.
>> Could have sworn stage 1 ended on the 17th :-(  Only realised
>> I'd got it wrong when catching up on Saturday's email traffic.
>> 
>> And inevitably, I introduced a couple of stupid mistakes while
>> trying to clean the patch up for submission by that (non-)deadline.
>> Here's a version that fixes an inverted overlapping memref check
>> and that correctly prunes the use list for combined instructions.
>> (This last one is just a compile-time saving -- the old code was
>> correct, just suboptimal.)
>
> I've build the Linux kernel with the previous version, as well as this
> one.  R0 is unmodified GCC, R1 is the first patch, R2 is this one:
>
> (I've forced --param=run-combine=6 for R1 and R2):
> (Percentages are relative to R0):
>
>             R0        R1        R2        R1        R2
>alpha   6107088   6101088   6101088   99.902%   99.902%
>  arc   4008224   4006568   4006568   99.959%   99.959%
>  arm   9206728   9200936   9201000   99.937%   99.938%
>arm64  13056174  13018174  13018194   99.709%   99.709%
>armhf 0 0 0 0 0
>  c6x   2337237   2337077   2337077   99.993%   99.993%
> csky   3356602 0 0 0 0
>h8300   1166996   1166776   1166776   99.981%   99.981%
> i386  11352159 0 0 0 0
> ia64  18230640  18167000  18167000   99.651%   99.651%
> m68k   3714271 0 0 0 0
>   microblaze   4982749   4979945   4979945   99.944%   99.944%
> mips   8499309   8495205   8495205   99.952%   99.952%
>   mips64   7042036   7039816   7039816   99.968%   99.968%
>nds32   4486663 0 0 0 0
>nios2   3680001   3679417   3679417   99.984%   99.984%
> openrisc   4226076   4225868   4225868   99.995%   99.995%
>   parisc   7681895   7680063   7680063   99.976%   99.976%
> parisc64   8677077   8676581   8676581   99.994%   99.994%
>  powerpc  10687611  10682199  10682199   99.949%   99.949%
>powerpc64  17671082  17658570  17658570   99.929%   99.929%
>  powerpc64le  17671082  17658570  17658570   99.929%   99.929%
>  riscv32   1554938   1554758   1554758   99.988%   99.988%
>  riscv64   6634342   6632788   6632788   99.977%   99.977%
> s390  13049643  13014939  13014939   99.734%   99.734%
>   sh   3254743 0 0 0 0
>  shnommu   1632364   1632124   1632124   99.985%   99.985%
>sparc   4404993   4399593   4399593   99.877%   99.877%
>  sparc64   6796711   6797491   6797491  100.011%  100.011%
>   x86_64  19713174  19712817  19712817   99.998%   99.998%
>   xtensa 0 0 0 0 0

Thanks for running these.

> There are fivew new failures, with either of the combine2 patches.  And
> all five are actually different (different symptoms, at least):
>
> - csky fails on libgcc build:
>
> /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
> /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate 
> reloads for:
>  1405 | }
>   | ^
> (insn 199 86 87 8 (parallel [
> (set (reg:SI 101)
> (plus:SI (reg:SI 98)
> (const_int -32 [0xffe0])))
> (set (reg:CC 33 c)
> (lt:CC (plus:SI (reg:SI 98)
> (const_int -32 [0xffe0]))
> (const_int 0 [0])))
> ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
>  (nil))
> during RTL pass: reload
>
> Target problem?

Yeah, looks like it.  The pattern is:

(define_insn "*cskyv2_declt"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "const_int_operand" "Uh")))
   (set (reg:CC CSKY_CC_REGNUM)
        (lt:CC (plus:SI (match_dup 1) (match_dup 2))
               (const_int 0)))]
  "CSKY_ISA_FEATURE (2E3)"
  "declt\t%0, %1, %M2"
)

So the predicate accepts all const_ints but the constraint doesn't.
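
One conventional fix would be to tighten the predicate so it only accepts
what "Uh" can take, e.g. something like the sketch below (the predicate
name is made up; satisfies_constraint_Uh is the helper genpreds generates
for a constraint named Uh defined in constraints.md):

;; Hypothetical: accept only the constants the "Uh" constraint allows.
(define_predicate "cskyv2_declt_operand"
  (and (match_code "const_int")
       (match_test "satisfies_constraint_Uh (op)")))

with the pattern then using this predicate instead of const_int_operand.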

> - i386 goes into an infinite loop compiling, or at least an hour or so...
> Erm I forgot to record what it was compiling.  I did attach a GDB...  It
> is something from lra_create_live_ranges.

Hmm.

> - m68k:
>
> /home/segher/src/kernel/fs/exec.c: In function 'copy_strings':
> /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in 
> final_scan_insn_1, at final.c:3048
>   590 | }
>   | ^
> 0x10408307 final_scan_insn_1
> /home/segher/src/gcc/gcc/final.c:3048
> 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, 

Re: Add a new combine pass

2019-11-25 Thread Richard Sandiford
Segher Boessenkool  writes:
> On Thu, Nov 21, 2019 at 08:32:14PM +, Richard Sandiford wrote:
>> Segher Boessenkool  writes:
>> > It's not great if every pass invents its own version of some common
>> > infrastructure thing because that common one is not suitable.
>> >
>> > I.e., can this be fixed somehow?  Maybe just by having a restricted DU
>> > chains df problem?
>> 
>> Well, it'd probably make sense to make fwprop.c's approach available
>> as a "proper" df interface at some point.  Hopefully if anyone wants the
>> same thing as fwprop.c, they'd do that rather than copy the code. :-)
>
>> >> * Updating a full, ordered def-use chain after a move is a linear-time
>> >>   operation, so whatever happens, we'd need to apply some kind of limit
>> >>   on the number of uses we maintain, with something like that integer
>> >>   point range for the rest.
>
> Yeah.
>
>> >> * Once we've analysed the insn and built its def-use chains, we don't
>> >>   look at the df_refs again until we update the chains after a successful
>> >>   combination.  So it should be more efficient to maintain a small array
>> >>   of insn_info_rec pointers alongside the numerical range, rather than
>> >>   walk and pollute chains of df_refs and then link back the insn uids
>> >>   to the pass-local info.
>> >
>> > So you need something like combine's LOG_LINKS?  Not that handling those
>> > is not quadratic in the worst case, but in practice it works well.  And
>> > it *could* be made linear.
>> 
>> Not sure why what I've used isn't what I need though :-)
>
> I am wondering the other way around :-)  Is what you do for combine2
> something that would be more generally applicable/useful?  That's what
> I'm trying to find out :-)
>
> What combine does could use some improvement, if you want to hear a
> more direct motivation.  LOG_LINKS just skip references we cannot
> handle (and some more), so we always have to do modified_between etc.,
> which hurts.

The trade-offs behind the choice of representation are very specific
to the pass.  You'd only pick this if you wanted both to propagate
definitions into uses and to move insns around.  You'd also only pick
it if you were happy with tracking a small number of named uses per
definition.  I can't think of any other passes that would prefer this
over what they already use.  (Combine itself is an exception, since
the new pass started out as a deliberate attempt to start from scratch.)

>> >> Target       Tests     Delta    Best   Worst  Median
>> >> avr-elf       1341   -111401  -13824     680     -10
>> >
>> > Things like this are kind of suspicious :-)
>> 
>> Yeah.  This mostly seems to come from mopping up the extra moves created
>> by make_more_copies.  So we have combinations like:
>> 
>>58: r70:SF=r94:SF
>>   REG_DEAD r94:SF
>>60: r22:SF=r70:SF
>>   REG_DEAD r70:SF
>
> Why didn't combine do this?  A target problem?

Seems to be because combine rejects hard-reg destinations whose classes
are likely spilled (cant_combine_insn_p).  This SF argument register
happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject
the combination based on POINTER_X_REGS being likely spilled.

I think the same thing could happen on other targets, e.g. for
TAILCALL_ADDR_REGS on aarch64.

>> So there's only one case in which it isn't a win, but the number of
>> tests is tiny.  So I agree there's no justification for trying this in
>> combine proper as things stand (and I wasn't arguing otherwise FWIW).
>> I'd still like to keep it in the new pass because it does help
>> *sometimes* and there's no sign yet that it has a noticeable
>> compile-time cost.
>
> So when does it help?  I can only think of cases where there are
> problems elsewhere.

The full list of affected tests (all at -O2 -ftree-vectorize) is:

arc-elf         gcc.c-torture/compile/pr67506.c
avr-elf         gcc.dg/torture/pr77916.c
bpf-elf         gcc.dg/torture/vshuf-v8hi.c
bpf-elf         gcc.dg/torture/vshuf-v4si.c
bfin-elf        gcc.dg/torture/vshuf-v8qi.c
c6x-elf         gcc.c-torture/execute/991118-1.c
cr16-elf        gcc.c-torture/compile/pr82052.c
epiphany-elf    gcc.c-torture/execute/991118-1.c
epiphany-elf    gcc.dg/pr77664.c
epiphany-elf    gcc.dg/vect/vect-mult-pattern-2.c
epiphany-elf    gcc.dg/torture/vshuf-v8hi.c
epiphany-elf    gcc.dg/tree-ssa/pr77664.c
epiphany-elf    gcc.dg/tree-ssa/negneg-3.c
fr30-elf        gcc.dg/torture/vshuf-v4hi.c
fr30-elf        gcc.dg/torture/vshuf-v8hi.c
frv-linux-gnu   gcc.dg/torture/vshuf-v4hi.c
frv-linux-gnu   gcc.dg/torture/vshuf-v8hi.c
h8300-elf       gcc.c-torture/execute/2422-1.c
h8300-elf       gcc.dg/torture/pr77916.c
ia64-linux-gnu  gcc.c-torture/execute/ieee/pr30704.c
ia64-linux-gnu  gcc.dg/vect/pr49478.c
ia64-linux-gnu  gcc.dg/tree-ssa/ldist-16.c
i686-apple-darwin 

Re: Add a new combine pass

2019-11-23 Thread Nicholas Krause
On 11/23/19 6:09 PM, Segher Boessenkool wrote:

On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote:

Please just CC to this conversation as I keep getting removed.

Everyone who was on Cc: for this thread still is.  This is how email
works.  If you want to see everything on the list, subscribe to the
mailing list?


Segher


I was part of the original CCs on my comments, but there were (or seemed
to be) two diverging versions of the thread.  I would just like to keep
all comments merged in one thread, is all.

Sorry for the confusion Segher,
Nick



Re: Add a new combine pass

2019-11-23 Thread Segher Boessenkool
On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote:
> Please just CC to this conversation as I keep getting removed.

Everyone who was on Cc: for this thread still is.  This is how email
works.  If you want to see everything on the list, subscribe to the
mailing list?


Segher


Re: Add a new combine pass

2019-11-23 Thread Nicholas Krause
On 11/23/19 5:34 PM, Segher Boessenkool wrote:

Hi!

On Mon, Nov 18, 2019 at 05:55:13PM +, Richard Sandiford wrote:

Richard Sandiford  writes:

(It's 23:35 local time, so it's still just about stage 1. :-))

Or actually, just under 1 day after end of stage 1.  Oops.
Could have sworn stage 1 ended on the 17th :-(  Only realised
I'd got it wrong when catching up on Saturday's email traffic.

And inevitably, I introduced a couple of stupid mistakes while
trying to clean the patch up for submission by that (non-)deadline.
Here's a version that fixes an inverted overlapping memref check
and that correctly prunes the use list for combined instructions.
(This last one is just a compile-time saving -- the old code was
correct, just suboptimal.)

I've build the Linux kernel with the previous version, as well as this
one.  R0 is unmodified GCC, R1 is the first patch, R2 is this one:

(I've forced --param=run-combine=6 for R1 and R2):
(Percentages are relative to R0):

              R0        R1        R2        R1        R2
alpha   6107088   6101088   6101088   99.902%   99.902%
  arc   4008224   4006568   4006568   99.959%   99.959%
  arm   9206728   9200936   9201000   99.937%   99.938%
arm64  13056174  13018174  13018194   99.709%   99.709%
armhf 0 0 0 0 0
  c6x   2337237   2337077   2337077   99.993%   99.993%
 csky   3356602 0 0 0 0
h8300   1166996   1166776   1166776   99.981%   99.981%
 i386  11352159 0 0 0 0
 ia64  18230640  18167000  18167000   99.651%   99.651%
 m68k   3714271 0 0 0 0
   microblaze   4982749   4979945   4979945   99.944%   99.944%
 mips   8499309   8495205   8495205   99.952%   99.952%
   mips64   7042036   7039816   7039816   99.968%   99.968%
nds32   4486663 0 0 0 0
nios2   3680001   3679417   3679417   99.984%   99.984%
 openrisc   4226076   4225868   4225868   99.995%   99.995%
   parisc   7681895   7680063   7680063   99.976%   99.976%
 parisc64   8677077   8676581   8676581   99.994%   99.994%
  powerpc  10687611  10682199  10682199   99.949%   99.949%
powerpc64  17671082  17658570  17658570   99.929%   99.929%
  powerpc64le  17671082  17658570  17658570   99.929%   99.929%
  riscv32   1554938   1554758   1554758   99.988%   99.988%
  riscv64   6634342   6632788   6632788   99.977%   99.977%
 s390  13049643  13014939  13014939   99.734%   99.734%
   sh   3254743 0 0 0 0
  shnommu   1632364   1632124   1632124   99.985%   99.985%
sparc   4404993   4399593   4399593   99.877%   99.877%
  sparc64   6796711   6797491   6797491  100.011%  100.011%
   x86_64  19713174  19712817  19712817   99.998%   99.998%
   xtensa 0 0 0 0 0

0 means it didn't build.

armhf is probably my own problem, not sure yet.

xtensa starts with
/tmp/ccmJoY7l.s: Assembler messages:
/tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object 
file
and it doesn't get better.

My powerpc64 config actually built the powerpc64le config, since the
kernel for a while now looks at what the host system is for its defconfig.
Oh well, fixed now.

There are five new failures, with either of the combine2 patches.  And
all five are actually different (different symptoms, at least):

- csky fails on libgcc build:

/home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
/home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads 
for:
  1405 | }
   | ^
(insn 199 86 87 8 (parallel [
 (set (reg:SI 101)
 (plus:SI (reg:SI 98)
 (const_int -32 [0xffe0])))
 (set (reg:CC 33 c)
 (lt:CC (plus:SI (reg:SI 98)
 (const_int -32 [0xffe0]))
 (const_int 0 [0])))
 ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
  (nil))
during RTL pass: reload

Target problem?

- i386 goes into an infinite loop compiling, or at least an hour or so...
Erm I forgot to record what it was compiling.  I did attach a GDB...  It
is something from lra_create_live_ranges.

- m68k:

/home/segher/src/kernel/fs/exec.c: In function 'copy_strings':
/home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in 
final_scan_insn_1, at final.c:3048
   590 | }
   | ^
0x10408307 final_scan_insn_1
 /home/segher/src/gcc/gcc/final.c:3048
0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
 /home/segher/src/gcc/gcc/final.c:3152
0x10408797 final_1
 /home/segher/src/gcc/gcc/final.c:2020
0x104091f7 rest_of_handle_final
 /home/segher/src/gcc/gcc/final.c:4658
0x104091f7 execute

Re: Add a new combine pass

2019-11-23 Thread Segher Boessenkool
Hi!

On Mon, Nov 18, 2019 at 05:55:13PM +, Richard Sandiford wrote:
> Richard Sandiford  writes:
> > (It's 23:35 local time, so it's still just about stage 1. :-))
> 
> Or actually, just under 1 day after end of stage 1.  Oops.
> Could have sworn stage 1 ended on the 17th :-(  Only realised
> I'd got it wrong when catching up on Saturday's email traffic.
> 
> And inevitably, I introduced a couple of stupid mistakes while
> trying to clean the patch up for submission by that (non-)deadline.
> Here's a version that fixes an inverted overlapping memref check
> and that correctly prunes the use list for combined instructions.
> (This last one is just a compile-time saving -- the old code was
> correct, just suboptimal.)

I've build the Linux kernel with the previous version, as well as this
one.  R0 is unmodified GCC, R1 is the first patch, R2 is this one:

(I've forced --param=run-combine=6 for R1 and R2):
(Percentages are relative to R0):

              R0        R1        R2        R1        R2
   alpha   6107088   6101088   6101088   99.902%   99.902%
 arc   4008224   4006568   4006568   99.959%   99.959%
 arm   9206728   9200936   9201000   99.937%   99.938%
   arm64  13056174  13018174  13018194   99.709%   99.709%
   armhf 0 0 0 0 0
 c6x   2337237   2337077   2337077   99.993%   99.993%
csky   3356602 0 0 0 0
   h8300   1166996   1166776   1166776   99.981%   99.981%
i386  11352159 0 0 0 0
ia64  18230640  18167000  18167000   99.651%   99.651%
m68k   3714271 0 0 0 0
  microblaze   4982749   4979945   4979945   99.944%   99.944%
mips   8499309   8495205   8495205   99.952%   99.952%
  mips64   7042036   7039816   7039816   99.968%   99.968%
   nds32   4486663 0 0 0 0
   nios2   3680001   3679417   3679417   99.984%   99.984%
openrisc   4226076   4225868   4225868   99.995%   99.995%
  parisc   7681895   7680063   7680063   99.976%   99.976%
parisc64   8677077   8676581   8676581   99.994%   99.994%
 powerpc  10687611  10682199  10682199   99.949%   99.949%
   powerpc64  17671082  17658570  17658570   99.929%   99.929%
 powerpc64le  17671082  17658570  17658570   99.929%   99.929%
 riscv32   1554938   1554758   1554758   99.988%   99.988%
 riscv64   6634342   6632788   6632788   99.977%   99.977%
s390  13049643  13014939  13014939   99.734%   99.734%
  sh   3254743 0 0 0 0
 shnommu   1632364   1632124   1632124   99.985%   99.985%
   sparc   4404993   4399593   4399593   99.877%   99.877%
 sparc64   6796711   6797491   6797491  100.011%  100.011%
  x86_64  19713174  19712817  19712817   99.998%   99.998%
  xtensa 0 0 0 0 0

0 means it didn't build.

armhf is probably my own problem, not sure yet.

xtensa starts with
/tmp/ccmJoY7l.s: Assembler messages:
/tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object 
file
and it doesn't get better.

My powerpc64 config actually built the powerpc64le config, since the
kernel for a while now looks at what the host system is for its defconfig.
Oh well, fixed now.

There are five new failures, with either of the combine2 patches.  And
all five are actually different (different symptoms, at least):

- csky fails on libgcc build:

/home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
/home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads 
for:
 1405 | }
  | ^
(insn 199 86 87 8 (parallel [
(set (reg:SI 101)
(plus:SI (reg:SI 98)
(const_int -32 [0xffe0])))
(set (reg:CC 33 c)
(lt:CC (plus:SI (reg:SI 98)
(const_int -32 [0xffe0]))
(const_int 0 [0])))
]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
 (nil))
during RTL pass: reload

Target problem?

- i386 goes into an infinite loop compiling, or at least an hour or so...
Erm I forgot to record what it was compiling.  I did attach a GDB...  It
is something from lra_create_live_ranges.

- m68k:

/home/segher/src/kernel/fs/exec.c: In function 'copy_strings':
/home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in 
final_scan_insn_1, at final.c:3048
  590 | }
  | ^
0x10408307 final_scan_insn_1
/home/segher/src/gcc/gcc/final.c:3048
0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
/home/segher/src/gcc/gcc/final.c:3152
0x10408797 final_1
/home/segher/src/gcc/gcc/final.c:2020
0x104091f7 rest_of_handle_final
/home/segher/src/gcc/gcc/final.c:4658
0x104091f7 execute
/home/segher/src/gcc/gcc/final.c:4736

and that line is

Re: Add a new combine pass

2019-11-22 Thread Segher Boessenkool
On Thu, Nov 21, 2019 at 08:32:14PM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > It's not great if every pass invents its own version of some common
> > infrastructure thing because that common one is not suitable.
> >
> > I.e., can this be fixed somehow?  Maybe just by having a restricted DU
> > chains df problem?
> 
> Well, it'd probably make sense to make fwprop.c's approach available
> as a "proper" df interface at some point.  Hopefully if anyone wants the
> same thing as fwprop.c, they'd do that rather than copy the code. :-)

> >> * Updating a full, ordered def-use chain after a move is a linear-time
> >>   operation, so whatever happens, we'd need to apply some kind of limit
> >>   on the number of uses we maintain, with something like that integer
> >>   point range for the rest.

Yeah.

> >> * Once we've analysed the insn and built its def-use chains, we don't
> >>   look at the df_refs again until we update the chains after a successful
> >>   combination.  So it should be more efficient to maintain a small array
> >>   of insn_info_rec pointers alongside the numerical range, rather than
> >>   walk and pollute chains of df_refs and then link back the insn uids
> >>   to the pass-local info.
> >
> > So you need something like combine's LOG_LINKS?  Not that handling those
> > is not quadratic in the worst case, but in practice it works well.  And
> > it *could* be made linear.
> 
> Not sure why what I've used isn't what I need though :-)

I am wondering the other way around :-)  Is what you do for combine2
something that would be more generally applicable/useful?  That's what
I'm trying to find out :-)

What combine does could use some improvement, if you want to hear a
more direct motivation.  LOG_LINKS just skip references we cannot
handle (and some more), so we always have to do modified_between etc.,
which hurts.
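
(As a rough sketch of the cost being described here, with toy types
rather than combine's real ones: the check has to walk every insn
between the def and the use, so doing it for combinations that span
long distances adds up.)

#include <cstddef>
#include <vector>

struct toy_insn { std::vector<int> sets; };  /* regs written by the insn */

/* Walk the insns strictly between DEF and USE; the work grows with the
   distance between the two, which is what makes repeated checks hurt.  */
static bool
toy_modified_between (int reg, const std::vector<toy_insn> &insns,
                      std::size_t def, std::size_t use)
{
  for (std::size_t i = def + 1; i < use; ++i)
    for (int r : insns[i].sets)
      if (r == reg)
        return true;
  return false;
}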

> >> Target Tests   Delta   Best   Worst  Median
> >> avr-elf 1341 -111401  -13824 680 -10
> >
> > Things like this are kind of suspicious :-)
> 
> Yeah.  This mostly seems to come from mopping up the extra moves created
> by make_more_copies.  So we have combinations like:
> 
>58: r70:SF=r94:SF
>   REG_DEAD r94:SF
>60: r22:SF=r70:SF
>   REG_DEAD r70:SF

Why didn't combine do this?  A target problem?

> So there's only one case in which it isn't a win, but the number of
> tests is tiny.  So I agree there's no justification for trying this in
> combine proper as things stand (and I wasn't arguing otherwise FWIW).
> I'd still like to keep it in the new pass because it does help
> *sometimes* and there's no sign yet that it has a noticeable
> compile-time cost.

So when does it help?  I can only think of cases where there are
problems elsewhere.

> It might also be interesting to see how much difference it makes for
> run-combine=4 (e.g. to see how much it makes up for the current 2-insn
> limit)...

Numbers are good :-)


Segher


Re: Add a new combine pass

2019-11-22 Thread Segher Boessenkool
On Thu, Nov 21, 2019 at 07:41:56PM +, Richard Sandiford wrote:
> Nicholas Krause  writes:
> >> +/* The instructions we're combining, in program order.  */
> >> +insn_info_rec *sequence[MAX_COMBINE_INSNS];
> > Can't we make this a vec in order to grow to arbitrary lengths and just
> > loop through, merging on instructions in the vec as required?
> 
> Yeah, extending this to combining more than 2 instructions would be
> future work.  When that happens, this would likely end up becoming an
> auto_vec.  I imagine there would
> still be a fairly low compile-time limit on the number of combinations
> though.  E.g. current combine has a limit of 4, with even 4 being
> restricted to certain high-value cases.  I don't think I've ever
> seen a case where 5 or more would help.

And sometimes it looks like 4 would help, but often this is because of a
limitation elsewhere (like, it should have done a 2->2 before, for example).

4 _does_ help quite a bit with irregular instruction sets.  It could
sometimes help with RMW insns, too, but there are other problems with
that.

What you see a lot where 4 "helps" is where it really should combine
with just 3 of them, but something prevents that, often cost, while
throwing in a 4th insn tilts the balance just enough.  We used to have
a lot of that with 3-insn combinations as well, and probably still have
some.


Segher


Re: Add a new combine pass

2019-11-21 Thread Richard Sandiford
Segher Boessenkool  writes:
> On Wed, Nov 20, 2019 at 06:20:34PM +, Richard Sandiford wrote:
>> > Why don't you use DF for the DU chains?
>> 
>> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
>
> Oh, wow.
>
>> fwprop.c gets around that by using the MD problem and having its own
>> dominator walker to calculate limited def-use chains:
>> 
>>   /* We use the multiple definitions problem to compute our restricted
>>  use-def chains.  */
>
> It's not great if every pass invents its own version of some common
> infrastructure thing because that common one is not suitable.
>
> I.e., can this be fixed somehow?  Maybe just by having a restricted DU
> chains df problem?

Well, it'd probably make sense to make fwprop.c's approach available
as a "proper" df interface at some point.  Hopefully if anyone wants the
same thing as fwprop.c, they'd do that rather than copy the code. :-)

>> So taking that approach here would still require some amount of
>> roll-your-own.  Other reasons are:
>> 
>> * Even what fwprop does is more elaborate than we need for now.
>> 
>> * We need to handle memory too, and it's nice to be able to handle
>>   it in the same way as registers.
>> 
>> * Updating a full, ordered def-use chain after a move is a linear-time
>>   operation, so whatever happens, we'd need to apply some kind of limit
>>   on the number of uses we maintain, with something like that integer
>>   point range for the rest.
>> 
>> * Once we've analysed the insn and built its def-use chains, we don't
>>   look at the df_refs again until we update the chains after a successful
>>   combination.  So it should be more efficient to maintain a small array
>>   of insn_info_rec pointers alongside the numerical range, rather than
>>   walk and pollute chains of df_refs and then link back the insn uids
>>   to the pass-local info.
>
> So you need something like combine's LOG_LINKS?  Not that handling those
> is not quadratic in the worst case, but in practice it works well.  And
> it *could* be made linear.

Not sure why what I've used isn't what I need though :-)  If it's an
array vs. linked-list thing, then for the multi-use case, we need
two sets of link pointers, one for "next use of the same resource"
and one for "next use in this instruction".  Then we need the payload of
the list node itself.  For the small number of entries we're talking about,
using null-terminated arrays of "things that this instruction uses"
and "instructions that use this resource" should be more efficient than
pointer-chasing, and occupies the same space as the link pointers
(i.e. saves the extra payload).

We also need to be able to walk in both directions, to answer the
questions:

- which insns can I combine with this definition?
- where is this value of a resource defined?
- where are the uses of this resource?
- where was the previous definition of this resource, and where
  was its last use?

So if we're comparing it to existing linked-list GCC structures,
it's more similar to df_ref (see above for why that seemed like
a bad idea) or -- more light-weight -- dep_link_t in the scheduler.

And both the array and linked-list approaches still need to fall back to
the simple live range once a certain threshold is hit.
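
(For concreteness, a minimal sketch of the kind of layout being
described; insn_info_rec, live_range_rec, NUM_RANGE_USERS and
last_extra_use appear in the patch, but the exact fields and the other
names here are guesses for illustration only.)

constexpr unsigned int NUM_RANGE_USERS = 4;        /* assumed value */
constexpr unsigned int MAX_RECORDED_ACCESSES = 8;  /* invented bound */

struct insn_info_rec;

struct live_range_rec
{
  /* Conservative numerical range covering all uses, including the ones
     that no longer fit in users[] below.  */
  unsigned int first_point;
  unsigned int last_extra_use;

  /* Null-terminated array of explicitly tracked users; walking it is
     bounded by NUM_RANGE_USERS rather than by chain length.  */
  insn_info_rec *users[NUM_RANGE_USERS];
};

struct insn_info_rec
{
  unsigned int point;   /* position of the insn in the block */

  /* Null-terminated arrays standing in for the two sets of link
     pointers: "resources this insn uses" and "resources it defines".  */
  live_range_rec *uses[MAX_RECORDED_ACCESSES];
  live_range_rec *defs[MAX_RECORDED_ACCESSES];
};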

>> The second set is for:
>> 
>> (B) --param run-combine=6 (both passes), use-use combinations only
>> (C) --param run-combine=6 (both passes), no restrictions
>> 
>> Target Tests   Delta   Best   Worst  Median
>> ====== =====   =====   ====   =====  ======
>> aarch64-linux-gnu272   -3844-585  18  -1
>> aarch64_be-linux-gnu 190   -3336-370  18  -1
>> alpha-linux-gnu  401   -2735-370  22  -2
>> amdgcn-amdhsa1881867-4841259  -1
>> arc-elf  257   -1498-650  54  -1
>> arm-linux-gnueabi168   -1117-612 680  -1
>> arm-linux-gnueabihf  168   -1117-612 680  -1
>> avr-elf 1341 -111401  -13824 680 -10
>
> Things like this are kind of suspicious :-)

Yeah.  This mostly seems to come from mopping up the extra moves created
by make_more_copies.  So we have combinations like:

   58: r70:SF=r94:SF
  REG_DEAD r94:SF
   60: r22:SF=r70:SF
  REG_DEAD r70:SF

(r22 is a hard reg, the others are pseudos) which produces:

std Y+1,r22
std Y+2,r23
std Y+3,r24
std Y+4,r25
-   ldd r22,Y+1
-   ldd r23,Y+2
-   ldd r24,Y+3
-   ldd r25,Y+4

On the REG_EQUAL thing: you're right that it doesn't make much difference
for run-combine=6:

Target          Tests   Delta    Best   Worst  Median
======          =====   =====    ====   =====  ======
arc-elf             1      -1      -1      -1      -1
avr-elf             1      -1      -1      -1      -1
bfin-elf            1      -1      -1      -1      -1
bpf-elf             2      -6      -5      -1      -5
c6x-elf  

Re: Add a new combine pass

2019-11-21 Thread Richard Sandiford
Hi Nick,

Thanks for the comments.

Nicholas Krause  writes:
>> Index: gcc/passes.def
>> ===
>> --- gcc/passes.def   2019-10-29 08:29:03.224443133 +
>> +++ gcc/passes.def   2019-11-17 23:15:31.200500531 +
>> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
>> NEXT_PASS (pass_inc_dec);
>> NEXT_PASS (pass_initialize_regs);
>> NEXT_PASS (pass_ud_rtl_dce);
>> +  NEXT_PASS (pass_combine2_before);
>> NEXT_PASS (pass_combine);
>> +  NEXT_PASS (pass_combine2_after);
>> NEXT_PASS (pass_if_after_combine);
>> NEXT_PASS (pass_jump_after_combine);
>> NEXT_PASS (pass_partition_blocks);
>> Index: gcc/timevar.def
> This seems to really be two passes, or at least two functions.  Just a nit,
> but you may want to state that, as I don't recall reading it.

It's really two instances of the same pass, but yeah, each instance
goes under a different name.  This is because each instance needs to
know which bit of the run-combine value it should be testing:

>> The patch adds two instances of the new pass: one before combine and
>> one after it.  By default both are disabled, but this can be changed
>> using the new 3-bit run-combine param, where:
>>
>> - bit 0 selects the new pre-combine pass
>> - bit 1 selects the main combine pass
>> - bit 2 selects the new post-combine pass

So bit 0 is pass_combine2_before, bit 1 is pass_combine and bit 2 is
pass_combine2_after.  But the passes are identical apart from the choice
of bit they test.
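
(A hypothetical sketch of that gating; the function and parameter names
are invented, only the bit assignment comes from the description above.)

/* Each pass instance tests one bit of --param=run-combine:
   bit 0 = pass_combine2_before, bit 1 = pass_combine,
   bit 2 = pass_combine2_after.  */
static bool
combine_instance_enabled_p (unsigned int run_combine, unsigned int bit)
{
  return (run_combine >> bit) & 1;
}

/* So run-combine=6 enables bits 1 and 2: the existing combine pass plus
   the new post-combine instance.  */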

>> +  /* Describes one attempt to combine instructions.  */
>> +  struct combination_attempt_rec
>> +  {
>> +/* The instruction that we're currently trying to optimize.
>> +   If the combination succeeds, we'll use this insn_info_rec
>> +   to describe the new instruction.  */
>> +insn_info_rec *new_home;
>> +
>> +/* The instructions we're combining, in program order.  */
>> +insn_info_rec *sequence[MAX_COMBINE_INSNS];
> Can't we make this a vec in order to grow to arbitrary lengths and just
> loop through, merging on instructions in the vec as required?

Yeah, extending this to combining more than 2 instructions would be
future work.  When that happens, this would likely end up becoming an
auto_vec.  I imagine there would
still be a fairly low compile-time limit on the number of combinations
though.  E.g. current combine has a limit of 4, with even 4 being
restricted to certain high-value cases.  I don't think I've ever
seen a case where 5 or more would help.

>> +/* Return true if we know that USER is the last user of RANGE.  */
>> +
>> +bool
>> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
>> +{
>> +  if (range->last_extra_use <= user->point)
>> +return false;
>> +
>> +  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
>> +if (range->users[i] == user)
>> +  return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
> Small nit and I could be wrong but do:
>
> return !range->users[i + 1] || i == NUM_RANGE_USERS - 1;
>
> Based on your code it seems that getting to NUM_RANGE_USERS is far
> less likely.

The problem is that we'll then be accessing outside the users[] array
when i == NUM_RANGE_USERS - 1, so we have to check the limit first.
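
(Spelled out as a tiny standalone version, just to make the evaluation
order explicit; NUM_RANGE_USERS and the users[] array mirror the patch,
the rest is invented.)

#include <cstddef>

constexpr std::size_t NUM_RANGE_USERS = 4;   /* assumed value */

static bool
last_tracked_user_p (void *const users[], std::size_t i)
{
  /* The index test must come first: users[i + 1] may only be read when
     i < NUM_RANGE_USERS - 1, so swapping the two operands of || would
     index one element past the end of the array.  */
  return i == NUM_RANGE_USERS - 1 || users[i + 1] == nullptr;
}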

Thanks,
Richard


Re: Add a new combine pass

2019-11-21 Thread Nicholas Krause




On 11/17/19 6:35 PM, Richard Sandiford wrote:

(It's 23:35 local time, so it's still just about stage 1. :-))

While working on SVE, I've noticed several cases in which we fail
to combine instructions because the combined form would need to be
placed earlier in the instruction stream than the last of the
instructions being combined.  This includes one very important
case in the handling of the first fault register (FFR).

Combine currently requires the combined instruction to live at the same
location as i3.  I thought about trying to relax that restriction, but it
would be difficult to do with the current pass structure while keeping
everything linear-ish time.

So this patch instead goes for an option that has been talked about
several times over the years: writing a new combine pass that just
does instruction combination, and not all the other optimisations
that have been bolted onto combine over time.  E.g. it deliberately
doesn't do things like nonzero-bits tracking, since that really ought
to be a separate, more global, optimisation.

This is still far from being a realistic replacement for even the
combine parts of the current combine pass.  E.g.:

- it only handles combinations that can be built up from individual
   two-instruction combinations.

- it doesn't allow new hard register clobbers to be added.

- it doesn't have the special treatment of CC operations.

- etc.

But we have to start somewhere.

On a more positive note, the pass handles things that the current
combine pass doesn't:

- the main motivating feature mentioned above: it works out where
   the combined instruction could validly live and moves it there
   if necessary.  If there are a range of valid places, it tries
   to pick the best one based on register pressure (although only
   with a simple heuristic for now).

- once it has combined two instructions, it can try combining the
   result with both later and earlier code, i.e. it can combine
   in both directions.

- it tries using REG_EQUAL notes for the final instruction.

- it can parallelise two independent instructions that both read from
   the same register or both read from memory.

This last feature is useful for generating more load-pair combinations
on AArch64.  In some cases it can also produce more store-pair combinations,
but only for consecutive stores.  However, since the pass currently does
this in a very greedy, peephole way, it only allows load/store-pair
combinations if the first memory access has a higher alignment than
the second, i.e. if we can be sure that the combined access is naturally
aligned.  This should help it to make better decisions than the post-RA
peephole pass in some cases while not being too aggressive.

The pass is supposed to be linear time without debug insns.
It only tries a constant number C of combinations per instruction
and its bookkeeping updates are constant-time.  Once it has combined two
instructions, it'll try up to C combinations on the result, but this can
be counted against the instruction that was deleted by the combination
and so effectively just doubles the constant.  (Note that C depends
on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)

Unfortunately, debug updates via propagate_for_debug are more expensive.
This could probably be fixed if the pass did more to track debug insns
itself, but using propagate_for_debug matches combine's behaviour.

The patch adds two instances of the new pass: one before combine and
one after it.  By default both are disabled, but this can be changed
using the new 3-bit run-combine param, where:

- bit 0 selects the new pre-combine pass
- bit 1 selects the main combine pass
- bit 2 selects the new post-combine pass

The idea is that run-combine=3 can be used to see which combinations
are missed by the new pass, while run-combine=6 (which I hope to be
the production setting for AArch64 at -O2+) just uses the new pass
to mop up cases that normal combine misses.  Maybe in some distant
future, the pass will be good enough for run-combine=[14] to be a
realistic option.

I ended up having to add yet another validate_simplify_* routine,
this time to do the equivalent of:

newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
validate_change (insn, loc, newx, 1);

but in a more memory-efficient way.  validate_replace_rtx isn't suitable
because it deliberately only tries simplifications in limited cases:

   /* Do changes needed to keep rtx consistent.  Don't do any other
  simplifications, as it is not our job.  */

And validate_simplify_insn isn't useful for this case because it works
on patterns that have already had changes made to them and expects
those patterns to be valid rtxes.  simplify-replace operations instead
need to simplify as they go, when the original modes are still to hand.

As far as compile-time goes, I tried compiling optabs.ii at -O2
with an --enable-checking=release compiler:

run-combine=2 (normal combine):  100.0% (baseline)
run-combine=4 

Re: Add a new combine pass

2019-11-20 Thread Segher Boessenkool
On Wed, Nov 20, 2019 at 06:20:34PM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > So this would work if you had pseudos here, instead of the hard reg?
> > Because it is a hard reg it is the same number in both places, making it
> > hard to move.
> 
> Yeah, probably.  But the hard reg is a critical part of this.
> Going back to the example:
> 
> (set (reg/v:VNx16BI 102 [ ok ])
>  (reg:VNx16BI 85 ffrt))
> (set (reg:VNx16BI 85 ffrt)
>  (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
> (set (reg:CC_NZC 66 cc)
>  (unspec:CC_NZC
>[(reg:VNx16BI 106) repeated x2
> (const_int 1 [0x1])
> (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))
> 
> FFR is the real first fault register.  FFRT is actually a fake register
> whose only purpose is to describe the dependencies (in rtl) between writes
> to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
> depends on having only one fixed FFRT register.

Right.  The reason this cannot work in combine is that combine always
combines to just *one* insn, at i3; later, if it turns out that it needs
to split it, it can put something at i2.  But that doesn't even happen
here, only the first and the last of those three insns are what is
combined.

It is important combine only moves things forward in the insn stream, to
make sure this whole process is finite.  Or this was true years ago, at
least :-)

> > Why don't you use DF for the DU chains?
> 
> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.

Oh, wow.

> fwprop.c gets around that by using the MD problem and having its own
> dominator walker to calculate limited def-use chains:
> 
>   /* We use the multiple definitions problem to compute our restricted
>  use-def chains.  */

It's not great if every pass invents its own version of some common
infrastructure thing because that common one is not suitable.

I.e., can this be fixed somehow?  Maybe just by having a restricted DU
chains df problem?

> So taking that approach here would still require some amount of
> roll-your-own.  Other reasons are:
> 
> * Even what fwprop does is more elaborate than we need for now.
> 
> * We need to handle memory too, and it's nice to be able to handle
>   it in the same way as registers.
> 
> * Updating a full, ordered def-use chain after a move is a linear-time
>   operation, so whatever happens, we'd need to apply some kind of limit
>   on the number of uses we maintain, with something like that integer
>   point range for the rest.
> 
> * Once we've analysed the insn and built its def-use chains, we don't
>   look at the df_refs again until we update the chains after a successful
>   combination.  So it should be more efficient to maintain a small array
>   of insn_info_rec pointers alongside the numerical range, rather than
>   walk and pollute chains of df_refs and then link back the insn uids
>   to the pass-local info.

So you need something like combine's LOG_LINKS?  Not that handling those
is not quadratic in the worst case, but in practice it works well.  And
it *could* be made linear.

> >> Tracking limited def-use chains is what makes this last bit easy.
> >> We can just try parallelising two instructions from the (bounded) list
> >> of uses.  And for this case there's not any garbage rtl involved, since
> >> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> >> all in the recog call (which would obviously mount up if we went
> >> overboard).
> >
> > *All* examples above and below are just this.
> 
> Yeah, the powerpc and s390x examples were.  The motivating FFR example
> above isn't though: it's a def-use combination in parallel with the
> existing definition.

Right, good point :-)

> >> >> To get a feel for the effect on multiple targets, I did my usual
> >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> >> using -O2 -ftree-vectorize:
> >> >
> >> > One problem with this is that these are very short functions on average.
> >> 
> >> There are some long ones too :-)
> >
> > Yes, but this isn't a good stand-in for representative programs.
> 
> Right.  And number of lines of asm isn't a good stand-in for anything much.

For combine, number of insns generated is a surprisingly good measure of
how it performed.  Sometimes not, when it goes over a border of an
inlining decision, say, or bb-reorder decides to duplicate more because
it is cheaper now.

> Like I say, the whole thing is just to get a feel, on tests that are readily
> to hand and are easy to compile without a full toolchain.

Absolutely.  But I have no experience with using your test set, so the
numbers do not necessarily mean so much to me :-)

> > So I'd love to see statistics for *only* combining two uses of the same
> > thing, this is something combine cannot do, and arguably *shouldn't* do!
> 
> OK, here are two sets of results.  The first is for:

Re: Add a new combine pass

2019-11-20 Thread Richard Sandiford
Segher Boessenkool  writes:
>>   /* Make sure that the value that is to be substituted for the register
>>   does not use any registers whose values alter in between.  However,
>>   If the insns are adjacent, a use can't cross a set even though we
>>   think it might (this can happen for a sequence of insns each setting
>>   the same destination; last_set of that register might point to
>>   a NOTE).  If INSN has a REG_EQUIV note, the register is always
>>   equivalent to the memory so the substitution is valid even if there
>>   are intervening stores.  Also, don't move a volatile asm or
>>   UNSPEC_VOLATILE across any other insns.  */
>>   || (! all_adjacent
>>&& (((!MEM_P (src)
>>  || ! find_reg_note (insn, REG_EQUIV, src))
>> && modified_between_p (src, insn, i3))
>>|| (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>>|| GET_CODE (src) == UNSPEC_VOLATILE))
>
> So this would work if you had pseudos here, instead of the hard reg?
> Because it is a hard reg it is the same number in both places, making it
> hard to move.

Yeah, probably.  But the hard reg is a critical part of this.
Going back to the example:

(set (reg/v:VNx16BI 102 [ ok ])
 (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
 (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
 (unspec:CC_NZC
   [(reg:VNx16BI 106) repeated x2
(const_int 1 [0x1])
(reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

FFR is the real first fault register.  FFRT is actually a fake register
whose only purpose is to describe the dependencies (in rtl) between writes
to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
depends on having only one fixed FFRT register.

>> > How are dependencies represented in your new pass?  If it just does
>> > walks over the insn stream for everything, you get quadratic complexity
>> > if you move insns backwards.  We have that in combine already, mostly
>> > from modified_between_p, but that is limited because of how LOG_LINKS
>> > work, and we have been doing this for so long and there are no problems
>> > found with it, so it must work in practice.  But I am worried about it
>> > when moving insns back an unlimited distance.
>> 
>> It builds def-use chains, but using a constant limit on the number of
>> explicitly-recorded uses.  All other uses go in a numerical live range
>> from which they (conservatively) never escape.  The def-use chains
>> represent memory as a single entity, a bit like in gimple.
>
> Ah.  So that range thing ensures correctness.

Yeah.

> Why don't you use DF for the DU chains?

The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
fwprop.c gets around that by using the MD problem and having its own
dominator walker to calculate limited def-use chains:

  /* We use the multiple definitions problem to compute our restricted
 use-def chains.  */

So taking that approach here would still require some amount of
roll-your-own.  Other reasons are:

* Even what fwprop does is more elaborate than we need for now.

* We need to handle memory too, and it's nice to be able to handle
  it in the same way as registers.

* Updating a full, ordered def-use chain after a move is a linear-time
  operation, so whatever happens, we'd need to apply some kind of limit
  on the number of uses we maintain, with something like that integer
  point range for the rest.

* Once we've analysed the insn and built its def-use chains, we don't
  look at the df_refs again until we update the chains after a successful
  combination.  So it should be more efficient to maintain a small array
  of insn_info_rec pointers alongside the numerical range, rather than
  walk and pollute chains of df_refs and then link back the insn uids
  to the pass-local info.

>> >> - it tries using REG_EQUAL notes for the final instruction.
>> >
>> > And that.
>> 
>> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
>> with i3's REG_EQUAL note and combining into that.  Does combine do that?
>> I couldn't see it, and in:
>> 
>>https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>> 
>> you seemed to reject the idea of allowing it.
>
> Yes, I still do.  Do you have an example where it helps?

I'll run another set of tests for that.

>> >> - it can parallelise two independent instructions that both read from
>> >>   the same register or both read from memory.
>> >
>> > That only if somehow there is a link between the two (so essentially
>> > never).  The only combinations tried by combine are those via LOG_LINKs,
>> > which are between a SET and the first corresponding use.  This is a key
>> > factor that makes it kind of linear (instead of exponential) complexity.
>> 
>> Tracking limited def-use chains is what makes this last bit easy.
>> We can just try parallelising two instructions from the (bounded) list
>> of uses.  And for this case 

Re: Add a new combine pass

2019-11-19 Thread Segher Boessenkool
On Tue, Nov 19, 2019 at 11:33:13AM +, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
> >> While working on SVE, I've noticed several cases in which we fail
> >> to combine instructions because the combined form would need to be
> >> placed earlier in the instruction stream than the last of the
> >> instructions being combined.  This includes one very important
> >> case in the handling of the first fault register (FFR).
> >
> > Do you have an example of that?
> 
> It's difficult to share realistic examples at this stage since this
> isn't really the right forum for making them public for the first time.

Oh I'm very sorry.  In the future, just say "Future" and I know what
you mean :-)

>   /* Make sure that the value that is to be substituted for the register
>does not use any registers whose values alter in between.  However,
>If the insns are adjacent, a use can't cross a set even though we
>think it might (this can happen for a sequence of insns each setting
>the same destination; last_set of that register might point to
>a NOTE).  If INSN has a REG_EQUIV note, the register is always
>equivalent to the memory so the substitution is valid even if there
>are intervening stores.  Also, don't move a volatile asm or
>UNSPEC_VOLATILE across any other insns.  */
>   || (! all_adjacent
> && (((!MEM_P (src)
>   || ! find_reg_note (insn, REG_EQUIV, src))
>  && modified_between_p (src, insn, i3))
> || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
> || GET_CODE (src) == UNSPEC_VOLATILE))

So this would work if you had pseudos here, instead of the hard reg?
Because it is a hard reg it is the same number in both places, making it
hard to move.

> > How are dependencies represented in your new pass?  If it just does
> > walks over the insn stream for everything, you get quadratic complexity
> > if you move insns backwards.  We have that in combine already, mostly
> > from modified_between_p, but that is limited because of how LOG_LINKS
> > work, and we have been doing this for so long and there are no problems
> > found with it, so it must work in practice.  But I am worried about it
> > when moving insns back an unlimited distance.
> 
> It builds def-use chains, but using a constant limit on the number of
> explicitly-recorded uses.  All other uses go in a numerical live range
> from which they (conservatively) never escape.  The def-use chains
> represent memory as a single entity, a bit like in gimple.

Ah.  So that range thing ensures correctness.

Why don't you use DF for the DU chains?

> >> - it tries using REG_EQUAL notes for the final instruction.
> >
> > And that.
> 
> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
> with i3's REG_EQUAL note and combining into that.  Does combine do that?
> I couldn't see it, and in:
> 
>https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
> 
> you seemed to reject the idea of allowing it.

Yes, I still do.  Do you have an example where it helps?

> >> - it can parallelise two independent instructions that both read from
> >>   the same register or both read from memory.
> >
> > That only if somehow there is a link between the two (so essentially
> > never).  The only combinations tried by combine are those via LOG_LINKs,
> > which are between a SET and the first corresponding use.  This is a key
> > factor that makes it kind of linear (instead of exponential) complexity.
> 
> Tracking limited def-use chains is what makes this last bit easy.
> We can just try parallelising two instructions from the (bounded) list
> of uses.  And for this case there's not any garbage rtl involved, since
> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> all in the recog call (which would obviously mount up if we went
> overboard).

*All* examples above and below are just this.

If you disable everything else, what do the statistics look like then?

> > One thing I want to do is some mini-combine after every split, probably
> > only with the insns new from the split.  But we have no cfglayout mode
> > anymore then, and only hard regs (except in the first split pass, which
> > is just a little later than your new pass).
> 
> Yeah, sounds like it could be useful.  I guess there'd need to be
> an extra condition on the combination that the new insn can't be
> immediately split.

It would run *after* split.  Not interleaved with it.

> > And amount of garbage produced?
> 
> If -ftime-report stats are accurate, then the total amount of
> memory allocated is:
> 
> run-combine=2 (normal combine): 1793 kB
> run-combine=4 (new pass only):98 kB
> run-combine=6 (both passes):1871 kB (new pass accounts for 78 kB)
> 
> But again that's not a fair comparison when the main combine pass does more.

The way combine does SUBST is pretty 

Re: Add a new combine pass

2019-11-19 Thread Richard Sandiford
Segher Boessenkool  writes:
> On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
>> While working on SVE, I've noticed several cases in which we fail
>> to combine instructions because the combined form would need to be
>> placed earlier in the instruction stream than the last of the
>> instructions being combined.  This includes one very important
>> case in the handling of the first fault register (FFR).
>
> Do you have an example of that?

It's difficult to share realistic examples at this stage since this
isn't really the right forum for making them public for the first time.
But in rtl terms we have:

(set (reg/v:VNx16BI 102 [ ok ])
 (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
 (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
 (unspec:CC_NZC
   [(reg:VNx16BI 106) repeated x2
(const_int 1 [0x1])
(reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

and want to combine the first and third instruction at the site of the
first instruction.  Current combine gives:

Trying 18 -> 24:
   18: r102:VNx16BI=ffrt:VNx16BI
   24: cc:CC_NZC=unspec[r106:VNx16BI,r106:VNx16BI,0x1,r102:VNx16BI] 104
Can't combine i2 into i3

because of:

  /* Make sure that the value that is to be substituted for the register
 does not use any registers whose values alter in between.  However,
 If the insns are adjacent, a use can't cross a set even though we
 think it might (this can happen for a sequence of insns each setting
 the same destination; last_set of that register might point to
 a NOTE).  If INSN has a REG_EQUIV note, the register is always
 equivalent to the memory so the substitution is valid even if there
 are intervening stores.  Also, don't move a volatile asm or
 UNSPEC_VOLATILE across any other insns.  */
  || (! all_adjacent
  && (((!MEM_P (src)
|| ! find_reg_note (insn, REG_EQUIV, src))
   && modified_between_p (src, insn, i3))
  || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
  || GET_CODE (src) == UNSPEC_VOLATILE))

>> Combine currently requires the combined instruction to live at the same
>> location as i3.
>
> Or i2 and i3.
>
>> I thought about trying to relax that restriction, but it
>> would be difficult to do with the current pass structure while keeping
>> everything linear-ish time.
>
> s/difficult/impossible/, yes.
>
> A long time ago we had to only move insns forward for correctness even,
> but that should no longer be required, combine always is finite by other
> means now.
>
>> So this patch instead goes for an option that has been talked about
>> several times over the years: writing a new combine pass that just
>> does instruction combination, and not all the other optimisations
>> that have been bolted onto combine over time.  E.g. it deliberately
>> doesn't do things like nonzero-bits tracking, since that really ought
>> to be a separate, more global, optimisation.
>
> In my dreams tracking nonzero bits would be a dataflow problem.
>
>> This is still far from being a realistic replacement for even the
>> combine parts of the current combine pass.  E.g.:
>> 
>> - it only handles combinations that can be built up from individual
>>   two-instruction combinations.
>
> And combine does any of {2,3,4}->{1,2} combinations, and it also can
> modify a third insn ("other_insn").  For the bigger ->1 combos, if it
> *can* be decomposed in a bunch of 2->1, then those result in insns that
> are greater cost than those we started with (or else those combinations
> *would* be done).  For the ->2 combinations, there are many ways those
> two insns can be formed: it can be the two arms of a parallel, or
> combine can break a non-matching insn into two at what looks like a good
> spot for that, or it can use a define_split for it.
>
> All those things lead to many more successful combinations :-)

Right.  I definitely want to support multi-insn combos too.  It's one of
the TODOs in the head comment, along with the other points in this list.
Like I say, it's not yet a realistic replacement for even the combine parts
of the current pass.

>> On a more positive note, the pass handles things that the current
>> combine pass doesn't:
>> 
>> - the main motivating feature mentioned above: it works out where
>>   the combined instruction could validly live and moves it there
>>   if necessary.  If there are a range of valid places, it tries
>>   to pick the best one based on register pressure (although only
>>   with a simple heuristic for now).
>
> How are dependencies represented in your new pass?  If it just does
> walks over the insn stream for everything, you get quadratic complexity
> if you move insns backwards.  We have that in combine already, mostly
> from modified_between_p, but that is limited because of how LOG_LINKS
> work, and we have been doing this for so long and there are no 

Re: Add a new combine pass

2019-11-18 Thread Segher Boessenkool
Hi!

On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).

Do you have an example of that?

> Combine currently requires the combined instruction to live at the same
> location as i3.

Or i2 and i3.

> I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.

s/difficult/impossible/, yes.

A long time ago we had to only move insns forward for correctness even,
but that should no longer be required, combine always is finite by other
means now.

> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.

In my dreams tracking nonzero bits would be a dataflow problem.

> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
> 
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.

And combine does any of {2,3,4}->{1,2} combinations, and it also can
modify a third insn ("other_insn").  For the bigger ->1 combos, if it
*can* be decomposed in a bunch of 2->1, then those result in insns that
are greater cost than those we started with (or else those combinations
*would* be done).  For the ->2 combinations, there are many ways those
two insns can be formed: it can be the two arms of a parallel, or
combine can break a non-matching insn into two at what looks like a good
spot for that, or it can use a define_split for it.

All those things lead to many more successful combinations :-)

> On a more positive note, the pass handles things that the current
> combine pass doesn't:
> 
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).

How are dependencies represented in your new pass?  If it just does
walks over the insn stream for everything, you get quadratic complexity
if you move insns backwards.  We have that in combine already, mostly
from modified_between_p, but that is limited because of how LOG_LINKS
work, and we have been doing this for so long and there are no problems
found with it, so it must work in practice.  But I am worried about it
when moving insns back an unlimited distance.

If combine results in two insns it puts them at i2 and i3, and it can
actually move a SET to i2 that was at i3 before the combination.

> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.

That is what combine does, too.

> - it tries using REG_EQUAL notes for the final instruction.

And that.

> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.

That only if somehow there is a link between the two (so essentially
never).  The only combinations tried by combine are those via LOG_LINKs,
which are between a SET and the first corresponding use.  This is a key
factor that makes it kind of linear (instead of exponential) complexity.
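
(A toy model of what LOG_LINKS provide, much simplified and not GCC's
actual insn_link structure.)

struct toy_insn;

/* Each insn carries a list of earlier insns whose SET it is the first
   user of.  Combine only ever tries the pairs (link->def, insn), which
   keeps the number of attempts roughly proportional to the number of
   insns.  */
struct toy_log_link
{
  toy_insn *def;         /* earlier insn providing the value */
  int regno;             /* register that carries it */
  toy_log_link *next;
};

struct toy_insn
{
  int uid;
  toy_log_link *log_links;   /* combination candidates for this insn */
};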

> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.

But how many other insns does it look at, say by modified_between_p or
the like?

> The patch adds two instances of the new pass: one before combine and
> one after it.

One thing I want to do is some mini-combine after every split, probably
only with the insns new from the split.  But we have no cfglayout mode
anymore then, and only hard regs (except in the first split pass, which
is just a little later than your new pass).

> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
> 
> run-combine=2 (normal combine):  100.0% (baseline)
> run-combine=4 (new pass only) 98.0%
> run-combine=6 (both passes)  100.3%
> 
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.

And amount of garbage 

Re: Add a new combine pass

2019-11-18 Thread Richard Sandiford
Richard Sandiford  writes:
> (It's 23:35 local time, so it's still just about stage 1. :-))

Or actually, just under 1 day after end of stage 1.  Oops.
Could have sworn stage 1 ended on the 17th :-(  Only realised
I'd got it wrong when catching up on Saturday's email traffic.

And inevitably, I introduced a couple of stupid mistakes while
trying to clean the patch up for submission by that (non-)deadline.
Here's a version that fixes an inverted overlapping memref check
and that correctly prunes the use list for combined instructions.
(This last one is just a compile-time saving -- the old code was
correct, just suboptimal.)

And those comparisons that looked too good to be true were:
I'd bodged the choice of run-combine parameters when setting
up the tests.  All in all, not a great day.

Here are the (much less impressive) real values:

Target Tests   Delta   Best   Worst  Median
====== =====   =====   ====   =====  ======
aarch64-linux-gnu412-786-270 520  -1
aarch64_be-linux-gnu 288   -3314-270  33  -1
alpha-linux-gnu  399   -2721-370  22  -2
amdgcn-amdhsa2011938-4841259  -1
arc-elf  530   -5901   -1529 356  -1
arm-linux-gnueabi193   -1167-612 680  -1
arm-linux-gnueabihf  193   -1167-612 680  -1
avr-elf 1331 -111093  -13824 680  -9
bfin-elf1347  -18928   -8461 465  -2
bpf-elf   63-475 -60   6  -2
c6x-elf  183  -10508  -10084  41  -2
cr16-elf1610  -51360  -10657  42 -13
cris-elf 143   -1534-702   4  -2
csky-elf 136   -3371-474   6  -2
epiphany-elf 178-389-149  84  -1
fr30-elf 161   -1756-756 289  -2
frv-linux-gnu807  -13324   -2074  67  -1
ft32-elf 282   -1666-111   5  -2
h8300-elf522  -11451   -1747  68  -3
hppa64-hp-hpux11.23  186-848-142  34  -1
i686-apple-darwin344   -1298 -56  44  -1
i686-pc-linux-gnu242   -1953-556  33  -1
ia64-linux-gnu   150   -4834   -1134  40  -4
iq2000-elf   177   -1333 -61   3  -2
lm32-elf 193   -1792-316  47  -2
m32r-elf  73-595 -98  11  -2
m68k-linux-gnu   210   -2351-332 148  -2
mcore-elf133   -1213-146   7  -1
microblaze-elf   445   -4493   -2094  32  -2
mipsel-linux-gnu 134   -2038-222  60  -2
mmix 108-233 -26   4  -1
mn10300-elf  224   -1024-234  80  -1
moxie-rtems  154-743 -79   4  -2
msp430-elf   182-586 -63  19  -1
nds32le-elf  267-485 -37 136  -1
nios2-linux-gnu   83-323 -66   5  -1
nvptx-none   568   -1124-208  16   1
or1k-elf  61-281 -25   4  -1
pdp11248   -1292-182  83  -1
powerpc-ibm-aix7.0  1288   -3031-3702046  -1
powerpc64-linux-gnu 1118 692-2742934  -2
powerpc64le-linux-gnu   1044   -4719-688 156  -1
pru-elf   48   -7014   -6921   6  -1
riscv32-elf   63   -1364-139   7  -2
riscv64-elf   91   -1557-264   7  -1
rl78-elf 354  -16805   -1665  42  -6
rx-elf95-186 -53   8  -1
s390-linux-gnu   184   -2282   -1485  63  -1
s390x-linux-gnu  257-363-159 522  -1
sh-linux-gnu 225-405-108  68  -1
sparc-linux-gnu  164-859 -99  18  -1
sparc64-linux-gnu169-791-102  15  -1
tilepro-linux-gnu   1037   -4896-315 332  -2
v850-elf  54-408 -53   3  -2
vax-netbsdelf251   -3315-400   2  -2
visium-elf   101-693-138  16  -1
x86_64-darwin350   -2145-490  72  -1
x86_64-linux-gnu 311-853-288 210  -1
xstormy16-elf219-770-156  59  -1
xtensa-elf   201   -1418-322  36   1

Also, the number of LDPs on aarch64-linux-gnu went up from
3543 to 5235.  The number of STPs went up from 10494 to 12151.
All the new pairs should be aligned ones.

Retested on aarch64-linux-gnu and x86_64-linux-gnu.  It missed the
deadline, but I thought I'd post it anyway to put the record straight.

Thanks,
Richard


2019-11-18  Richard Sandiford  

gcc/
 

Re: Add a new combine pass

2019-11-17 Thread Andrew Pinski
On Sun, Nov 17, 2019 at 3:35 PM Richard Sandiford
 wrote:
>
> (It's 23:35 local time, so it's still just about stage 1. :-))
>
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).
>
> Combine currently requires the combined instruction to live at the same
> location as i3.  I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.
>
> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.
>
> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
>
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.
>
> - it doesn't allow new hard register clobbers to be added.
>
> - it doesn't have the special treatment of CC operations.
>
> - etc.
>
> But we have to start somewhere.
>
> On a more positive note, the pass handles things that the current
> combine pass doesn't:
>
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).
>
> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.
>
> - it tries using REG_EQUAL notes for the final instruction.
>
> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.
>
> This last feature is useful for generating more load-pair combinations
> on AArch64.  In some cases it can also produce more store-pair combinations,
> but only for consecutive stores.  However, since the pass currently does
> this in a very greedy, peephole way, it only allows load/store-pair
> combinations if the first memory access has a higher alignment than
> the second, i.e. if we can be sure that the combined access is naturally
> aligned.  This should help it to make better decisions than the post-RA
> peephole pass in some cases while not being too aggressive.
>
> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.  Once it has combined two
> instructions, it'll try up to C combinations on the result, but this can
> be counted against the instruction that was deleted by the combination
> and so effectively just doubles the constant.  (Note that C depends
> on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
>
> Unfortunately, debug updates via propagate_for_debug are more expensive.
> This could probably be fixed if the pass did more to track debug insns
> itself, but using propagate_for_debug matches combine's behaviour.
>
> The patch adds two instances of the new pass: one before combine and
> one after it.  By default both are disabled, but this can be changed
> using the new 3-bit run-combine param, where:
>
> - bit 0 selects the new pre-combine pass
> - bit 1 selects the main combine pass
> - bit 2 selects the new post-combine pass
>
> The idea is that run-combine=3 can be used to see which combinations
> are missed by the new pass, while run-combine=6 (which I hope to be
> the production setting for AArch64 at -O2+) just uses the new pass
> to mop up cases that normal combine misses.  Maybe in some distant
> future, the pass will be good enough for run-combine=[14] to be a
> realistic option.
>
> I ended up having to add yet another validate_simplify_* routine,
> this time to do the equivalent of:
>
>newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
>validate_change (insn, loc, newx, 1);
>
> but in a more memory-efficient way.  validate_replace_rtx isn't suitable
> because it deliberately only tries simplifications in limited cases:
>
>   /* Do changes needed to keep rtx consistent.  Don't do any other
>  simplifications, as it is not our job.  */
>
> And validate_simplify_insn isn't useful for this case because it works
> on patterns that have already had changes made to them and expects
> those patterns to be valid rtxes.  simplify-replace operations instead
> need to simplify as they go, when the original modes are still to 

Add a new combine pass

2019-11-17 Thread Richard Sandiford
(It's 23:35 local time, so it's still just about stage 1. :-))

While working on SVE, I've noticed several cases in which we fail
to combine instructions because the combined form would need to be
placed earlier in the instruction stream than the last of the
instructions being combined.  This includes one very important
case in the handling of the first fault register (FFR).

Combine currently requires the combined instruction to live at the same
location as i3.  I thought about trying to relax that restriction, but it
would be difficult to do with the current pass structure while keeping
everything linear-ish time.

So this patch instead goes for an option that has been talked about
several times over the years: writing a new combine pass that just
does instruction combination, and not all the other optimisations
that have been bolted onto combine over time.  E.g. it deliberately
doesn't do things like nonzero-bits tracking, since that really ought
to be a separate, more global, optimisation.

This is still far from being a realistic replacement for even the
combine parts of the current combine pass.  E.g.:

- it only handles combinations that can be built up from individual
  two-instruction combinations.

- it doesn't allow new hard register clobbers to be added.

- it doesn't have the special treatment of CC operations.

- etc.

But we have to start somewhere.

On a more positive note, the pass handles things that the current
combine pass doesn't:

- the main motivating feature mentioned above: it works out where
  the combined instruction could validly live and moves it there
  if necessary.  If there are a range of valid places, it tries
  to pick the best one based on register pressure (although only
  with a simple heuristic for now).

- once it has combined two instructions, it can try combining the
  result with both later and earlier code, i.e. it can combine
  in both directions.

- it tries using REG_EQUAL notes for the final instruction.

- it can parallelise two independent instructions that both read from
  the same register or both read from memory.

This last feature is useful for generating more load-pair combinations
on AArch64.  In some cases it can also produce more store-pair combinations,
but only for consecutive stores.  However, since the pass currently does
this in a very greedy, peephole way, it only allows load/store-pair
combinations if the first memory access has a higher alignment than
the second, i.e. if we can be sure that the combined access is naturally
aligned.  This should help it to make better decisions than the post-RA
peephole pass in some cases while not being too aggressive.
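
(A sketch of the alignment reasoning, with an invented helper name; the
assumptions are spelled out in the comments.)

/* Sketch only.  Assumes power-of-two alignments, both accesses at least
   size-aligned, and the second access immediately following the first.
   Under those assumptions, "first alignment strictly higher than the
   second" implies the first access is aligned to twice the access size,
   so the fused double-size access starting at the first address is
   naturally aligned.  */
static bool
toy_pair_allowed_p (unsigned int first_align, unsigned int second_align)
{
  return first_align > second_align;
}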

The pass is supposed to be linear time without debug insns.
It only tries a constant number C of combinations per instruction
and its bookkeeping updates are constant-time.  Once it has combined two
instructions, it'll try up to C combinations on the result, but this can
be counted against the instruction that was deleted by the combination
and so effectively just doubles the constant.  (Note that C depends
on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
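
(As a back-of-the-envelope bound: with n insns and at most C attempts
per insn, the initial attempts cost at most C * n, and the extra
attempts triggered by successful combinations add at most C per deleted
insn, so the total stays below 2 * C * n.)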

Unfortunately, debug updates via propagate_for_debug are more expensive.
This could probably be fixed if the pass did more to track debug insns
itself, but using propagate_for_debug matches combine's behaviour.

The patch adds two instances of the new pass: one before combine and
one after it.  By default both are disabled, but this can be changed
using the new 3-bit run-combine param, where:

- bit 0 selects the new pre-combine pass
- bit 1 selects the main combine pass
- bit 2 selects the new post-combine pass

The idea is that run-combine=3 can be used to see which combinations
are missed by the new pass, while run-combine=6 (which I hope to be
the production setting for AArch64 at -O2+) just uses the new pass
to mop up cases that normal combine misses.  Maybe in some distant
future, the pass will be good enough for run-combine=[14] to be a
realistic option.
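
(Concretely, the settings above would be selected like so, with the
source file name just a placeholder:)

  gcc -O2 --param=run-combine=6 foo.c   # combine plus the new post-combine pass
  gcc -O2 --param=run-combine=3 foo.c   # new pre-combine pass plus combine
  gcc -O2 --param=run-combine=2 foo.c   # the existing combine pass only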

I ended up having to add yet another validate_simplify_* routine,
this time to do the equivalent of:

   newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
   validate_change (insn, loc, newx, 1);

but in a more memory-efficient way.  validate_replace_rtx isn't suitable
because it deliberately only tries simplifications in limited cases:

  /* Do changes needed to keep rtx consistent.  Don't do any other
 simplifications, as it is not our job.  */

And validate_simplify_insn isn't useful for this case because it works
on patterns that have already had changes made to them and expects
those patterns to be valid rtxes.  simplify-replace operations instead
need to simplify as they go, when the original modes are still to hand.
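
(In other words, the new routine is a leaner replacement for something
like the following two-step helper, which builds a full replacement rtx
up front; this is only a sketch built from the two calls quoted above.)

static void
simplify_then_validate (rtx_insn *insn, rtx *loc, rtx old_rtx, rtx new_rtx)
{
  rtx newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
  validate_change (insn, loc, newx, 1);
}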

As far as compile-time goes, I tried compiling optabs.ii at -O2
with an --enable-checking=release compiler:

run-combine=2 (normal combine):  100.0% (baseline)
run-combine=4 (new pass only) 98.0%
run-combine=6 (both passes)