Re: Add a new combine pass
On Fri, 2019-12-06 at 16:51 -0600, Segher Boessenkool wrote:
> On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote:
> > On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> > > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > > > explicitly, but it's got a R0_REGS class with only one said reg in it.
> > > > So the default impl of class_likely_spilled should do its thing.
> > >
> > > Yes, good point.  So what happened here?
> >
> > "Something, somewhere, went terribly wrong"...
> >
> > insn 18 wants to do
> >
> >     mov.l  @(r4,r6),r0
> >
> > But it can't because the reg+reg address mode has a R0 constraint
> > itself.  So it needs to be changed to
> >
> >     mov    r4,r0
> >     mov.l  @(r0,r6),r0
> >
> > And it can't handle that.  Or only sometimes?  Don't remember.
> >
> > > Is it just RA messing things
> > > up, unrelated to the new pass?
> >
> > Yep, I think so.  The additional pass seems to create "tougher" code so
> > reload passes out earlier than usual.  We've had the same issue when
> > trying address mode selection optimization.  In fact that was one huge
> > showstopper.
>
> So maybe you should have a define_insn_and_split that allows any two
> regs and replaces one by r0 if neither is (and a move to r0 before the
> load)?  Split after reload of course.
>
> It may be admitting defeat, but it may even result in better code as
> well ;-)

AFAIR I've tried that already and it was just like running in circles.
Means it didn't help.  Perhaps if R0_REGS was hidden from RA altogether
it might work.  But that sounds like opening a whole other can of worms.

Another idea I was entertaining was to do a custom RTL pass to
pre-allocate all R0 constraints before the real full RA.  But then the
whole reload stuff would still have the same issue as above.  So all
the wallpapering is just moot.  Proper fix of the actual problem would
be more appropriate.

Cheers,
Oleg
Re: Add a new combine pass
On Wed, Dec 04, 2019 at 07:43:30PM +0900, Oleg Endo wrote:
> On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> > > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > > explicitly, but it's got a R0_REGS class with only one said reg in it.
> > > So the default impl of class_likely_spilled should do its thing.
> >
> > Yes, good point.  So what happened here?
>
> "Something, somewhere, went terribly wrong"...
>
> insn 18 wants to do
>
>     mov.l  @(r4,r6),r0
>
> But it can't because the reg+reg address mode has a R0 constraint
> itself.  So it needs to be changed to
>
>     mov    r4,r0
>     mov.l  @(r0,r6),r0
>
> And it can't handle that.  Or only sometimes?  Don't remember.
>
> > Is it just RA messing things
> > up, unrelated to the new pass?
>
> Yep, I think so.  The additional pass seems to create "tougher" code so
> reload passes out earlier than usual.  We've had the same issue when
> trying address mode selection optimization.  In fact that was one huge
> showstopper.

So maybe you should have a define_insn_and_split that allows any two
regs and replaces one by r0 if neither is (and a move to r0 before the
load)?  Split after reload of course.

It may be admitting defeat, but it may even result in better code as
well ;-)


Segher
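[For readers unfamiliar with the suggestion: a define_insn_and_split along these lines could look roughly like the sketch below.  This is only an illustration, not code from the real sh.md -- the pattern name, predicates, and the use of SH's "z" (R0) constraint are assumptions, and a real pattern would reuse an operand that is already in r0 instead of always emitting the extra move.]

```
;; Hypothetical sketch: accept any two index registers for a reg+reg
;; load, match with "#" (no assembly), and split after reload into a
;; move to r0 followed by the real @(r0,Rn) load.
(define_insn_and_split "*movsi_index_split"
  [(set (match_operand:SI 0 "arith_reg_dest" "=z")
	(mem:SI (plus:SI (match_operand:SI 1 "arith_reg_operand" "r")
			 (match_operand:SI 2 "arith_reg_operand" "r"))))]
  "TARGET_SH1"
  "#"
  "&& reload_completed"
  [(set (match_dup 3) (match_dup 1))
   (set (match_dup 0)
	(mem:SI (plus:SI (match_dup 3) (match_dup 2))))]
{
  /* R0_REG is hard register 0 on SH; a real implementation would
     check whether operand 1 or 2 already is r0 and skip the move.  */
  operands[3] = gen_rtx_REG (SImode, R0_REG);
})
```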
Re: Add a new combine pass
Here's a revised version based on the feedback so far.

Changes in v2:
- Don't move instructions that set or use allocatable hard registers.
- Check legitimate_combined_insn.
- Check cannot_copy_insn_p when keeping the original insn in parallel.
- Disable the pass if HAVE_cc0.

I compared v1 and v2 in the same way as before and the new restrictions
didn't make much difference (as hoped).  Also bootstrapped & regression-
tested on aarch64-linux-gnu and x86_64-linux-gnu with run-combine
defaulting to 6 (unlike in the patch, where the new pass is disabled by
default).

Thanks,
Richard


2019-12-05  Richard Sandiford

gcc/
	* Makefile.in (OBJS): Add combine2.o.
	* params.opt (--param=run-combine): New option.
	* doc/invoke.texi: Document it.
	* tree-pass.h (make_pass_combine2_before): Declare.
	(make_pass_combine2_after): Likewise.
	* passes.def: Add them.
	* timevar.def (TV_COMBINE2): New timevar.
	* cfgrtl.h (update_cfg_for_uncondjump): Declare.
	* combine.c (update_cfg_for_uncondjump): Move to...
	* cfgrtl.c (update_cfg_for_uncondjump): ...here.
	* simplify-rtx.c (simplify_truncation): Handle comparisons.
	* recog.h (validate_simplify_replace_rtx): Declare.
	* recog.c (validate_simplify_replace_rtx_1): New function.
	(validate_simplify_replace_rtx_uses): Likewise.
	(validate_simplify_replace_rtx): Likewise.
	* combine2.c: New file.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2019-12-03 18:06:09.885650522 +
+++ gcc/Makefile.in	2019-12-05 10:11:50.637631870 +
@@ -1261,6 +1261,7 @@ OBJS = \
 	cgraphunit.o \
 	cgraphclones.o \
 	combine.o \
+	combine2.o \
 	combine-stack-adj.o \
 	compare-elim.o \
 	context.o \
Index: gcc/params.opt
===================================================================
--- gcc/params.opt	2019-12-02 17:38:20.072423250 +
+++ gcc/params.opt	2019-12-05 10:11:50.653631761 +
@@ -760,6 +760,10 @@ Use internal function id in profile look
 Common Joined UInteger Var(param_rpo_vn_max_loop_depth) Init(7) IntegerRange(2, 65536) Param
 Maximum depth of a loop nest to fully value-number optimistically.
 
+-param=run-combine=
+Target Joined UInteger Var(param_run_combine) Init(2) IntegerRange(0, 7) Param
+Choose which of the 3 available combine passes to run: bit 1 for the main combine pass, bit 0 for an earlier variant of the combine pass, and bit 2 for a later variant of the combine pass.
+
 -param=sccvn-max-alias-queries-per-access=
 Common Joined UInteger Var(param_sccvn_max_alias_queries_per_access) Init(1000) Param
 Maximum number of disambiguations to perform per memory access.
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-12-02 17:38:18.364434903 +
+++ gcc/doc/invoke.texi	2019-12-05 10:11:50.653631761 +
@@ -11797,6 +11797,11 @@ in combiner for a pseudo register as las
 @item max-combine-insns
 The maximum number of instructions the RTL combiner tries to combine.
 
+@item run-combine
+Choose which of the 3 available combine passes to run: bit 1 for the main
+combine pass, bit 0 for an earlier variant of the combine pass, and bit 2
+for a later variant of the combine pass.
+
 @item integer-share-limit
 Small integer constants can use a shared data structure, reducing the
 compiler's memory usage and increasing its speed.  This sets the maximum
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	2019-11-19 16:25:28.0 +
+++ gcc/tree-pass.h	2019-12-05 10:11:50.657631731 +
@@ -562,7 +562,9 @@ extern rtl_opt_pass *make_pass_reginfo_i
 extern rtl_opt_pass *make_pass_inc_dec (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_ptr_mod (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_initialize_regs (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_before (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_combine (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_combine2_after (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_jump_after_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ree (gcc::context *ctxt);
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	2019-11-19 16:25:28.0 +
+++ gcc/passes.def	2019-12-05 10:11:50.653631761 +
@@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
+      NEXT_PASS (pass_combine2_before);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_combine2_after);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS
Re: Add a new combine pass
On Tue, 2019-12-03 at 12:05 -0600, Segher Boessenkool wrote:
> On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> > On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > > > > - sh (that's sh4-linux):
> > > > >
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > > > >  1638 | }
> > > > >       | ^
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32]))
> > > > >      "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > > >             (nil))))
> > > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> > > >
> > > > Would have to look more at this one.  Seems odd that it can't allocate
> > > > R0 when it's already the destination and when R0 can't be live before
> > > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > > > is a bit limited :-)
> > >
> > > It wants to use r0 in some other insn, so it needs to spill it here, but
> > > cannot.  This is what class_likely_spilled is for.
> >
> > Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> > explicitly, but it's got a R0_REGS class with only one said reg in it.
> > So the default impl of class_likely_spilled should do its thing.
>
> Yes, good point.  So what happened here?

"Something, somewhere, went terribly wrong"...

insn 18 wants to do

    mov.l  @(r4,r6),r0

But it can't because the reg+reg address mode has a R0 constraint
itself.  So it needs to be changed to

    mov    r4,r0
    mov.l  @(r0,r6),r0

And it can't handle that.  Or only sometimes?  Don't remember.

> Is it just RA messing things
> up, unrelated to the new pass?

Yep, I think so.  The additional pass seems to create "tougher" code so
reload passes out earlier than usual.  We've had the same issue when
trying address mode selection optimization.  In fact that was one huge
showstopper.

Cheers,
Oleg
Re: Add a new combine pass
On Tue, Dec 03, 2019 at 10:33:48PM +0900, Oleg Endo wrote:
> On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > > > - sh (that's sh4-linux):
> > > >
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > > >  1638 | }
> > > >       | ^
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32]))
> > > >      "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > > >             (nil))))
> > > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> > >
> > > Would have to look more at this one.  Seems odd that it can't allocate
> > > R0 when it's already the destination and when R0 can't be live before
> > > the insn.  But there again, this is reload, so my enthusiasm for looking
> > > is a bit limited :-)
> >
> > It wants to use r0 in some other insn, so it needs to spill it here, but
> > cannot.  This is what class_likely_spilled is for.
>
> Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
> explicitly, but it's got a R0_REGS class with only one said reg in it.
> So the default impl of class_likely_spilled should do its thing.

Yes, good point.  So what happened here?  Is it just RA messing things
up, unrelated to the new pass?


Segher
Re: Add a new combine pass
On Mon, 2019-11-25 at 16:47 -0600, Segher Boessenkool wrote:
> > > - sh (that's sh4-linux):
> > >
> > > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> > >  1638 | }
> > >       | ^
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > > (insn 18 17 19 2 (set (reg:SI 0 r0)
> > >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> > >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32]))
> > >      "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> > >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> > >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> > >             (nil))))
> > > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> >
> > Would have to look more at this one.  Seems odd that it can't allocate
> > R0 when it's already the destination and when R0 can't be live before
> > the insn.  But there again, this is reload, so my enthusiasm for looking
> > is a bit limited :-)
>
> It wants to use r0 in some other insn, so it needs to spill it here, but
> cannot.  This is what class_likely_spilled is for.

Hmm ... the R0 problem ... SH doesn't override class_likely_spilled
explicitly, but it's got a R0_REGS class with only one said reg in it.
So the default impl of class_likely_spilled should do its thing.

LRA is available on SH and often fixes the R0 problems -- but not
always.  Maybe it got better over time, haven't checked.  Could you
re-run the SH build tests with -mlra, please?

Cheers,
Oleg
Re: Add a new combine pass
On Wed, Nov 27, 2019 at 09:29:27AM +0100, Richard Biener wrote:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool wrote:
> > combine actually *calculates* DU chains almost completely, it just throws
> > away most of that information (it wants to have LOG_LINKS, as it did ages
> > ago).  The only thing stopping us from doing that right now is that not
> > all uses are counted (some are skipped).
> >
> > Since combine works only within BBs, DU chains are linear to compute, and
> > UD chains are trivial (and just linear to compute).
>
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses, which gives you a chain of size N * M.

And both N and M are constants here (bounded by a constant).  The only
dimensions we care about are those the user can grow unlimited: number
of registers, number of instructions, number of functions, that kind of
thing.

The control flow graph in a basic block is a DAG, making most of this
linear to compute.  Only updating it after every separate change is not
easily linear in total.

> Updating is linear as well if you can disregard partial defs.  Updating
> cannot be quadratic if compute is linear ;)

Sure it can.  Updating has to be O(1) (amortized) per change for the
whole pass to be O(n).  If it is O(n) per change you are likely O(n^2)
in total.

I don't see how to make combine itself O(1) per change, but yeah, I can
see how that can work (or almost work) for something simpler (and less
weighed down by history :-) ).


Segher
Re: Add a new combine pass
Richard Biener writes:
> On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool
> wrote:
>>
>> On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
>> > Segher Boessenkool writes:
>> > > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
>> > >> Segher Boessenkool writes:
>> > >> > I am wondering the other way around :-)  Is what you do for combine2
>> > >> > something that would be more generally applicable/useful?  That's what
>> > >> > I'm trying to find out :-)
>> > >> >
>> > >> > What combine does could use some improvement, if you want to hear a
>> > >> > more direct motivation.  LOG_LINKS just skip references we cannot
>> > >> > handle (and some more), so we always have to do modified_between etc.,
>> > >> > which hurts.
>> > >>
>> > >> The trade-offs behind the choice of representation are very specific
>> > >> to the pass.
>> > >
>> > > Yes, but hopefully not so specific that every pass needs a completely
>> > > different representation ;-)
>> >
>> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
>> > But since DU/UD-chains are naturally quadratic in the general case,
>> > and are expensive to keep up to date, each DU/UD pass is going to have
>> > to make some compromises.  It doesn't seem too bad that passes make
>> > different compromises based on what they're trying to do.  (combine:
>> > single use per definition; fwprop.c: track all uses, but for dominating
>> > definitions only; sched: fudged via a param; regrename: single
>> > definition/multiple use chains optimised for renaming; combine2: full
>> > live range information, but limited use list; etc.)
>>
>> combine actually *calculates* DU chains almost completely, it just throws
>> away most of that information (it wants to have LOG_LINKS, as it did ages
>> ago).  The only thing stopping us from doing that right now is that not
>> all uses are counted (some are skipped).
>>
>> Since combine works only within BBs, DU chains are linear to compute, and
>> UD chains are trivial (and just linear to compute).
>
> quadraticness appears for RTL DU/UD chains because of partial definitions,
> that doesn't change for BBs so even there computing them is quadratic
> (because recording them is).  The situation is simply having N partial
> defs all reaching M uses, which gives you a chain of size N * M.
>
> Now - for combine you don't want partial defs, so for simplicity we could
> choose to _not_ record DU/UD chains whenever we see a partial def for
> a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
> handle DU/UD chains for regions where there is no partial definition
> and add a "fake" D denoting (there are [multiple] defs beyond that
> might be partial).  Depending on the use-case that should suffice and
> make the problem linear.
>
> I think you want to ask sth like "is REG changed [partially] between
> its use in insn A and the def in insn B" and you want to answer that by
> using REG's UD chain for that.  If you only ever reached the def in insn B
> via the "pruned" chain then this would work, likewise for the chain we do
> not compute any UD chain for REG.

(Passing over this as I think it's about what current combine wants.)

>> Updating is quadratic in general, sure.  Luckily in most realistic cases
>> it is cheap (most, sigh) (insns aren't combined to very far away).
>
> Updating is linear as well if you can disregard partial defs.
> Updating cannot be quadratic if compute is linear ;)

This was based on the assumption that we'd do an update after each
combination, so that the pass still sees correct info.  That then makes
the updates across one run of the pass quadratic, since the number of
successful combinations is O(ninsns).

As far as the new pass goes: the pass would be quadratic if we tried to
combine each use in a single-def DU chain with its definition.  It would
also be quadratic if we tried to parallelise each pair of uses in a DU
chain.  So if we did have full DU chains in the new pass, we'd also need
some limit N on the number of uses we try to combine with.  And if we're
only going to try combining with N uses, then it seemed better to track
only N uses "by name", rather than pay the cost of tracking all uses by
name but ignoring the information for some of them.

All we care about for other uses is whether they would prevent a move.
We can track that using a simple point-based live range, where points
are LUIDs with gaps in between for new insns.

So the new pass uses a list of N specific uses and a single live range.
Querying whether a particular definition is live at a particular point
is then a constant-time operation.  So is updating the info after a
successful combination (potentially including a move).

That still seems like a reasonable way of representing this, given what
the pass wants to do.  Moving to full DU chains would IMO just make the
pass more expensive with no obvious benefit.

Thanks,
Richard
Re: Add a new combine pass
On Tue, Nov 26, 2019 at 2:42 AM Segher Boessenkool wrote:
>
> On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
> > Segher Boessenkool writes:
> > > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
> > >> Segher Boessenkool writes:
> > >> > I am wondering the other way around :-)  Is what you do for combine2
> > >> > something that would be more generally applicable/useful?  That's what
> > >> > I'm trying to find out :-)
> > >> >
> > >> > What combine does could use some improvement, if you want to hear a
> > >> > more direct motivation.  LOG_LINKS just skip references we cannot
> > >> > handle (and some more), so we always have to do modified_between etc.,
> > >> > which hurts.
> > >>
> > >> The trade-offs behind the choice of representation are very specific
> > >> to the pass.
> > >
> > > Yes, but hopefully not so specific that every pass needs a completely
> > > different representation ;-)
> >
> > Well, it depends.  Most passes make do with df (without DU/UD-chains).
> > But since DU/UD-chains are naturally quadratic in the general case,
> > and are expensive to keep up to date, each DU/UD pass is going to have
> > to make some compromises.  It doesn't seem too bad that passes make
> > different compromises based on what they're trying to do.  (combine:
> > single use per definition; fwprop.c: track all uses, but for dominating
> > definitions only; sched: fudged via a param; regrename: single
> > definition/multiple use chains optimised for renaming; combine2: full
> > live range information, but limited use list; etc.)
>
> combine actually *calculates* DU chains almost completely, it just throws
> away most of that information (it wants to have LOG_LINKS, as it did ages
> ago).  The only thing stopping us from doing that right now is that not
> all uses are counted (some are skipped).
>
> Since combine works only within BBs, DU chains are linear to compute, and
> UD chains are trivial (and just linear to compute).

quadraticness appears for RTL DU/UD chains because of partial definitions,
that doesn't change for BBs so even there computing them is quadratic
(because recording them is).  The situation is simply having N partial
defs all reaching M uses, which gives you a chain of size N * M.

Now - for combine you don't want partial defs, so for simplicity we could
choose to _not_ record DU/UD chains whenever we see a partial def for
a pseudo (and mark those as "bad").  Or, slightly enhanced, we can
handle DU/UD chains for regions where there is no partial definition
and add a "fake" D denoting (there are [multiple] defs beyond that
might be partial).  Depending on the use-case that should suffice and
make the problem linear.

I think you want to ask sth like "is REG changed [partially] between
its use in insn A and the def in insn B" and you want to answer that by
using REG's UD chain for that.  If you only ever reached the def in insn B
via the "pruned" chain then this would work, likewise for the chain we do
not compute any UD chain for REG.

> Updating is quadratic in general, sure.  Luckily in most realistic cases
> it is cheap (most, sigh) (insns aren't combined to very far away).

Updating is linear as well if you can disregard partial defs.
Updating cannot be quadratic if compute is linear ;)

> > So yeah, if passes want to make roughly the same compromises, it would
> > obviously be good if they shared a representation.  But since each pass
> > does something different, I don't think it's a bad sign that they make
> > different compromises and use different representations.
> >
> > So I don't think a new pass with a new representation is in itself a
> > sign of failure.
>
> Oh, I don't think so either.  I just wonder if it would be useful more
> generically :-)
>
> > >> >> >> Target                 Tests   Delta    Best   Worst  Median
> > >> >> >> avr-elf                 1341 -111401  -13824     680     -10
> > >> >> >
> > >> >> > Things like this are kind of suspicious :-)
> > >> >>
> > >> >> Yeah.  This mostly seems to come from mopping up the extra moves
> > >> >> created by make_more_copies.  So we have combinations like:
> > >> >>
> > >> >>    58: r70:SF=r94:SF
> > >> >>        REG_DEAD r94:SF
> > >> >>    60: r22:SF=r70:SF
> > >> >>        REG_DEAD r70:SF
> > >> >
> > >> > Why didn't combine do this?  A target problem?
> > >>
> > >> Seems to be because combine rejects hard-reg destinations whose classes
> > >> are likely spilled (cant_combine_insn_p).
> > >
> > > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > > then -- ICEs in RA.
> >
> > Not in this case though.  The final instruction is a hardreg<-pseudo move
> > whatever happens.  There's nothing special about r70 compared to r94.
>
> So the target hook could be improved?  Or, this doesn't matter anyway,
> the extra register move does not prevent any combinations, and RA should
> get rid of it when that is beneficial.
>
> But you see smaller code in the end,
Re: Add a new combine pass
On Mon, Nov 25, 2019 at 11:08:47PM +, Richard Sandiford wrote:
> Segher Boessenkool writes:
> > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote:
> >> Segher Boessenkool writes:
> >> > I am wondering the other way around :-)  Is what you do for combine2
> >> > something that would be more generally applicable/useful?  That's what
> >> > I'm trying to find out :-)
> >> >
> >> > What combine does could use some improvement, if you want to hear a
> >> > more direct motivation.  LOG_LINKS just skip references we cannot
> >> > handle (and some more), so we always have to do modified_between etc.,
> >> > which hurts.
> >>
> >> The trade-offs behind the choice of representation are very specific
> >> to the pass.
> >
> > Yes, but hopefully not so specific that every pass needs a completely
> > different representation ;-)
>
> Well, it depends.  Most passes make do with df (without DU/UD-chains).
> But since DU/UD-chains are naturally quadratic in the general case,
> and are expensive to keep up to date, each DU/UD pass is going to have
> to make some compromises.  It doesn't seem too bad that passes make
> different compromises based on what they're trying to do.  (combine:
> single use per definition; fwprop.c: track all uses, but for dominating
> definitions only; sched: fudged via a param; regrename: single
> definition/multiple use chains optimised for renaming; combine2: full
> live range information, but limited use list; etc.)

combine actually *calculates* DU chains almost completely, it just throws
away most of that information (it wants to have LOG_LINKS, as it did ages
ago).  The only thing stopping us from doing that right now is that not
all uses are counted (some are skipped).

Since combine works only within BBs, DU chains are linear to compute, and
UD chains are trivial (and just linear to compute).

Updating is quadratic in general, sure.  Luckily in most realistic cases
it is cheap (most, sigh) (insns aren't combined to very far away).

> So yeah, if passes want to make roughly the same compromises, it would
> obviously be good if they shared a representation.  But since each pass
> does something different, I don't think it's a bad sign that they make
> different compromises and use different representations.
>
> So I don't think a new pass with a new representation is in itself a
> sign of failure.

Oh, I don't think so either.  I just wonder if it would be useful more
generically :-)

> >> >> >> Target                 Tests   Delta    Best   Worst  Median
> >> >> >> avr-elf                 1341 -111401  -13824     680     -10
> >> >> >
> >> >> > Things like this are kind of suspicious :-)
> >> >>
> >> >> Yeah.  This mostly seems to come from mopping up the extra moves
> >> >> created by make_more_copies.  So we have combinations like:
> >> >>
> >> >>    58: r70:SF=r94:SF
> >> >>        REG_DEAD r94:SF
> >> >>    60: r22:SF=r70:SF
> >> >>        REG_DEAD r70:SF
> >> >
> >> > Why didn't combine do this?  A target problem?
> >>
> >> Seems to be because combine rejects hard-reg destinations whose classes
> >> are likely spilled (cant_combine_insn_p).
> >
> > Ah, okay.  And that is required to prevent ICEs, in combine2 as well
> > then -- ICEs in RA.
>
> Not in this case though.  The final instruction is a hardreg<-pseudo move
> whatever happens.  There's nothing special about r70 compared to r94.

So the target hook could be improved?  Or, this doesn't matter anyway,
the extra register move does not prevent any combinations, and RA should
get rid of it when that is beneficial.

But you see smaller code in the end, hrm.


Segher
Re: Add a new combine pass
Segher Boessenkool writes: > Hi! > > On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote: >> Segher Boessenkool writes: >> > I am wondering the other way around :-) Is what you do for combine2 >> > something that would be more generally applicable/useful? That's what >> > I'm trying to find out :-) >> > >> > What combine does could use some improvement, if you want to hear a >> > more direct motivations. LOG_LINKS just skip references we cannot >> > handle (and some more), so we always have to do modified_between etc., >> > which hurts. >> >> The trade-offs behind the choice of representation are very specific >> to the pass. > > Yes, but hopefully not so specific that every pass needs a completely > different representation ;-) Well, it depends. Most passes make do with df (without DU/UD-chains). But since DU/UD-chains are naturally quadratic in the general case, and are expensive to keep up to date, each DU/UD pass is going to have make some compromises. It doesn't seem too bad that passes make different compromises based on what they're trying to do. (combine: single use per definition; fwprop.c: track all uses, but for dominating definitions only; sched: fudged via a param; regrename: single definition/multiple use chains optimised for renmaing; combine2: full live range information, but limited use list; etc.) So yeah, if passes want to make roughly the same compromises, it would obviously be good if they shared a representation. But since each pass does something different, I don't think it's a bad sign that they make different compromises and use different representations. So I don't think a new pass with a new representation is in itself a sign of failure. >> >> >> Target Tests DeltaBest Worst Median >> >> >> avr-elf 1341 -111401 -13824 680 -10 >> >> > >> >> > Things like this are kind of suspicious :-) >> >> >> >> Yeah. This mostly seems to come from mopping up the extra moves created >> >> by make_more_copies. 
So we have combinations like: >> >> >> >>58: r70:SF=r94:SF >> >> REG_DEAD r94:SF >> >>60: r22:SF=r70:SF >> >> REG_DEAD r70:SF >> > >> > Why didn't combine do this? A target problem? >> >> Seems to be because combine rejects hard-reg destinations whose classes >> are likely spilled (cant_combine_insn_p). > > Ah, okay. And that is required to prevent ICEs, in combine2 as well > then -- ICEs in RA. Not in this case though. The final instruction is a hardreg<-pseudo move whatever happens. There's nothing special about r70 compared to r94. > There should be a better way to do this. ISTM we should be checking for whichever cases actually cause the RA failures. E.g. to take on extreme example, if all the following are true: - an insn has a single alternative - an insn has a single non-earyclobber output - an insn has no parallel clobbers - an insn has no auto-inc/decs - an insn has a hard register destination that satisfies its constraints - the hard register is defined in its original location then there should be no problem. The insn shouldn't need any output reloads that would conflict with the hard register. It also doesn't extend the live range of the output. Obviously that's a lot of conditions :-) And IMO they should be built up the other way around: reject specific cases that are known to cause problems, based on information about the matched insn. But I think the avr example shows that there's a real problem with using REGNO_REG_CLASS for this too. REGNO_REG_CLASS gives the smallest enclosing class, which might not be the most relevant one in context. (It isn't here, since we're just passing arguments to functions.) >> This SF argument register >> happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject >> the combination based on POINTER_X_REGS being likely spilled. > > static bool > avr_class_likely_spilled_p (reg_class_t c) > { > return (c != ALL_REGS && >(AVR_TINY ? 1 : c != ADDW_REGS)); > } > > So this target severely shackles combine. 
Does it have to? If so, why > not with combine2? As far as the above example goes, I think returning true for POINTER_X_REGS is the right thing to do. It only has two 8-bit registers, and they act as a pair when used as a pointer. Thanks, Richard
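The "known safe" checklist Richard sketches above can be put in code. The following is a self-contained illustration only, not GCC's real recog data: the insn_desc struct and all of its field names are hypothetical stand-ins for information that would come from the matched insn.

```cpp
#include <cassert>

// Hypothetical, simplified description of a matched insn.  Every field
// here is an illustrative stand-in, not GCC's actual recog_data.
struct insn_desc
{
  int num_alternatives;
  int num_outputs;
  bool output_earlyclobber;
  bool has_parallel_clobbers;
  bool has_autoinc;
  bool hard_reg_dest_ok_for_constraints;
  bool dest_in_original_location;
};

// Sketch of the proposed test: accept a hard-register destination only
// when none of the listed conditions that could force an output reload,
// or extend the hard register's live range, can apply.
static bool
hard_reg_dest_known_safe (const insn_desc &d)
{
  return d.num_alternatives == 1
         && d.num_outputs == 1
         && !d.output_earlyclobber
         && !d.has_parallel_clobbers
         && !d.has_autoinc
         && d.hard_reg_dest_ok_for_constraints
         && d.dest_in_original_location;
}
```

As the message says, a real implementation would more likely be built the other way around, rejecting only the specific cases known to cause RA problems.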
Re: Add a new combine pass
On Mon, Nov 25, 2019 at 09:40:36PM +0000, Richard Sandiford wrote: > Segher Boessenkool writes: > > - i386 goes into an infinite loop compiling, or at least an hour or so... > > Erm I forgot to record what it was compiling. I did attach a GDB... It > > is something from lra_create_live_ranges. > > Hmm. This one is actually worrying me -- it's not obviously a simple problem, or a target problem, or a pre-existing problem. > Ah, this'll be while m68k was still a cc0 target. Yeah, I should probably > just skip the whole pass for cc0. Yes, tree of last friday or saturday or so. And yup if you don't handle cc0 yet, yes you want to skip it completely :-) > > - sh (that's sh4-linux): > >
> > /home/segher/src/kernel/net/ipv4/af_inet.c: In function 'snmp_get_cpu_field':
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: unable to find a register to spill in class 'R0_REGS'
> >  1638 | }
> >       | ^
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638:1: error: this is the insn:
> > (insn 18 17 19 2 (set (reg:SI 0 r0)
> >         (mem:SI (plus:SI (reg:SI 4 r4 [178])
> >                 (reg:SI 6 r6 [171])) [17 *_3+0 S4 A32]))
> >      "/home/segher/src/kernel/net/ipv4/af_inet.c":1638:1 188 {movsi_i}
> >      (expr_list:REG_DEAD (reg:SI 4 r4 [178])
> >         (expr_list:REG_DEAD (reg:SI 6 r6 [171])
> >             (nil
> > /home/segher/src/kernel/net/ipv4/af_inet.c:1638: confused by earlier errors, bailing out
> >
> Would have to look more at this one. Seems odd that it can't allocate > R0 when it's already the destination and when R0 can't be live before > the insn. But there again, this is reload, so my enthusiasm for looking > is a bit limited :-) It wants to use r0 in some other insn, so it needs to spill it here, but cannot. This is what class_likely_spilled is for.
> > Looking at just binary size, which is a good stand-in for how many insns > > it combined: > >
> >               R2
> > arm64    99.709%
> > ia64     99.651%
> > s390     99.734%
> > sparc    99.877%
> > sparc64 100.011%
> >
> > (These are those that are not between 99.9% and 100.0%). > > > > So only sparc64 regressed, and just a tiny bit (I can look at what that > > is, if there is interest). But 32-bit sparc improved, and s390, arm64, > > and ia64 got actual benefit. > > > > Again this is just code size, not analysing the actually changed code. > > OK. Certainly not an earth-shattering improvement then, but not > entirely worthless either. I usually take 0.2% as "definitely useful" for combine improvements, so there are a few targets that have that. There can be improvements that are important for a target even if they do not improve code size much, of course, and it can identify weaknesses in the backend code, so you always need to look at what really changes. > I see combine also tests cannot_copy_insn_p. I'm not sure whether that's > appropriate for the new pass or not. Arguably it's not copying the > instruction, it's just moving it to be in parallel with something else. > (But then that's largely true of the combine case too.) combine tests this only for the cases where it *does* have to copy an insn: when the dest of i0, i1, or i2 doesn't die, it is added as another arm to the (parallel) result. Segher
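As context for the class_likely_spilled discussion above: if memory serves, the default targetm.class_likely_spilled_p hook simply treats any single-register class as likely spilled, which is how a one-register class like SH's R0_REGS should be caught without a target override. A stand-alone sketch, where the size table is an illustrative stand-in rather than any target's real data:

```cpp
#include <cassert>

// Illustrative stand-in for a target's register-class size table:
// say class 0 is an empty NO_REGS, class 1 a one-register class like
// SH's R0_REGS, and class 2 a 16-register general class.
static const unsigned int reg_class_size[] = { 0, 1, 16 };

// Sketch of the default class_likely_spilled_p behaviour: with no
// target override, a class counts as likely spilled exactly when it
// contains a single register.
static bool
default_class_likely_spilled_p (int rclass)
{
  return reg_class_size[rclass] == 1;
}
```

On this model, R0_REGS-style classes (size 1) are flagged automatically, so combine's cant_combine_insn_p would already refuse to propagate into them.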
Re: Add a new combine pass
Hi! On Mon, Nov 25, 2019 at 09:16:52PM +, Richard Sandiford wrote: > Segher Boessenkool writes: > > I am wondering the other way around :-) Is what you do for combine2 > > something that would be more generally applicable/useful? That's what > > I'm trying to find out :-) > > > > What combine does could use some improvement, if you want to hear a > > more direct motivations. LOG_LINKS just skip references we cannot > > handle (and some more), so we always have to do modified_between etc., > > which hurts. > > The trade-offs behind the choice of representation are very specific > to the pass. Yes, but hopefully not so specific that every pass needs a completely different representation ;-) > >> >> Target Tests DeltaBest Worst Median > >> >> avr-elf 1341 -111401 -13824 680 -10 > >> > > >> > Things like this are kind of suspicious :-) > >> > >> Yeah. This mostly seems to come from mopping up the extra moves created > >> by make_more_copies. So we have combinations like: > >> > >>58: r70:SF=r94:SF > >> REG_DEAD r94:SF > >>60: r22:SF=r70:SF > >> REG_DEAD r70:SF > > > > Why didn't combine do this? A target problem? > > Seems to be because combine rejects hard-reg destinations whose classes > are likely spilled (cant_combine_insn_p). Ah, okay. And that is required to prevent ICEs, in combine2 as well then -- ICEs in RA. There should be a better way to do this. > This SF argument register > happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject > the combination based on POINTER_X_REGS being likely spilled. static bool avr_class_likely_spilled_p (reg_class_t c) { return (c != ALL_REGS && (AVR_TINY ? 1 : c != ADDW_REGS)); } So this target severely shackles combine. Does it have to? If so, why not with combine2? > >> So there's only one case in which it isn't a win, but the number of > >> tests is tiny. So I agree there's no justification for trying this in > >> combine proper as things stand (and I wasn't arguing otherwise FWIW). 
> >> I'd still like to keep it in the new pass because it does help > >> *sometimes* and there's no sign yet that it has a noticeable > >> compile-time cost. > > > > So when does it help? I can only think of cases where there are > > problems elsewhere. > > The full list of affected tests (all at -O2 -ftree-vectorize) are: I'll have to look at this closer later, sorry. Segher
Re: Add a new combine pass
Segher Boessenkool writes: > Hi! > > On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote: >> Richard Sandiford writes: >> > (It's 23:35 local time, so it's still just about stage 1. :-)) >> >> Or actually, just under 1 day after end of stage 1. Oops. >> Could have sworn stage 1 ended on the 17th :-( Only realised >> I'd got it wrong when catching up on Saturday's email traffic. >> >> And inevitably, I introduced a couple of stupid mistakes while >> trying to clean the patch up for submission by that (non-)deadline. >> Here's a version that fixes an inverted overlapping memref check >> and that correctly prunes the use list for combined instructions. >> (This last one is just a compile-time saving -- the old code was >> correct, just suboptimal.) > > I've built the Linux kernel with the previous version, as well as this > one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: > > (I've forced --param=run-combine=6 for R1 and R2): > (Percentages are relative to R0): >
>                    R0        R1        R2        R1        R2
> alpha         6107088   6101088   6101088   99.902%   99.902%
> arc           4008224   4006568   4006568   99.959%   99.959%
> arm           9206728   9200936   9201000   99.937%   99.938%
> arm64        13056174  13018174  13018194   99.709%   99.709%
> armhf               0         0         0         0         0
> c6x           2337237   2337077   2337077   99.993%   99.993%
> csky          3356602         0         0         0         0
> h8300         1166996   1166776   1166776   99.981%   99.981%
> i386         11352159         0         0         0         0
> ia64         18230640  18167000  18167000   99.651%   99.651%
> m68k          3714271         0         0         0         0
> microblaze    4982749   4979945   4979945   99.944%   99.944%
> mips          8499309   8495205   8495205   99.952%   99.952%
> mips64        7042036   7039816   7039816   99.968%   99.968%
> nds32         4486663         0         0         0         0
> nios2         3680001   3679417   3679417   99.984%   99.984%
> openrisc      4226076   4225868   4225868   99.995%   99.995%
> parisc        7681895   7680063   7680063   99.976%   99.976%
> parisc64      8677077   8676581   8676581   99.994%   99.994%
> powerpc      10687611  10682199  10682199   99.949%   99.949%
> powerpc64    17671082  17658570  17658570   99.929%   99.929%
> powerpc64le  17671082  17658570  17658570   99.929%   99.929%
> riscv32       1554938   1554758   1554758   99.988%   99.988%
> riscv64       6634342   6632788   6632788   99.977%   99.977%
> s390         13049643  13014939  13014939   99.734%   99.734%
> sh            3254743         0         0         0         0
> shnommu       1632364   1632124   1632124   99.985%   99.985%
> sparc         4404993   4399593   4399593   99.877%   99.877%
> sparc64       6796711   6797491   6797491  100.011%  100.011%
> x86_64       19713174  19712817  19712817   99.998%   99.998%
> xtensa              0         0         0         0         0

Thanks for running these. > There are five new failures, with either of the combine2 patches. And > all five are actually different (different symptoms, at least): > > - csky fails on libgcc build: >
> /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
> /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for:
>  1405 | }
>       | ^
> (insn 199 86 87 8 (parallel [
>             (set (reg:SI 101)
>                 (plus:SI (reg:SI 98)
>                     (const_int -32 [0xffe0])))
>             (set (reg:CC 33 c)
>                 (lt:CC (plus:SI (reg:SI 98)
>                         (const_int -32 [0xffe0]))
>                     (const_int 0 [0])))
>         ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
>      (nil))
> during RTL pass: reload
>
> Target problem? Yeah, looks like it. The pattern is:

(define_insn "*cskyv2_declt"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "const_int_operand" "Uh")))
   (set (reg:CC CSKY_CC_REGNUM)
        (lt:CC (plus:SI (match_dup 1) (match_dup 2))
               (const_int 0)))]
  "CSKY_ISA_FEATURE (2E3)"
  "declt\t%0, %1, %M2"
)

So the predicate accepts all const_ints but the constraint doesn't. > - i386 goes into an infinite loop compiling, or at least an hour or so... > Erm I forgot to record what it was compiling. I did attach a GDB... It > is something from lra_create_live_ranges. Hmm.
> - m68k: > > /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': > /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in > final_scan_insn_1, at final.c:3048 > 590 | } > | ^ > 0x10408307 final_scan_insn_1 > /home/segher/src/gcc/gcc/final.c:3048 > 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int,
Re: Add a new combine pass
Segher Boessenkool writes: > On Thu, Nov 21, 2019 at 08:32:14PM +, Richard Sandiford wrote: >> Segher Boessenkool writes: >> > It's not great if every pass invents its own version of some common >> > infrastructure thing because that common one is not suitable. >> > >> > I.e., can this be fixed somehow? Maybe just by having a restricted DU >> > chains df problem? >> >> Well, it'd probably make sense to make fwprop.c's approach available >> as a "proper" df interface at some point. Hopefully if anyone wants the >> same thing as fwprop.c, they'd do that rather than copy the code. :-) > >> >> * Updating a full, ordered def-use chain after a move is a linear-time >> >> operation, so whatever happens, we'd need to apply some kind of limit >> >> on the number of uses we maintain, with something like that integer >> >> point range for the rest. > > Yeah. > >> >> * Once we've analysed the insn and built its def-use chains, we don't >> >> look at the df_refs again until we update the chains after a successful >> >> combination. So it should be more efficient to maintain a small array >> >> of insn_info_rec pointers alongside the numerical range, rather than >> >> walk and pollute chains of df_refs and then link back the insn uids >> >> to the pass-local info. >> > >> > So you need something like combine's LOG_LINKS? Not that handling those >> > is not quadratic in the worst case, but in practice it works well. And >> > it *could* be made linear. >> >> Not sure why what I've used isn't what I need though :-) > > I am wondering the other way around :-) Is what you do for combine2 > something that would be more generally applicable/useful? That's what > I'm trying to find out :-) > > What combine does could use some improvement, if you want to hear a > more direct motivations. LOG_LINKS just skip references we cannot > handle (and some more), so we always have to do modified_between etc., > which hurts. 
The trade-offs behind the choice of representation are very specific to the pass. You'd only pick this if you wanted both to propagate definitions into uses and to move insns around. You'd also only pick it if you were happy with tracking a small number of named uses per definition. I can't think of any other passes that would prefer this over what they already use. (Combine itself is an exception, since the new pass started out as a deliberate attempt to start from scratch.) >> >> Target Tests DeltaBest Worst Median >> >> avr-elf 1341 -111401 -13824 680 -10 >> > >> > Things like this are kind of suspicious :-) >> >> Yeah. This mostly seems to come from mopping up the extra moves created >> by make_more_copies. So we have combinations like: >> >>58: r70:SF=r94:SF >> REG_DEAD r94:SF >>60: r22:SF=r70:SF >> REG_DEAD r70:SF > > Why didn't combine do this? A target problem? Seems to be because combine rejects hard-reg destinations whose classes are likely spilled (cant_combine_insn_p). This SF argument register happens to overlap POINTER_X_REGS and POINTER_Y_REGS and so we reject the combination based on POINTER_X_REGS being likely spilled. I think the same thing could happen on other targets, e.g. for TAILCALL_ADDR_REGS on aarch64. >> So there's only one case in which it isn't a win, but the number of >> tests is tiny. So I agree there's no justification for trying this in >> combine proper as things stand (and I wasn't arguing otherwise FWIW). >> I'd still like to keep it in the new pass because it does help >> *sometimes* and there's no sign yet that it has a noticeable >> compile-time cost. > > So when does it help? I can only think of cases where there are > problems elsewhere. 
The full list of affected tests (all at -O2 -ftree-vectorize) are:

arc-elf         gcc.c-torture/compile/pr67506.c
avr-elf         gcc.dg/torture/pr77916.c
bpf-elf         gcc.dg/torture/vshuf-v8hi.c
bpf-elf         gcc.dg/torture/vshuf-v4si.c
bfin-elf        gcc.dg/torture/vshuf-v8qi.c
c6x-elf         gcc.c-torture/execute/991118-1.c
cr16-elf        gcc.c-torture/compile/pr82052.c
epiphany-elf    gcc.c-torture/execute/991118-1.c
epiphany-elf    gcc.dg/pr77664.c
epiphany-elf    gcc.dg/vect/vect-mult-pattern-2.c
epiphany-elf    gcc.dg/torture/vshuf-v8hi.c
epiphany-elf    gcc.dg/tree-ssa/pr77664.c
epiphany-elf    gcc.dg/tree-ssa/negneg-3.c
fr30-elf        gcc.dg/torture/vshuf-v4hi.c
fr30-elf        gcc.dg/torture/vshuf-v8hi.c
frv-linux-gnu   gcc.dg/torture/vshuf-v4hi.c
frv-linux-gnu   gcc.dg/torture/vshuf-v8hi.c
h8300-elf       gcc.c-torture/execute/2422-1.c
h8300-elf       gcc.dg/torture/pr77916.c
ia64-linux-gnu  gcc.c-torture/execute/ieee/pr30704.c
ia64-linux-gnu  gcc.dg/vect/pr49478.c
ia64-linux-gnu  gcc.dg/tree-ssa/ldist-16.c
i686-apple-darwin
Re: Add a new combine pass
On 11/23/19 6:09 PM, Segher Boessenkool wrote: On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote: Please just CC to this conversation as I keep getting removed. Everyone who was on Cc: for this thread still is. This is how email works. If you want to see everything on the list, subscribe to the mailing list? Segher I was part of the original CCs on my comments, but it seems there were two split versions of the thread. I would just like to keep all comments merged in one thread is all. Sorry for the confusion, Segher. Nick
Re: Add a new combine pass
On Sat, Nov 23, 2019 at 06:01:28PM -0500, Nicholas Krause wrote: > Please just CC to this conversation as I keep getting removed. Everyone who was on Cc: for this thread still is. This is how email works. If you want to see everything on the list, subscribe to the mailing list? Segher
Re: Add a new combine pass
On 11/23/19 5:34 PM, Segher Boessenkool wrote: Hi! On Mon, Nov 18, 2019 at 05:55:13PM +, Richard Sandiford wrote: Richard Sandiford writes: (It's 23:35 local time, so it's still just about stage 1. :-)) Or actually, just under 1 day after end of stage 1. Oops. Could have sworn stage 1 ended on the 17th :-( Only realised I'd got it wrong when catching up on Saturday's email traffic. And inevitably, I introduced a couple of stupid mistakes while trying to clean the patch up for submission by that (non-)deadline. Here's a version that fixes an inverted overlapping memref check and that correctly prunes the use list for combined instructions. (This last one is just a compile-time saving -- the old code was correct, just suboptimal.) I've build the Linux kernel with the previous version, as well as this one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: (I've forced --param=run-combine=6 for R1 and R2): (Percentages are relative to R0): R0R1R2R1R2 alpha 6107088 6101088 6101088 99.902% 99.902% arc 4008224 4006568 4006568 99.959% 99.959% arm 9206728 9200936 9201000 99.937% 99.938% arm64 13056174 13018174 13018194 99.709% 99.709% armhf 0 0 0 0 0 c6x 2337237 2337077 2337077 99.993% 99.993% csky 3356602 0 0 0 0 h8300 1166996 1166776 1166776 99.981% 99.981% i386 11352159 0 0 0 0 ia64 18230640 18167000 18167000 99.651% 99.651% m68k 3714271 0 0 0 0 microblaze 4982749 4979945 4979945 99.944% 99.944% mips 8499309 8495205 8495205 99.952% 99.952% mips64 7042036 7039816 7039816 99.968% 99.968% nds32 4486663 0 0 0 0 nios2 3680001 3679417 3679417 99.984% 99.984% openrisc 4226076 4225868 4225868 99.995% 99.995% parisc 7681895 7680063 7680063 99.976% 99.976% parisc64 8677077 8676581 8676581 99.994% 99.994% powerpc 10687611 10682199 10682199 99.949% 99.949% powerpc64 17671082 17658570 17658570 99.929% 99.929% powerpc64le 17671082 17658570 17658570 99.929% 99.929% riscv32 1554938 1554758 1554758 99.988% 99.988% riscv64 6634342 6632788 6632788 99.977% 99.977% s390 13049643 
13014939 13014939 99.734% 99.734% sh 3254743 0 0 0 0 shnommu 1632364 1632124 1632124 99.985% 99.985% sparc 4404993 4399593 4399593 99.877% 99.877% sparc64 6796711 6797491 6797491 100.011% 100.011% x86_64 19713174 19712817 19712817 99.998% 99.998% xtensa 0 0 0 0 0 0 means it didn't build. armhf is probably my own problem, not sure yet. xtensa starts with /tmp/ccmJoY7l.s: Assembler messages: /tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object file and it doesn't get better. My powerpc64 config actually built the powerpc64le config, since the kernel since a while looks what the host system is, for its defconfig. Oh well, fixed now. There are fivew new failures, with either of the combine2 patches. And all five are actually different (different symptoms, at least): - csky fails on libgcc build: /home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi': /home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for: 1405 | } | ^ (insn 199 86 87 8 (parallel [ (set (reg:SI 101) (plus:SI (reg:SI 98) (const_int -32 [0xffe0]))) (set (reg:CC 33 c) (lt:CC (plus:SI (reg:SI 98) (const_int -32 [0xffe0])) (const_int 0 [0]))) ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt} (nil)) during RTL pass: reload Target problem? - i386 goes into an infinite loop compiling, or at least an hour or so... Erm I forgot too record what it was compiling. I did attach a GDB... It is something from lra_create_live_ranges. - m68k: /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048 590 | } | ^ 0x10408307 final_scan_insn_1 /home/segher/src/gcc/gcc/final.c:3048 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) /home/segher/src/gcc/gcc/final.c:3152 0x10408797 final_1 /home/segher/src/gcc/gcc/final.c:2020 0x104091f7 rest_of_handle_final /home/segher/src/gcc/gcc/final.c:4658 0x104091f7 execute
Re: Add a new combine pass
Hi! On Mon, Nov 18, 2019 at 05:55:13PM +0000, Richard Sandiford wrote: > Richard Sandiford writes: > > (It's 23:35 local time, so it's still just about stage 1. :-)) > > Or actually, just under 1 day after end of stage 1. Oops. > Could have sworn stage 1 ended on the 17th :-( Only realised > I'd got it wrong when catching up on Saturday's email traffic. > > And inevitably, I introduced a couple of stupid mistakes while > trying to clean the patch up for submission by that (non-)deadline. > Here's a version that fixes an inverted overlapping memref check > and that correctly prunes the use list for combined instructions. > (This last one is just a compile-time saving -- the old code was > correct, just suboptimal.) I've built the Linux kernel with the previous version, as well as this one. R0 is unmodified GCC, R1 is the first patch, R2 is this one: (I've forced --param=run-combine=6 for R1 and R2): (Percentages are relative to R0):

                   R0        R1        R2        R1        R2
alpha         6107088   6101088   6101088   99.902%   99.902%
arc           4008224   4006568   4006568   99.959%   99.959%
arm           9206728   9200936   9201000   99.937%   99.938%
arm64        13056174  13018174  13018194   99.709%   99.709%
armhf               0         0         0         0         0
c6x           2337237   2337077   2337077   99.993%   99.993%
csky          3356602         0         0         0         0
h8300         1166996   1166776   1166776   99.981%   99.981%
i386         11352159         0         0         0         0
ia64         18230640  18167000  18167000   99.651%   99.651%
m68k          3714271         0         0         0         0
microblaze    4982749   4979945   4979945   99.944%   99.944%
mips          8499309   8495205   8495205   99.952%   99.952%
mips64        7042036   7039816   7039816   99.968%   99.968%
nds32         4486663         0         0         0         0
nios2         3680001   3679417   3679417   99.984%   99.984%
openrisc      4226076   4225868   4225868   99.995%   99.995%
parisc        7681895   7680063   7680063   99.976%   99.976%
parisc64      8677077   8676581   8676581   99.994%   99.994%
powerpc      10687611  10682199  10682199   99.949%   99.949%
powerpc64    17671082  17658570  17658570   99.929%   99.929%
powerpc64le  17671082  17658570  17658570   99.929%   99.929%
riscv32       1554938   1554758   1554758   99.988%   99.988%
riscv64       6634342   6632788   6632788   99.977%   99.977%
s390         13049643  13014939  13014939   99.734%   99.734%
sh            3254743         0         0         0         0
shnommu       1632364   1632124   1632124   99.985%   99.985%
sparc         4404993   4399593   4399593   99.877%   99.877%
sparc64       6796711   6797491   6797491  100.011%  100.011%
x86_64       19713174  19712817  19712817   99.998%   99.998%
xtensa              0         0         0         0         0

0 means it didn't build. armhf is probably my own problem, not sure yet. xtensa starts with

/tmp/ccmJoY7l.s: Assembler messages:
/tmp/ccmJoY7l.s:407: Error: cannot represent `BFD_RELOC_8' relocation in object file

and it doesn't get better. My powerpc64 config actually built the powerpc64le config, since the kernel since a while looks at what the host system is, for its defconfig. Oh well, fixed now. There are five new failures, with either of the combine2 patches. And all five are actually different (different symptoms, at least):

- csky fails on libgcc build:

/home/segher/src/gcc/libgcc/fp-bit.c: In function '__fixdfsi':
/home/segher/src/gcc/libgcc/fp-bit.c:1405:1: error: unable to generate reloads for:
 1405 | }
      | ^
(insn 199 86 87 8 (parallel [
            (set (reg:SI 101)
                (plus:SI (reg:SI 98)
                    (const_int -32 [0xffe0])))
            (set (reg:CC 33 c)
                (lt:CC (plus:SI (reg:SI 98)
                        (const_int -32 [0xffe0]))
                    (const_int 0 [0])))
        ]) "/home/segher/src/gcc/libgcc/fp-bit.c":1403:23 207 {*cskyv2_declt}
     (nil))
during RTL pass: reload

Target problem?

- i386 goes into an infinite loop compiling, or at least an hour or so... Erm I forgot to record what it was compiling. I did attach a GDB... It is something from lra_create_live_ranges.
- m68k: /home/segher/src/kernel/fs/exec.c: In function 'copy_strings': /home/segher/src/kernel/fs/exec.c:590:1: internal compiler error: in final_scan_insn_1, at final.c:3048 590 | } | ^ 0x10408307 final_scan_insn_1 /home/segher/src/gcc/gcc/final.c:3048 0x10408383 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) /home/segher/src/gcc/gcc/final.c:3152 0x10408797 final_1 /home/segher/src/gcc/gcc/final.c:2020 0x104091f7 rest_of_handle_final /home/segher/src/gcc/gcc/final.c:4658 0x104091f7 execute /home/segher/src/gcc/gcc/final.c:4736 and that line is
Re: Add a new combine pass
On Thu, Nov 21, 2019 at 08:32:14PM +0000, Richard Sandiford wrote: > Segher Boessenkool writes: > > It's not great if every pass invents its own version of some common > > infrastructure thing because that common one is not suitable. > > > > I.e., can this be fixed somehow? Maybe just by having a restricted DU > > chains df problem? > > Well, it'd probably make sense to make fwprop.c's approach available > as a "proper" df interface at some point. Hopefully if anyone wants the > same thing as fwprop.c, they'd do that rather than copy the code. :-) > >> * Updating a full, ordered def-use chain after a move is a linear-time > >> operation, so whatever happens, we'd need to apply some kind of limit > >> on the number of uses we maintain, with something like that integer > >> point range for the rest. Yeah. > >> * Once we've analysed the insn and built its def-use chains, we don't > >> look at the df_refs again until we update the chains after a successful > >> combination. So it should be more efficient to maintain a small array > >> of insn_info_rec pointers alongside the numerical range, rather than > >> walk and pollute chains of df_refs and then link back the insn uids > >> to the pass-local info. > > > > So you need something like combine's LOG_LINKS? Not that handling those > > is not quadratic in the worst case, but in practice it works well. And > > it *could* be made linear. > > Not sure why what I've used isn't what I need though :-) I am wondering the other way around :-) Is what you do for combine2 something that would be more generally applicable/useful? That's what I'm trying to find out :-) What combine does could use some improvement, if you want to hear a more direct motivation. LOG_LINKS just skip references we cannot handle (and some more), so we always have to do modified_between etc., which hurts. > >> Target Tests Delta Best Worst Median > >> avr-elf 1341 -111401 -13824 680 -10 > > > > Things like this are kind of suspicious :-) > > Yeah.
This mostly seems to come from mopping up the extra moves created > by make_more_copies. So we have combinations like: > >58: r70:SF=r94:SF > REG_DEAD r94:SF >60: r22:SF=r70:SF > REG_DEAD r70:SF Why didn't combine do this? A target problem? > So there's only one case in which it isn't a win, but the number of > tests is tiny. So I agree there's no justification for trying this in > combine proper as things stand (and I wasn't arguing otherwise FWIW). > I'd still like to keep it in the new pass because it does help > *sometimes* and there's no sign yet that it has a noticeable > compile-time cost. So when does it help? I can only think of cases where there are problems elsewhere. > It might also be interesting to see how much difference it makes for > run-combine=4 (e.g. to see how much it makes up for the current 2-insn > limit)... Numbers are good :-) Segher
Re: Add a new combine pass
On Thu, Nov 21, 2019 at 07:41:56PM +, Richard Sandiford wrote: > Nicholas Krause writes: > >> +/* The instructions we're combining, in program order. */ > >> +insn_info_rec *sequence[MAX_COMBINE_INSNS]; > > Can't we can this a vec in order to grow to lengths and just loop through > > merging on instructions in the vec as required? > > Yeah, extending this to combining more than 2 instructions would be > future work. When that happens, this would likely end up becoming an > auto_vec. I imagine there would > still be a fairly low compile-time limit on the number of combinations > though. E.g. current combine has a limit of 4, with even 4 being > restricted to certain high-value cases. I don't think I've ever > seen a case where 5 or more would help. And sometimes it looks like 4 would help, but often this is because of a limitation elsewhere (like, it should have done a 2->2 before, for example). 4 _does_ help quite a bit with irregular instruction sets. It could sometimes help with RMW insns, too, but there are other problems with that. What you see a lot where 4 "helps" is where it really should combine with just 3 of them, but something prevents that, often cost, while throwing in a 4th insn tilts the balance just enough. We used to have a lot of that with 3-insn combinations as well, and probably still have some. Segher
Re: Add a new combine pass
Segher Boessenkool writes: > On Wed, Nov 20, 2019 at 06:20:34PM +, Richard Sandiford wrote: >> > Why don't you use DF for the DU chains? >> >> The problem with DF_DU_CHAIN is that it's quadratic in the worst case. > > Oh, wow. > >> fwprop.c gets around that by using the MD problem and having its own >> dominator walker to calculate limited def-use chains: >> >> /* We use the multiple definitions problem to compute our restricted >> use-def chains. */ > > It's not great if every pass invents its own version of some common > infrastructure thing because that common one is not suitable. > > I.e., can this be fixed somehow? Maybe just by having a restricted DU > chains df problem? Well, it'd probably make sense to make fwprop.c's approach available as a "proper" df interface at some point. Hopefully if anyone wants the same thing as fwprop.c, they'd do that rather than copy the code. :-) >> So taking that approach here would still require some amount of >> roll-your-own. Other reasons are: >> >> * Even what fwprop does is more elaborate than we need for now. >> >> * We need to handle memory too, and it's nice to be able to handle >> it in the same way as registers. >> >> * Updating a full, ordered def-use chain after a move is a linear-time >> operation, so whatever happens, we'd need to apply some kind of limit >> on the number of uses we maintain, with something like that integer >> point range for the rest. >> >> * Once we've analysed the insn and built its def-use chains, we don't >> look at the df_refs again until we update the chains after a successful >> combination. So it should be more efficient to maintain a small array >> of insn_info_rec pointers alongside the numerical range, rather than >> walk and pollute chains of df_refs and then link back the insn uids >> to the pass-local info. > > So you need something like combine's LOG_LINKS? Not that handling those > is not quadratic in the worst case, but in practice it works well. 
And > it *could* be made linear. Not sure why what I've used isn't what I need though :-) If it's an array vs. linked-list thing, then for the multi-use case, we need two sets of link pointers, one for "next use of the same resource" and one for "next use in this instruction". Then we need the payload of the list node itself. For the small number of entries we're talking about, using null-terminated arrays of "things that this instruction uses" and "instructions that use this resource" should be more efficient than pointer-chasing, and occupies the same space as the link pointers (i.e. saves the extra payload). We also need to be able to walk in both directions, to answer the questions:

- which insns can I combine with this definition?
- where is this value of a resource defined?
- where are the uses of this resource?
- where was the previous definition of this resource, and where was its last use?

So if we're comparing it to existing linked-list GCC structures, it's more similar to df_ref (see above for why that seemed like a bad idea) or -- more light-weight -- dep_link_t in the scheduler. And both the array and linked-list approaches still need to fall back to the simple live range once a certain threshold is hit. >> The second set is for: >> >> (B) --param run-combine=6 (both passes), use-use combinations only >> (C) --param run-combine=6 (both passes), no restrictions >>
>> Target                Tests   Delta     Best     Worst   Median
>> ====================  =====   =======   ======   =====   ======
>> aarch64-linux-gnu     272     -3844     -585     18      -1
>> aarch64_be-linux-gnu  190     -3336     -370     18      -1
>> alpha-linux-gnu       401     -2735     -370     22      -2
>> amdgcn-amdhsa         188     1867      -484     1259    -1
>> arc-elf               257     -1498     -650     54      -1
>> arm-linux-gnueabi     168     -1117     -612     680     -1
>> arm-linux-gnueabihf   168     -1117     -612     680     -1
>> avr-elf               1341    -111401   -13824   680     -10

> Things like this are kind of suspicious :-) Yeah.
So we have combinations like:

   58: r70:SF=r94:SF
       REG_DEAD r94:SF
   60: r22:SF=r70:SF
       REG_DEAD r70:SF

(r22 is a hard reg, the others are pseudos) which produces:

   std Y+1,r22
   std Y+2,r23
   std Y+3,r24
   std Y+4,r25
 - ldd r22,Y+1
 - ldd r23,Y+2
 - ldd r24,Y+3
 - ldd r25,Y+4

On the REG_EQUAL thing: you're right that it doesn't make much difference for run-combine=6:

Target     Tests   Delta   Best   Worst   Median
========   =====   =====   ====   =====   ======
arc-elf    1       -1      -1     -1      -1
avr-elf    1       -1      -1     -1      -1
bfin-elf   1       -1      -1     -1      -1
bpf-elf    2       -6      -5     -1      -5
c6x-elf
Re: Add a new combine pass
Hi Nick,

Thanks for the comments.

Nicholas Krause writes:
>> Index: gcc/passes.def
>> ===================================================================
>> --- gcc/passes.def	2019-10-29 08:29:03.224443133 +
>> +++ gcc/passes.def	2019-11-17 23:15:31.200500531 +
>> @@ -437,7 +437,9 @@ along with GCC; see the file COPYING3.
>>        NEXT_PASS (pass_inc_dec);
>>        NEXT_PASS (pass_initialize_regs);
>>        NEXT_PASS (pass_ud_rtl_dce);
>> +      NEXT_PASS (pass_combine2_before);
>>        NEXT_PASS (pass_combine);
>> +      NEXT_PASS (pass_combine2_after);
>>        NEXT_PASS (pass_if_after_combine);
>>        NEXT_PASS (pass_jump_after_combine);
>>        NEXT_PASS (pass_partition_blocks);
>> Index: gcc/timevar.def
> This is really two passes it seems or at least functions. Just a nit but you
> may want to state that as I don't recall reading that.

It's really two instances of the same pass, but yeah, each instance goes under a different name.  This is because each instance needs to know which bit of the run-combine value it should be testing:

>> The patch adds two instances of the new pass: one before combine and
>> one after it.  By default both are disabled, but this can be changed
>> using the new 3-bit run-combine param, where:
>>
>> - bit 0 selects the new pre-combine pass
>> - bit 1 selects the main combine pass
>> - bit 2 selects the new post-combine pass

So bit 0 is pass_combine2_before, bit 1 is pass_combine and bit 2 is pass_combine2_after.  But the passes are identical apart from the choice of bit they test.

>> +  /* Describes one attempt to combine instructions.  */
>> +  struct combination_attempt_rec
>> +  {
>> +    /* The instruction that we're currently trying to optimize.
>> +       If the combination succeeds, we'll use this insn_info_rec
>> +       to describe the new instruction.  */
>> +    insn_info_rec *new_home;
>> +
>> +    /* The instructions we're combining, in program order.  */
>> +    insn_info_rec *sequence[MAX_COMBINE_INSNS];
> Can't we can this a vec in order to grow to lengths and just loop through
> merging on instructions in the vec as required?
Yeah, extending this to combining more than 2 instructions would be future work.  When that happens, this would likely end up becoming an auto_vec.  I imagine there would still be a fairly low compile-time limit on the number of combinations though.  E.g. current combine has a limit of 4, with even 4 being restricted to certain high-value cases.  I don't think I've ever seen a case where 5 or more would help.

>> +/* Return true if we know that USER is the last user of RANGE.  */
>> +
>> +bool
>> +combine2::known_last_use_p (live_range_rec *range, insn_info_rec *user)
>> +{
>> +  if (range->last_extra_use <= user->point)
>> +    return false;
>> +
>> +  for (unsigned int i = 0; i < NUM_RANGE_USERS && range->users[i]; ++i)
>> +    if (range->users[i] == user)
>> +      return i == NUM_RANGE_USERS - 1 || !range->users[i + 1];
> Small nit and I could be wrong but do:
>
>   return !range->users[i + 1] || i == NUM_RANGE_USERS - 1;
>
> Based on your code it seems that the getting to NUM_RANGE_USERS is far
> less likely.

The problem is that we'll then be accessing outside the users[] array when i == NUM_RANGE_USERS - 1, so we have to check the limit first.

Thanks,
Richard
Re: Add a new combine pass
On 11/17/19 6:35 PM, Richard Sandiford wrote:

(It's 23:35 local time, so it's still just about stage 1. :-))

While working on SVE, I've noticed several cases in which we fail to combine instructions because the combined form would need to be placed earlier in the instruction stream than the last of the instructions being combined.  This includes one very important case in the handling of the first fault register (FFR).

Combine currently requires the combined instruction to live at the same location as i3.  I thought about trying to relax that restriction, but it would be difficult to do with the current pass structure while keeping everything linear-ish time.

So this patch instead goes for an option that has been talked about several times over the years: writing a new combine pass that just does instruction combination, and not all the other optimisations that have been bolted onto combine over time.  E.g. it deliberately doesn't do things like nonzero-bits tracking, since that really ought to be a separate, more global, optimisation.

This is still far from being a realistic replacement for even the combine parts of the current combine pass.  E.g.:

- it only handles combinations that can be built up from individual two-instruction combinations.
- it doesn't allow new hard register clobbers to be added.
- it doesn't have the special treatment of CC operations.
- etc.

But we have to start somewhere.

On a more positive note, the pass handles things that the current combine pass doesn't:

- the main motivating feature mentioned above: it works out where the combined instruction could validly live and moves it there if necessary.  If there are a range of valid places, it tries to pick the best one based on register pressure (although only with a simple heuristic for now).

- once it has combined two instructions, it can try combining the result with both later and earlier code, i.e. it can combine in both directions.
- it tries using REG_EQUAL notes for the final instruction.

- it can parallelise two independent instructions that both read from the same register or both read from memory.

This last feature is useful for generating more load-pair combinations on AArch64.  In some cases it can also produce more store-pair combinations, but only for consecutive stores.  However, since the pass currently does this in a very greedy, peephole way, it only allows load/store-pair combinations if the first memory access has a higher alignment than the second, i.e. if we can be sure that the combined access is naturally aligned.  This should help it to make better decisions than the post-RA peephole pass in some cases while not being too aggressive.

The pass is supposed to be linear time without debug insns.  It only tries a constant number C of combinations per instruction and its bookkeeping updates are constant-time.  Once it has combined two instructions, it'll try up to C combinations on the result, but this can be counted against the instruction that was deleted by the combination and so effectively just doubles the constant.  (Note that C depends on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)

Unfortunately, debug updates via propagate_for_debug are more expensive.  This could probably be fixed if the pass did more to track debug insns itself, but using propagate_for_debug matches combine's behaviour.

The patch adds two instances of the new pass: one before combine and one after it.  By default both are disabled, but this can be changed using the new 3-bit run-combine param, where:

- bit 0 selects the new pre-combine pass
- bit 1 selects the main combine pass
- bit 2 selects the new post-combine pass

The idea is that run-combine=3 can be used to see which combinations are missed by the new pass, while run-combine=6 (which I hope to be the production setting for AArch64 at -O2+) just uses the new pass to mop up cases that normal combine misses.
Maybe in some distant future, the pass will be good enough for run-combine=[14] to be a realistic option.

I ended up having to add yet another validate_simplify_* routine, this time to do the equivalent of:

  newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
  validate_change (insn, loc, newx, 1);

but in a more memory-efficient way.  validate_replace_rtx isn't suitable because it deliberately only tries simplifications in limited cases:

  /* Do changes needed to keep rtx consistent.  Don't do any other
     simplifications, as it is not our job.  */

And validate_simplify_insn isn't useful for this case because it works on patterns that have already had changes made to them and expects those patterns to be valid rtxes.  simplify-replace operations instead need to simplify as they go, when the original modes are still to hand.

As far as compile-time goes, I tried compiling optabs.ii at -O2 with an --enable-checking=release compiler:

  run-combine=2 (normal combine): 100.0% (baseline)
  run-combine=4
Re: Add a new combine pass
On Wed, Nov 20, 2019 at 06:20:34PM +, Richard Sandiford wrote:
> Segher Boessenkool writes:
> > So this would work if you had pseudos here, instead of the hard reg?
> > Because it is a hard reg it is the same number in both places, making it
> > hard to move.
>
> Yeah, probably.  But the hard reg is a critical part of this.
> Going back to the example:
>
> (set (reg/v:VNx16BI 102 [ ok ])
>      (reg:VNx16BI 85 ffrt))
> (set (reg:VNx16BI 85 ffrt)
>      (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
> (set (reg:CC_NZC 66 cc)
>      (unspec:CC_NZC
>        [(reg:VNx16BI 106) repeated x2
>         (const_int 1 [0x1])
>         (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))
>
> FFR is the real first fault register.  FFRT is actually a fake register
> whose only purpose is to describe the dependencies (in rtl) between writes
> to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
> depends on having only one fixed FFRT register.

Right.  The reason this cannot work in combine is that combine always combines to just *one* insn, at i3; later, if it turns out that it needs to split it, it can put something at i2.  But that doesn't even happen here, only the first and the last of those three insns are what is combined.

It is important combine only moves things forward in the insn stream, to make sure this whole process is finite.  Or this was true years ago, at least :-)

> > Why don't you use DF for the DU chains?
>
> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.

Oh, wow.

> fwprop.c gets around that by using the MD problem and having its own
> dominator walker to calculate limited def-use chains:
>
>   /* We use the multiple definitions problem to compute our restricted
>      use-def chains.  */

It's not great if every pass invents its own version of some common infrastructure thing because that common one is not suitable.  I.e., can this be fixed somehow?  Maybe just by having a restricted DU chains df problem?
> So taking that approach here would still require some amount of
> roll-your-own.  Other reasons are:
>
> * Even what fwprop does is more elaborate than we need for now.
>
> * We need to handle memory too, and it's nice to be able to handle
>   it in the same way as registers.
>
> * Updating a full, ordered def-use chain after a move is a linear-time
>   operation, so whatever happens, we'd need to apply some kind of limit
>   on the number of uses we maintain, with something like that integer
>   point range for the rest.
>
> * Once we've analysed the insn and built its def-use chains, we don't
>   look at the df_refs again until we update the chains after a successful
>   combination.  So it should be more efficient to maintain a small array
>   of insn_info_rec pointers alongside the numerical range, rather than
>   walk and pollute chains of df_refs and then link back the insn uids
>   to the pass-local info.

So you need something like combine's LOG_LINKS?  Not that handling those is not quadratic in the worst case, but in practice it works well.  And it *could* be made linear.

> >> Tracking limited def-use chains is what makes this last bit easy.
> >> We can just try parallelising two instructions from the (bounded) list
> >> of uses.  And for this case there's not any garbage rtl involved, since
> >> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> >> all in the recog call (which would obviously mount up if we went
> >> overboard).
> >
> > *All* examples above and below are just this.
>
> Yeah, the powerpc and s390x examples were.  The motivating FFR example
> above isn't though: it's a def-use combination in parallel with the
> existing definition.
Right, good point :-)

> >> >> To get a feel for the effect on multiple targets, I did my usual
> >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> >> using -O2 -ftree-vectorize:
> >> >
> >> > One problem with this is that these are very short functions on average.
> >>
> >> There are some long ones too :-)
> >
> > Yes, but this isn't a good stand-in for representative programs.
>
> Right.  And number of lines of asm isn't a good stand-in for anything much.

For combine, number of insns generated is a surprisingly good measure of how it performed.  Sometimes not, when it goes over a border of an inlining decision, say, or bb-reorder decides to duplicate more because it is cheaper now.

> Like I say, the whole thing is just to get a feel, on tests that are readily
> to hand and are easy to compile without a full toolchain.

Absolutely.  But I have no experience with using your test set, so the numbers do not necessarily mean so much to me :-)

> > So I'd love to see statistics for *only* combining two uses of the same
> > thing, this is something combine cannot do, and arguably *shouldn't* do!
>
> OK, here are two sets of results.  The first is for:
Re: Add a new combine pass
Segher Boessenkool writes:
>>      /* Make sure that the value that is to be substituted for the register
>>         does not use any registers whose values alter in between.  However,
>>         If the insns are adjacent, a use can't cross a set even though we
>>         think it might (this can happen for a sequence of insns each setting
>>         the same destination; last_set of that register might point to
>>         a NOTE).  If INSN has a REG_EQUIV note, the register is always
>>         equivalent to the memory so the substitution is valid even if there
>>         are intervening stores.  Also, don't move a volatile asm or
>>         UNSPEC_VOLATILE across any other insns.  */
>>      || (! all_adjacent
>>          && (((!MEM_P (src)
>>                || ! find_reg_note (insn, REG_EQUIV, src))
>>               && modified_between_p (src, insn, i3))
>>              || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>>              || GET_CODE (src) == UNSPEC_VOLATILE))
>
> So this would work if you had pseudos here, instead of the hard reg?
> Because it is a hard reg it is the same number in both places, making it
> hard to move.

Yeah, probably.  But the hard reg is a critical part of this.  Going back to the example:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

FFR is the real first fault register.  FFRT is actually a fake register whose only purpose is to describe the dependencies (in rtl) between writes to the FFR, reads from the FFR and first-faulting loads.  The whole scheme depends on having only one fixed FFRT register.

>> > How are dependencies represented in your new pass?  If it just does
>> > walks over the insn stream for everything, you get quadratic complexity
>> > if you move insns backwards.  We have that in combine already, mostly
>> > from modified_between_p, but that is limited because of how LOG_LINKS
>> > work, and we have been doing this for so long and there are no problems
>> > found with it, so it must work in practice.  But I am worried about it
>> > when moving insns back an unlimited distance.
>>
>> It builds def-use chains, but using a constant limit on the number of
>> explicitly-recorded uses.  All other uses go in a numerical live range
>> from which they (conservatively) never escape.  The def-use chains
>> represent memory as a single entity, a bit like in gimple.
>
> Ah.  So that range thing ensures correctness.

Yeah.

> Why don't you use DF for the DU chains?

The problem with DF_DU_CHAIN is that it's quadratic in the worst case.  fwprop.c gets around that by using the MD problem and having its own dominator walker to calculate limited def-use chains:

  /* We use the multiple definitions problem to compute our restricted
     use-def chains.  */

So taking that approach here would still require some amount of roll-your-own.  Other reasons are:

* Even what fwprop does is more elaborate than we need for now.

* We need to handle memory too, and it's nice to be able to handle it in the same way as registers.

* Updating a full, ordered def-use chain after a move is a linear-time operation, so whatever happens, we'd need to apply some kind of limit on the number of uses we maintain, with something like that integer point range for the rest.

* Once we've analysed the insn and built its def-use chains, we don't look at the df_refs again until we update the chains after a successful combination.  So it should be more efficient to maintain a small array of insn_info_rec pointers alongside the numerical range, rather than walk and pollute chains of df_refs and then link back the insn uids to the pass-local info.

>> >> - it tries using REG_EQUAL notes for the final instruction.
>> >
>> > And that.
>>
>> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
>> with i3's REG_EQUAL note and combining into that.  Does combine do that?
>> I couldn't see it, and in:
>>
>>    https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>>
>> you seemed to reject the idea of allowing it.
>
> Yes, I still do.  Do you have an example where it helps?

I'll run another set of tests for that.

>> >> - it can parallelise two independent instructions that both read from
>> >> the same register or both read from memory.
>> >
>> > That only if somehow there is a link between the two (so essentially
>> > never).  The only combinations tried by combine are those via LOG_LINKs,
>> > which are between a SET and the first corresponding use.  This is a key
>> > factor that makes it kind of linear (instead of exponential) complexity.
>>
>> Tracking limited def-use chains is what makes this last bit easy.
>> We can just try parallelising two instructions from the (bounded) list
>> of uses.  And for this case
Re: Add a new combine pass
On Tue, Nov 19, 2019 at 11:33:13AM +, Richard Sandiford wrote:
> Segher Boessenkool writes:
> > On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
> >> While working on SVE, I've noticed several cases in which we fail
> >> to combine instructions because the combined form would need to be
> >> placed earlier in the instruction stream than the last of the
> >> instructions being combined.  This includes one very important
> >> case in the handling of the first fault register (FFR).
> >
> > Do you have an example of that?
>
> It's difficult to share realistic examples at this stage since this
> isn't really the right forum for making them public for the first time.

Oh I'm very sorry.  In the future, just say "Future" and I know what you mean :-)

>     /* Make sure that the value that is to be substituted for the register
>        does not use any registers whose values alter in between.  However,
>        If the insns are adjacent, a use can't cross a set even though we
>        think it might (this can happen for a sequence of insns each setting
>        the same destination; last_set of that register might point to
>        a NOTE).  If INSN has a REG_EQUIV note, the register is always
>        equivalent to the memory so the substitution is valid even if there
>        are intervening stores.  Also, don't move a volatile asm or
>        UNSPEC_VOLATILE across any other insns.  */
>     || (! all_adjacent
>         && (((!MEM_P (src)
>               || ! find_reg_note (insn, REG_EQUIV, src))
>              && modified_between_p (src, insn, i3))
>             || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>             || GET_CODE (src) == UNSPEC_VOLATILE))

So this would work if you had pseudos here, instead of the hard reg?  Because it is a hard reg it is the same number in both places, making it hard to move.

> > How are dependencies represented in your new pass?  If it just does
> > walks over the insn stream for everything, you get quadratic complexity
> > if you move insns backwards.  We have that in combine already, mostly
> > from modified_between_p, but that is limited because of how LOG_LINKS
> > work, and we have been doing this for so long and there are no problems
> > found with it, so it must work in practice.  But I am worried about it
> > when moving insns back an unlimited distance.
>
> It builds def-use chains, but using a constant limit on the number of
> explicitly-recorded uses.  All other uses go in a numerical live range
> from which they (conservatively) never escape.  The def-use chains
> represent memory as a single entity, a bit like in gimple.

Ah.  So that range thing ensures correctness.

Why don't you use DF for the DU chains?

> >> - it tries using REG_EQUAL notes for the final instruction.
> >
> > And that.
>
> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
> with i3's REG_EQUAL note and combining into that.  Does combine do that?
> I couldn't see it, and in:
>
>    https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>
> you seemed to reject the idea of allowing it.

Yes, I still do.  Do you have an example where it helps?

> >> - it can parallelise two independent instructions that both read from
> >> the same register or both read from memory.
> >
> > That only if somehow there is a link between the two (so essentially
> > never).  The only combinations tried by combine are those via LOG_LINKs,
> > which are between a SET and the first corresponding use.  This is a key
> > factor that makes it kind of linear (instead of exponential) complexity.
>
> Tracking limited def-use chains is what makes this last bit easy.
> We can just try parallelising two instructions from the (bounded) list
> of uses.  And for this case there's not any garbage rtl involved, since
> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> all in the recog call (which would obviously mount up if we went
> overboard).

*All* examples above and below are just this.  If you disable everything else, what do the statistics look like then?

> > One thing I want to do is some mini-combine after every split, probably
> > only with the insns new from the split.  But we have no cfglayout mode
> > anymore then, and only hard regs (except in the first split pass, which
> > is just a little later than your new pass).
>
> Yeah, sounds like it could be useful.  I guess there'd need to be
> an extra condition on the combination that the new insn can't be
> immediately split.

It would run *after* split.  Not interleaved with it.

> > And amount of garbage produced?
>
> If -ftime-report stats are accurate, then the total amount of
> memory allocated is:
>
>   run-combine=2 (normal combine): 1793 kB
>   run-combine=4 (new pass only):    98 kB
>   run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)
>
> But again that's not a fair comparison when the main combine pass does more.

The way combine does SUBST is pretty
Re: Add a new combine pass
Segher Boessenkool writes:
> On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
>> While working on SVE, I've noticed several cases in which we fail
>> to combine instructions because the combined form would need to be
>> placed earlier in the instruction stream than the last of the
>> instructions being combined.  This includes one very important
>> case in the handling of the first fault register (FFR).
>
> Do you have an example of that?

It's difficult to share realistic examples at this stage since this isn't really the right forum for making them public for the first time.  But in rtl terms we have:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

and want to combine the first and third instruction at the site of the first instruction.  Current combine gives:

Trying 18 -> 24:
   18: r102:VNx16BI=ffrt:VNx16BI
   24: cc:CC_NZC=unspec[r106:VNx16BI,r106:VNx16BI,0x1,r102:VNx16BI] 104

Can't combine i2 into i3 because of:

      /* Make sure that the value that is to be substituted for the register
         does not use any registers whose values alter in between.  However,
         If the insns are adjacent, a use can't cross a set even though we
         think it might (this can happen for a sequence of insns each setting
         the same destination; last_set of that register might point to
         a NOTE).  If INSN has a REG_EQUIV note, the register is always
         equivalent to the memory so the substitution is valid even if there
         are intervening stores.  Also, don't move a volatile asm or
         UNSPEC_VOLATILE across any other insns.  */
      || (! all_adjacent
          && (((!MEM_P (src)
                || ! find_reg_note (insn, REG_EQUIV, src))
               && modified_between_p (src, insn, i3))
              || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
              || GET_CODE (src) == UNSPEC_VOLATILE))

>> Combine currently requires the combined instruction to live at the same
>> location as i3.
>
> Or i2 and i3.
>
>> I thought about trying to relax that restriction, but it
>> would be difficult to do with the current pass structure while keeping
>> everything linear-ish time.
>
> s/difficult/impossible/, yes.
>
> A long time ago we had to only move insns forward for correctness even,
> but that should no longer be required, combine always is finite by other
> means now.
>
>> So this patch instead goes for an option that has been talked about
>> several times over the years: writing a new combine pass that just
>> does instruction combination, and not all the other optimisations
>> that have been bolted onto combine over time.  E.g. it deliberately
>> doesn't do things like nonzero-bits tracking, since that really ought
>> to be a separate, more global, optimisation.
>
> In my dreams tracking nonzero bits would be a dataflow problem.
>
>> This is still far from being a realistic replacement for even the
>> combine parts of the current combine pass.  E.g.:
>>
>> - it only handles combinations that can be built up from individual
>>   two-instruction combinations.
>
> And combine does any of {2,3,4}->{1,2} combinations, and it also can
> modify a third insn ("other_insn").  For the bigger ->1 combos, if it
> *can* be decomposed in a bunch of 2->1, then those result in insns that
> are greater cost than those we started with (or else those combinations
> *would* be done).  For the ->2 combinations, there are many ways those
> two insns can be formed: it can be the two arms of a parallel, or
> combine can break a non-matching insn into two at what looks like a good
> spot for that, or it can use a define_split for it.
>
> All those things lead to many more successful combinations :-)

Right.  I definitely want to support multi-insn combos too.  It's one of the TODOs in the head comment, along with the other points in this list.  Like I say, it's not yet a realistic replacement for even the combine parts of the current pass.

>> On a more positive note, the pass handles things that the current
>> combine pass doesn't:
>>
>> - the main motivating feature mentioned above: it works out where
>>   the combined instruction could validly live and moves it there
>>   if necessary.  If there are a range of valid places, it tries
>>   to pick the best one based on register pressure (although only
>>   with a simple heuristic for now).
>
> How are dependencies represented in your new pass?  If it just does
> walks over the insn stream for everything, you get quadratic complexity
> if you move insns backwards.  We have that in combine already, mostly
> from modified_between_p, but that is limited because of how LOG_LINKS
> work, and we have been doing this for so long and there are no
Re: Add a new combine pass
Hi!

On Sun, Nov 17, 2019 at 11:35:26PM +, Richard Sandiford wrote:
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).

Do you have an example of that?

> Combine currently requires the combined instruction to live at the same
> location as i3.

Or i2 and i3.

> I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.

s/difficult/impossible/, yes.

A long time ago we had to only move insns forward for correctness even, but that should no longer be required, combine always is finite by other means now.

> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.

In my dreams tracking nonzero bits would be a dataflow problem.

> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
>
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.

And combine does any of {2,3,4}->{1,2} combinations, and it also can modify a third insn ("other_insn").  For the bigger ->1 combos, if it *can* be decomposed in a bunch of 2->1, then those result in insns that are greater cost than those we started with (or else those combinations *would* be done).  For the ->2 combinations, there are many ways those two insns can be formed: it can be the two arms of a parallel, or combine can break a non-matching insn into two at what looks like a good spot for that, or it can use a define_split for it.

All those things lead to many more successful combinations :-)

> On a more positive note, the pass handles things that the current
> combine pass doesn't:
>
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).

How are dependencies represented in your new pass?  If it just does walks over the insn stream for everything, you get quadratic complexity if you move insns backwards.  We have that in combine already, mostly from modified_between_p, but that is limited because of how LOG_LINKS work, and we have been doing this for so long and there are no problems found with it, so it must work in practice.  But I am worried about it when moving insns back an unlimited distance.

If combine results in two insns it puts them at i2 and i3, and it can actually move a SET to i2 that was at i3 before the combination.

> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.

That is what combine does, too.

> - it tries using REG_EQUAL notes for the final instruction.

And that.

> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.

That only if somehow there is a link between the two (so essentially never).  The only combinations tried by combine are those via LOG_LINKs, which are between a SET and the first corresponding use.  This is a key factor that makes it kind of linear (instead of exponential) complexity.

> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.

But how many other insns does it look at, say by modified_between_p or the like?

> The patch adds two instances of the new pass: one before combine and
> one after it.

One thing I want to do is some mini-combine after every split, probably only with the insns new from the split.  But we have no cfglayout mode anymore then, and only hard regs (except in the first split pass, which is just a little later than your new pass).

> As far as compile-time goes, I tried compiling optabs.ii at -O2
> with an --enable-checking=release compiler:
>
>   run-combine=2 (normal combine): 100.0% (baseline)
>   run-combine=4 (new pass only):   98.0%
>   run-combine=6 (both passes):    100.3%
>
> where the results are easily outside the noise.  So the pass on
> its own is quicker than combine, but that's not a fair comparison
> when it doesn't do everything combine does.  Running both passes
> only has a slight overhead.

And amount of garbage
Re: Add a new combine pass
Richard Sandiford writes:
> (It's 23:35 local time, so it's still just about stage 1. :-))

Or actually, just under 1 day after end of stage 1.  Oops.  Could have
sworn stage 1 ended on the 17th :-(  Only realised I'd got it wrong
when catching up on Saturday's email traffic.

And inevitably, I introduced a couple of stupid mistakes while trying
to clean the patch up for submission by that (non-)deadline.  Here's a
version that fixes an inverted overlapping memref check and that
correctly prunes the use list for combined instructions.  (This last
one is just a compile-time saving -- the old code was correct, just
suboptimal.)

And those comparisons that looked too good to be true were: I'd bodged
the choice of run-combine parameters when setting up the tests.  All in
all, not a great day.  Here are the (much less impressive) real values:

Target                 Tests    Delta    Best  Worst Median
======                 =====    =====    ====  ===== ======
aarch64-linux-gnu        412     -786    -270    520     -1
aarch64_be-linux-gnu     288    -3314    -270     33     -1
alpha-linux-gnu          399    -2721    -370     22     -2
amdgcn-amdhsa            201     1938    -484   1259     -1
arc-elf                  530    -5901   -1529    356     -1
arm-linux-gnueabi        193    -1167    -612    680     -1
arm-linux-gnueabihf      193    -1167    -612    680     -1
avr-elf                 1331  -111093  -13824    680     -9
bfin-elf                1347   -18928   -8461    465     -2
bpf-elf                   63     -475     -60      6     -2
c6x-elf                  183   -10508  -10084     41     -2
cr16-elf                1610   -51360  -10657     42    -13
cris-elf                 143    -1534    -702      4     -2
csky-elf                 136    -3371    -474      6     -2
epiphany-elf             178     -389    -149     84     -1
fr30-elf                 161    -1756    -756    289     -2
frv-linux-gnu            807   -13324   -2074     67     -1
ft32-elf                 282    -1666    -111      5     -2
h8300-elf                522   -11451   -1747     68     -3
hppa64-hp-hpux11.23      186     -848    -142     34     -1
i686-apple-darwin        344    -1298     -56     44     -1
i686-pc-linux-gnu        242    -1953    -556     33     -1
ia64-linux-gnu           150    -4834   -1134     40     -4
iq2000-elf               177    -1333     -61      3     -2
lm32-elf                 193    -1792    -316     47     -2
m32r-elf                  73     -595     -98     11     -2
m68k-linux-gnu           210    -2351    -332    148     -2
mcore-elf                133    -1213    -146      7     -1
microblaze-elf           445    -4493   -2094     32     -2
mipsel-linux-gnu         134    -2038    -222     60     -2
mmix                     108     -233     -26      4     -1
mn10300-elf              224    -1024    -234     80     -1
moxie-rtems              154     -743     -79      4     -2
msp430-elf               182     -586     -63     19     -1
nds32le-elf              267     -485     -37    136     -1
nios2-linux-gnu           83     -323     -66      5     -1
nvptx-none               568    -1124    -208     16      1
or1k-elf                  61     -281     -25      4     -1
pdp11                    248    -1292    -182     83     -1
powerpc-ibm-aix7.0      1288    -3031    -370   2046     -1
powerpc64-linux-gnu     1118      692    -274   2934     -2
powerpc64le-linux-gnu   1044    -4719    -688    156     -1
pru-elf                   48    -7014   -6921      6     -1
riscv32-elf               63    -1364    -139      7     -2
riscv64-elf               91    -1557    -264      7     -1
rl78-elf                 354   -16805   -1665     42     -6
rx-elf                    95     -186     -53      8     -1
s390-linux-gnu           184    -2282   -1485     63     -1
s390x-linux-gnu          257     -363    -159    522     -1
sh-linux-gnu             225     -405    -108     68     -1
sparc-linux-gnu          164     -859     -99     18     -1
sparc64-linux-gnu        169     -791    -102     15     -1
tilepro-linux-gnu       1037    -4896    -315    332     -2
v850-elf                  54     -408     -53      3     -2
vax-netbsdelf            251    -3315    -400      2     -2
visium-elf               101     -693    -138     16     -1
x86_64-darwin            350    -2145    -490     72     -1
x86_64-linux-gnu         311     -853    -288    210     -1
xstormy16-elf            219     -770    -156     59     -1
xtensa-elf               201    -1418    -322     36      1

Also, the number of LDPs on aarch64-linux-gnu went up from 3543 to
5235.  The number of STPs went up from 10494 to 12151.  All the new
pairs should be aligned ones.

Retested on aarch64-linux-gnu and x86_64-linux-gnu.  It missed the
deadline, but I thought I'd post it anyway to put the record straight.

Thanks,
Richard

2019-11-18  Richard Sandiford

gcc/
Re: Add a new combine pass
On Sun, Nov 17, 2019 at 3:35 PM Richard Sandiford wrote:
>
> (It's 23:35 local time, so it's still just about stage 1. :-))
>
> While working on SVE, I've noticed several cases in which we fail
> to combine instructions because the combined form would need to be
> placed earlier in the instruction stream than the last of the
> instructions being combined.  This includes one very important
> case in the handling of the first fault register (FFR).
>
> Combine currently requires the combined instruction to live at the same
> location as i3.  I thought about trying to relax that restriction, but it
> would be difficult to do with the current pass structure while keeping
> everything linear-ish time.
>
> So this patch instead goes for an option that has been talked about
> several times over the years: writing a new combine pass that just
> does instruction combination, and not all the other optimisations
> that have been bolted onto combine over time.  E.g. it deliberately
> doesn't do things like nonzero-bits tracking, since that really ought
> to be a separate, more global, optimisation.
>
> This is still far from being a realistic replacement for even the
> combine parts of the current combine pass.  E.g.:
>
> - it only handles combinations that can be built up from individual
>   two-instruction combinations.
>
> - it doesn't allow new hard register clobbers to be added.
>
> - it doesn't have the special treatment of CC operations.
>
> - etc.
>
> But we have to start somewhere.
>
> On a more positive note, the pass handles things that the current
> combine pass doesn't:
>
> - the main motivating feature mentioned above: it works out where
>   the combined instruction could validly live and moves it there
>   if necessary.  If there are a range of valid places, it tries
>   to pick the best one based on register pressure (although only
>   with a simple heuristic for now).
>
> - once it has combined two instructions, it can try combining the
>   result with both later and earlier code, i.e. it can combine
>   in both directions.
>
> - it tries using REG_EQUAL notes for the final instruction.
>
> - it can parallelise two independent instructions that both read from
>   the same register or both read from memory.
>
> This last feature is useful for generating more load-pair combinations
> on AArch64.  In some cases it can also produce more store-pair
> combinations, but only for consecutive stores.  However, since the pass
> currently does this in a very greedy, peephole way, it only allows
> load/store-pair combinations if the first memory access has a higher
> alignment than the second, i.e. if we can be sure that the combined
> access is naturally aligned.  This should help it to make better
> decisions than the post-RA peephole pass in some cases while not being
> too aggressive.
>
> The pass is supposed to be linear time without debug insns.
> It only tries a constant number C of combinations per instruction
> and its bookkeeping updates are constant-time.  Once it has combined two
> instructions, it'll try up to C combinations on the result, but this can
> be counted against the instruction that was deleted by the combination
> and so effectively just doubles the constant.  (Note that C depends
> on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)
>
> Unfortunately, debug updates via propagate_for_debug are more expensive.
> This could probably be fixed if the pass did more to track debug insns
> itself, but using propagate_for_debug matches combine's behaviour.
>
> The patch adds two instances of the new pass: one before combine and
> one after it.
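[The load/store-pair gating rule quoted above can be sketched in a few
lines.  This is an illustrative Python model with an invented helper
name and a byte-based interface, not code from the patch: the combined
access spans twice the single access size starting at the first access,
so it is naturally aligned exactly when the first access is aligned to
the combined size.]

```python
# Sketch of the pair-gating rule quoted above.  Hypothetical helper
# for illustration only; alignments and sizes are in bytes, with
# alignments assumed to be powers of two.

def pair_is_naturally_aligned(first_align, access_size):
    """The combined access covers 2 * access_size bytes starting at
    the first access, so natural alignment of the pair requires the
    first access to be aligned to the combined size."""
    return first_align >= 2 * access_size
```

[For consecutive accesses this matches the "first access has a higher
alignment than the second" wording: if the first address is aligned to
2 * access_size, the second address, offset by access_size, has a
guaranteed alignment of exactly access_size.]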
> By default both are disabled, but this can be changed
> using the new 3-bit run-combine param, where:
>
> - bit 0 selects the new pre-combine pass
> - bit 1 selects the main combine pass
> - bit 2 selects the new post-combine pass
>
> The idea is that run-combine=3 can be used to see which combinations
> are missed by the new pass, while run-combine=6 (which I hope to be
> the production setting for AArch64 at -O2+) just uses the new pass
> to mop up cases that normal combine misses.  Maybe in some distant
> future, the pass will be good enough for run-combine=[14] to be a
> realistic option.
>
> I ended up having to add yet another validate_simplify_* routine,
> this time to do the equivalent of:
>
>     newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
>     validate_change (insn, loc, newx, 1);
>
> but in a more memory-efficient way.  validate_replace_rtx isn't suitable
> because it deliberately only tries simplifications in limited cases:
>
>   /* Do changes needed to keep rtx consistent.  Don't do any other
>      simplifications, as it is not our job.  */
>
> And validate_simplify_insn isn't useful for this case because it works
> on patterns that have already had changes made to them and expects
> those patterns to be valid rtxes.  simplify-replace operations instead
> need to simplify as they go, when the original modes are still to
Add a new combine pass
(It's 23:35 local time, so it's still just about stage 1. :-))

While working on SVE, I've noticed several cases in which we fail
to combine instructions because the combined form would need to be
placed earlier in the instruction stream than the last of the
instructions being combined.  This includes one very important
case in the handling of the first fault register (FFR).

Combine currently requires the combined instruction to live at the same
location as i3.  I thought about trying to relax that restriction, but it
would be difficult to do with the current pass structure while keeping
everything linear-ish time.

So this patch instead goes for an option that has been talked about
several times over the years: writing a new combine pass that just
does instruction combination, and not all the other optimisations
that have been bolted onto combine over time.  E.g. it deliberately
doesn't do things like nonzero-bits tracking, since that really ought
to be a separate, more global, optimisation.

This is still far from being a realistic replacement for even the
combine parts of the current combine pass.  E.g.:

- it only handles combinations that can be built up from individual
  two-instruction combinations.

- it doesn't allow new hard register clobbers to be added.

- it doesn't have the special treatment of CC operations.

- etc.

But we have to start somewhere.

On a more positive note, the pass handles things that the current
combine pass doesn't:

- the main motivating feature mentioned above: it works out where
  the combined instruction could validly live and moves it there
  if necessary.  If there are a range of valid places, it tries
  to pick the best one based on register pressure (although only
  with a simple heuristic for now).

- once it has combined two instructions, it can try combining the
  result with both later and earlier code, i.e. it can combine
  in both directions.

- it tries using REG_EQUAL notes for the final instruction.
- it can parallelise two independent instructions that both read from
  the same register or both read from memory.

This last feature is useful for generating more load-pair combinations
on AArch64.  In some cases it can also produce more store-pair
combinations, but only for consecutive stores.  However, since the pass
currently does this in a very greedy, peephole way, it only allows
load/store-pair combinations if the first memory access has a higher
alignment than the second, i.e. if we can be sure that the combined
access is naturally aligned.  This should help it to make better
decisions than the post-RA peephole pass in some cases while not being
too aggressive.

The pass is supposed to be linear time without debug insns.
It only tries a constant number C of combinations per instruction
and its bookkeeping updates are constant-time.  Once it has combined two
instructions, it'll try up to C combinations on the result, but this can
be counted against the instruction that was deleted by the combination
and so effectively just doubles the constant.  (Note that C depends
on MAX_RECOG_OPERANDS and the new NUM_RANGE_USERS constant.)

Unfortunately, debug updates via propagate_for_debug are more expensive.
This could probably be fixed if the pass did more to track debug insns
itself, but using propagate_for_debug matches combine's behaviour.

The patch adds two instances of the new pass: one before combine and
one after it.  By default both are disabled, but this can be changed
using the new 3-bit run-combine param, where:

- bit 0 selects the new pre-combine pass
- bit 1 selects the main combine pass
- bit 2 selects the new post-combine pass

The idea is that run-combine=3 can be used to see which combinations
are missed by the new pass, while run-combine=6 (which I hope to be
the production setting for AArch64 at -O2+) just uses the new pass
to mop up cases that normal combine misses.
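[The three bits described above compose mechanically into the settings
the email discusses (run-combine=3 for comparing against combine,
run-combine=6 for the mop-up configuration).  A minimal sketch of the
decoding -- the helper name is invented for illustration, not part of
the patch:]

```python
# Decode the 3-bit run-combine parameter described above:
#   bit 0 -> new pre-combine pass
#   bit 1 -> main combine pass
#   bit 2 -> new post-combine pass
# Illustrative helper; the name is not from the patch.

def decode_run_combine(value):
    return {
        "pre_combine": bool(value & 1),
        "main_combine": bool(value & 2),
        "post_combine": bool(value & 4),
    }
```

[So run-combine=3 runs the new pre-combine pass alongside normal
combine, and run-combine=6 runs normal combine followed by the new
post-combine pass.]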
Maybe in some distant future, the pass will be good enough for
run-combine=[14] to be a realistic option.

I ended up having to add yet another validate_simplify_* routine,
this time to do the equivalent of:

    newx = simplify_replace_rtx (*loc, old_rtx, new_rtx);
    validate_change (insn, loc, newx, 1);

but in a more memory-efficient way.  validate_replace_rtx isn't suitable
because it deliberately only tries simplifications in limited cases:

  /* Do changes needed to keep rtx consistent.  Don't do any other
     simplifications, as it is not our job.  */

And validate_simplify_insn isn't useful for this case because it works
on patterns that have already had changes made to them and expects
those patterns to be valid rtxes.  simplify-replace operations instead
need to simplify as they go, when the original modes are still to hand.

As far as compile-time goes, I tried compiling optabs.ii at -O2
with an --enable-checking=release compiler:

  run-combine=2 (normal combine): 100.0% (baseline)
  run-combine=4 (new pass only)    98.0%
  run-combine=6 (both passes)