On Mon, May 15, 2017 at 9:27 AM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Fri, May 12, 2017 at 7:51 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
>> If you look at certain testcases like the one for PR78972, you'll find that
>> the code generated by TER is maximally pessimal in terms of register
>> pressure: we can generate a large number of intermediate results, and defer
>> all the statements that use them up.
>>
>> Another observation one can make is that doing TER doesn't actually buy us
>> anything for a large subset of the values it finds: only a handful of places
>> in the expand phase actually make use of the information. In cases where we
>> know we aren't going to be making use of it, we could move expressions
>> freely without doing TER-style substitution.
>>
>> This patch uses the information collected by TER about the moveability of
>> statements and performs a mini scheduling pass with the aim of reducing
>> register pressure. The heuristic is fairly simple: something that consumes
>> more values than it produces is preferred. This could be tuned further, but
>> it already produces pretty good results: for the 78972 testcase, the stack
>> size is reduced from 2448 bytes to 288, and for PR80283, the stackframe of
>> 496 bytes vanishes with the pass enabled.
>>
>> In terms of benchmarks I've run SPEC a few times, and the last set of
>> results showed not much of a change. Getting reproducible results has been
>> tricky but all runs I made have been within 0%-1% improvement.
>>
>> In this patch, the changed behaviour is gated with a -fschedule-ter option
>> which is off by default; with that default it bootstraps and tests without
>> regressions. The compiler also bootstraps with the option enabled, in that
>> case there are some optimization issues.
>> I'll address some of them with two
>> followup patches, the remaining failures are:
>>  * a handful of guality/PR43077.c failures
>>    Debug insn generation is somewhat changed, and the peephole2 pass
>>    drops one of them on the floor.
>>  * three target/i386/bmi-* tests fail. These expect the combiner to
>>    build certain instruction patterns, and that turns out to be a
>>    little fragile. It would be nice to be able to use match.pd to
>>    produce target-specific patterns during expand.
>>
>> Thoughts? Ok to apply?
>
> I appreciate that you experimented with partially disabling TER. Last year
> I tried to work towards this in a more aggressive way:
>
> https://gcc.gnu.org/ml/gcc-patches/2016-06/msg02062.html
>
> that patch tried to preserve the scheduling effect of TER because on my
> list of nice things to have there's a GIMPLE scheduling pass that should
> try to reduce (SSA) register pressure and that can work with GIMPLE
> data dependences.
>
> One of the goals of the patch above was to actually _see_ the scheduling
> effects in the IL.
>
> So what I'd like to see is a simple single-BB scheduling pass right before
> RTL expansion (so we get a dump file). That can use your logic (and
> "TERable" would be simply having single-uses). The advantage of doing
> this before RTL expansion is that coalescing can benefit from the scheduling
> as well.

I posted patches last week for a simple scheduler pass driven by register
pressure, but it is based entirely on live range information.

> Then simply disable TER for the decide_schedule_stmt () defs during
> RTL expansion.
>
> That means the effect of TER scheduling is not fully visible but we're
> a step closer. It also means that some of the scheduling we did
> in the simple scheduler persists anyway because coalescing / TER
> wasn't going to undo it anyway.
>
> In the (very) distant future I'd like to perform (more) instruction selection
> on GIMPLE so that all the benefits of TER are applied before RTL
> expansion.
>
> +      tree_code c = gimple_assign_rhs_code (use_stmt);
> +      if (TREE_CODE_CLASS (c) != tcc_comparison
> +          && c != FMA_EXPR
> +          && c != SSA_NAME
> +          && c != MEM_REF
> +          && c != TARGET_MEM_REF
> +          && def_c != VIEW_CONVERT_EXPR)
>
> I think on some archs it was important to handle combining
> POINTER_PLUS_EXPR with NEGATE_EXPR of the offset.
>
> Anyway, the effects of TER and where it matters are hard to
> see given its recursive nature (and the history of trying to
> preserve expanding of "large" GENERIC trees ...). One would
> think combine should be able to handle all those cases
> (for example the FMA_EXPR one from above), but it clearly
> isn't (esp. in the case of forwarding memory references).

Another example: on aarch64, TER lets us generate conditional compare
(ccmp) instructions, which combine can't produce if TER is disabled.
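
To make the heuristic quoted above concrete ("something that consumes more
values than it produces is preferred"), here is a rough standalone sketch.
It is not the code from the patch and uses no GCC data structures: Stmt,
schedule and peak_live are made-up names, statements are assumed to depend
on each other only through the values they use (no memory ordering), each
statement lists a given operand at most once, and a statement counts as
producing a value only if something in the block uses it.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical statement model, not a GCC type: a statement is described
// only by the indices of the earlier statements whose results it reads.
struct Stmt {
  std::vector<int> uses;
};

// Number of statements in the block that read each statement's result.
static std::vector<int> count_users(const std::vector<Stmt>& bb) {
  std::vector<int> users(bb.size(), 0);
  for (const Stmt& s : bb)
    for (int d : s.uses)
      ++users[d];
  return users;
}

// Peak number of simultaneously live values when the block is executed in
// the given order.  A value dies at its last use and is born at its def.
int peak_live(const std::vector<Stmt>& bb, const std::vector<int>& order) {
  const std::vector<int> users = count_users(bb);
  std::vector<int> remaining = users;
  int live = 0, peak = 0;
  for (int s : order) {
    for (int d : bb[s].uses)
      if (--remaining[d] == 0) --live;   // last use: operand dies
    if (users[s] > 0) ++live;            // result becomes live
    peak = std::max(peak, live);
  }
  return peak;
}

// Greedy single-block scheduling: among ready statements, pick the one
// with the highest (values consumed - values produced); ties go to the
// statement that came first in the original order.
std::vector<int> schedule(const std::vector<Stmt>& bb) {
  const int n = static_cast<int>(bb.size());
  const std::vector<int> users = count_users(bb);
  std::vector<int> remaining = users;
  std::vector<bool> done(n, false);
  std::vector<int> order;
  order.reserve(n);
  while (static_cast<int>(order.size()) < n) {
    int best = -1, best_score = 0;
    for (int s = 0; s < n; ++s) {
      if (done[s]) continue;
      bool ready = true;
      int killed = 0;
      for (int d : bb[s].uses) {
        assert(d >= 0 && d < s);           // uses must refer to earlier stmts
        if (!done[d]) ready = false;       // operand not computed yet
        if (remaining[d] == 1) ++killed;   // this would be the last use
      }
      if (!ready) continue;
      const int score = killed - (users[s] > 0 ? 1 : 0);
      if (best < 0 || score > best_score) {
        best = s;
        best_score = score;
      }
    }
    // Uses only point backwards, so a ready statement always exists.
    done[best] = true;
    for (int d : bb[best].uses) --remaining[d];
    order.push_back(best);
  }
  return order;
}
```

For a block with four independent defs followed by four statements that
each consume one of them, schedule () interleaves each def with its
consumer, and peak_live () drops from 4 for the original order to 1 for
the scheduled one. A real pass would also have to respect the moveability
information TER collects, which this sketch ignores.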

Thanks,
bin
>
> Richard.
>
>>
>> Bernd