On Mon, May 15, 2017 at 9:27 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, May 12, 2017 at 7:51 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
>> If you look at certain testcases like the one for PR78972, you'll find that
>> the code generated by TER is maximally pessimal in terms of register
>> pressure: we can generate a large number of intermediate results, and defer
>> all the statements that use them up.
>>
>> Another observation one can make is that doing TER doesn't actually buy us
>> anything for a large subset of the values it finds: only a handful of places
>> in the expand phase actually make use of the information. In cases where we
>> know we aren't going to be making use of it, we could move expressions
>> freely without doing TER-style substitution.
>>
>> This patch uses the information collected by TER about the moveability of
>> statements and performs a mini scheduling pass with the aim of reducing
>> register pressure. The heuristic is fairly simple: something that consumes
>> more values than it produces is preferred. This could be tuned further, but
>> it already produces pretty good results: for the 78972 testcase, the stack
>> size is reduced from 2448 bytes to 288, and for PR80283, the stackframe of
>> 496 bytes vanishes with the pass enabled.
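To make the heuristic above concrete, here is a minimal sketch of a greedy
single-block scheduler that prefers statements consuming more values than they
produce. It is not the code from the patch: the `stmt` struct, the `schedule`
and `example_peak` functions and the consume/produce counts are all
hypothetical, and "live values" is only a rough stand-in for register pressure.

```c
#define MAX_STMTS 16

/* Hypothetical, simplified statement model; not GCC's data structures.  */
struct stmt
{
  int n_deps;    /* number of statements this one depends on */
  int deps[4];   /* indices of those statements */
  int consumes;  /* values whose live range ends at this statement */
  int produces;  /* values whose live range starts at this statement */
};

/* Greedy list scheduling: repeatedly pick the ready statement with the
   largest (consumes - produces); ties keep the original order.  Stores the
   chosen order in ORDER and returns the peak number of values that are
   live after any statement.  Assumes N <= MAX_STMTS and acyclic deps.  */
int
schedule (const struct stmt *s, int n, int *order)
{
  int done[MAX_STMTS] = { 0 };
  int live = 0, peak = 0;

  for (int pos = 0; pos < n; pos++)
    {
      int best = -1;
      for (int i = 0; i < n; i++)
        {
          if (done[i])
            continue;
          /* A statement is ready once all its dependencies are placed.  */
          int ready = 1;
          for (int d = 0; d < s[i].n_deps; d++)
            if (!done[s[i].deps[d]])
              ready = 0;
          if (!ready)
            continue;
          if (best < 0
              || (s[i].consumes - s[i].produces
                  > s[best].consumes - s[best].produces))
            best = i;
        }
      done[best] = 1;
      order[pos] = best;
      live += s[best].produces - s[best].consumes;
      if (live > peak)
        peak = live;
    }
  return peak;
}

/* Example block: a, b, c, d are loads; t1 = a + b; t2 = c + d;
   r = t1 + t2.  The source order issues all four loads first.  */
int
example_peak (int *order)
{
  const struct stmt s[7] = {
    { 0, { 0 }, 0, 1 },      /* 0: a */
    { 0, { 0 }, 0, 1 },      /* 1: b */
    { 0, { 0 }, 0, 1 },      /* 2: c */
    { 0, { 0 }, 0, 1 },      /* 3: d */
    { 2, { 0, 1 }, 2, 1 },   /* 4: t1 = a + b */
    { 2, { 2, 3 }, 2, 1 },   /* 5: t2 = c + d */
    { 2, { 4, 5 }, 2, 1 },   /* 6: r = t1 + t2 */
  };
  return schedule (s, 7, order);
}
```

In the source order four values are live after the loads. The greedy order is
0 1 4 2 3 5 6, which moves t1 up next to its operands and peaks at three live
values.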
>>
>> In terms of benchmarks I've run SPEC a few times, and the last set of
>> results showed not much of a change. Getting reproducible results has been
>> tricky but all runs I made have been within 0%-1% improvement.
>>
>> In this patch, the changed behaviour is gated with a -fschedule-ter option
>> which is off by default; with that default it bootstraps and tests without
>> regressions. The compiler also bootstraps with the option enabled, but in
>> that case there are some optimization issues. I'll address some of them with
>> two followup patches; the remaining failures are:
>>  * a handful of guality/PR43077.c failures
>>    Debug insn generation is somewhat changed, and the peephole2 pass
>>    drops one of them on the floor.
>>  * three target/i386/bmi-* tests fail. These expect the combiner to
>>    build certain instruction patterns, and that turns out to be a
>>    little fragile. It would be nice to be able to use match.pd to
>>    produce target-specific patterns during expand.
>>
>> Thoughts? Ok to apply?
>
> I appreciate that you experimented with partially disabling TER.  Last year
> I tried to work towards this in a more aggressive way:
>
> https://gcc.gnu.org/ml/gcc-patches/2016-06/msg02062.html
>
> That patch tried to preserve the scheduling effect of TER because one of the
> nice-to-have items on my list is a GIMPLE scheduling pass that should
> try to reduce (SSA) register pressure and that can work with GIMPLE
> data dependences.
>
> One of the goals of the patch above was to actually _see_ the scheduling
> effects in the IL.
>
> So what I'd like to see is a simple single-BB scheduling pass right before
> RTL expansion (so we get a dump file).  That can use your logic (and
> "TERable" would be simply having single-uses).  The advantage of doing
> this before RTL expansion is that coalescing can benefit from the scheduling
> as well.
I had a simple scheduler pass based on the register pressure patches
posted last week, but it relies entirely on live range information.
>
> Then simply disable TER for the decide_schedule_stmt () defs during
> RTL expansion.
>
> That means the effect of TER scheduling is not fully visible but we're
> a step closer.  It also means that some of the scheduling we did
> in the simple scheduler persists because coalescing / TER
> wasn't going to undo it anyway.
>
> In the (very) distant future I'd like to perform (more) instruction selection
> on GIMPLE so that all the benefits of TER are applied before RTL
> expansion.
>
> +      tree_code c = gimple_assign_rhs_code (use_stmt);
> +      if (TREE_CODE_CLASS (c) != tcc_comparison
> +         && c != FMA_EXPR
> +         && c != SSA_NAME
> +         && c != MEM_REF
> +         && c != TARGET_MEM_REF
> +         && def_c != VIEW_CONVERT_EXPR)
>
> I think on some archs it was important to handle combining
> POINTER_PLUS_EXPR with NEGATE_EXPR of the offset.
>
> Anyway, the effects of TER and where it matters are hard to
> see given its recursive nature (and the history of trying to
> preserve expanding of "large" GENERIC trees ...).  One would
> think combine should be able to handle all those cases
> (for example the FMA_EXPR one from above), but it clearly
> can't (esp. in the case of forwarding memory references).
Another example: on aarch64, TER lets expand generate conditional compare
(ccmp) instructions, which combine cannot produce when TER is disabled.
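
For illustration, this is the kind of source involved. The function name is
made up, and whether a cmp/ccmp sequence is actually emitted depends on the
GCC version and options, so treat it as a sketch rather than a guaranteed
testcase:

```c
/* Two comparisons joined by &&.  On aarch64 a condition of this shape is
   a candidate for a cmp followed by a ccmp rather than two separate
   compare-and-branch sequences.  */
int
both_match (int a, int b)
{
  return a == 0 && b == 1;
}
```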

Thanks,
bin
>
> Richard.
>
>>
>> Bernd
