On 04/20/2015 08:04 AM, Kyrill Tkachov wrote:
Hi all,
I'm trying to reduce the cases where the midend calls the backend rtx
costs on bogus rtl for which the backend
doesn't have patterns or ways of handling. Having to handle these kinds
of rtxes sanely bloats those
functions and makes them harder to maintain.
One of the cases where this occurs is in combine and
distribute_and_simplify_rtx in particular.
Citing the comment at that function:
" See if X is of the form (* (+ A B) C), and if so convert to
(+ (* A C) (* B C)) and try to simplify.
Most of the time, this results in no change. However, if some of
the operands are the same or inverses of each other, simplifications
will result."
The problem is that after it applies the distributive law it calls rtx
costs
to figure out whether the rtx became simpler. This rtx can get pretty
complex.
For example, on arm I've seen it try to cost:
(plus:SI (mult:SI (plus:SI (reg:SI 232 [ m1 ])
(const_int 1 [0x1]))
(reg:SI 232 [ m1 ]))
(plus:SI (reg:SI 232 [ m1 ])
(const_int 1 [0x1])))
which is never going to match anything on arm anyway, so why should the
costs function handle it?
In any case, I believe combine's design is such that it should first be
attempting to call
recog and split on the rtxes, and only if that succeeds should it be
making a target-specific
decision on which rtx to prefer. distribute_and_simplify_rtx goes
against that by calling
rtx costs on an unverified rtx in attempt to gauge its complexity.
This patch remedies that by removing the call to rtx costs and instead
manually performing
a relatively simple check on whether the resultant rtx was simplified.
That is, using the example
from the comment, whether (+ (* A C) (* B C)) still has + at the top and
* in the two operands.
This should give a good indication on whether any meaningful
simplification was made (The '+' and '*'
operators in the example can be any operators that can be distributed
over).
Initially, I wanted to just return the distributed version and let recog
reject the invalid rtxes
but that caused some code quality regressions on arm where the original
rtx would not recog but
would match a beneficial splitter, whereas the distributed rtx would not.
With this patch I saw almost no codegen differences on arm for the whole
of SPEC2006.
The one exception was 416.gamess where it managed to merge a mul and an
add into an mla
which resulted in a slightly better code sequence. That was in a pretty
large file and I
don't speak Fortran'ese, so I couldn't really reduce a testcase for it,
but my guess is that
before the patch the costs would return some essentially random value
for an arbitrarily complex rtx
that it was passed to, which changed the decision in
distribute_and_simplify_rtx on whether
to return the distributed rtx, which could have impacted further
optimisations in combine.
I tried it on x86_64 as well. Again, there were almost no codegen
differences. The exception
was tonto and wrf where a few instructions were eliminated, but no
significant difference.
The resultant binaries for these two were a tiny bit smaller, with no
impact on runtime.
Therefore I claim that this a safe thing to do, as it leaves the
target-specific rtx cost
judgements in combine to be made only on valid recog-ed rtxes, and not
having them cancel
optimisations early due to rtx costs not handling arbitrary rtxes well.
Bootstrapped on arm, x86_64, aarch64 (all linux). Tested on arm,aarch64.
Ok for trunk?
Thanks,
Kyrill
2015-04-20 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* combine.c (distribute_and_simplify_rtx): Do not check rtx costs.
Look at the rtx codes to see if a simplification occured.
OK.
Though I do wonder if, in practice, we can identify those cases that do
simplify more directly apriori and just punt everything else rather than
this rather convoluted approach.
jeff