On 11/6/2013, 4:17 AM, Richard Biener wrote:
On Tue, Nov 5, 2013 at 4:35 PM, Vladimir Makarov <vmaka...@redhat.com> wrote:
I'd like to add a new experimental optimization to the trunk. This
optimization was discussed at the RA BoF at this summer's GNU Cauldron.
It is register pressure relief through live-range shrinkage. It
is implemented on top of the scheduler and uses the register-pressure
insn scheduling infrastructure. By rearranging insns we shorten pseudo
live ranges and increase the chance that they are assigned to a hard
register.
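To illustrate the idea only (this is a toy model, not the code in the
patch): the sketch below counts how many pseudos are live at once for
two orderings of the same insns. The insn representation, the
max_pressure helper and the pseudo names are all hypothetical, and it
assumes each pseudo is defined exactly once in the block.

```python
# Toy model of register pressure in one basic block (illustrative only;
# not GCC code). Each insn is (dest, [sources]); every pseudo is assumed
# to be defined exactly once inside the block.

def max_pressure(insns, live_out=()):
    """Return the peak number of simultaneously live pseudos."""
    # Record the index of the last insn that reads each pseudo.
    last_use = {}
    for i, (_dest, srcs) in enumerate(insns):
        for s in srcs:
            last_use[s] = i
    # Pseudos live at block exit stay live past the last insn.
    for r in live_out:
        last_use[r] = len(insns)

    live = set()
    peak = 0
    for i, (dest, srcs) in enumerate(insns):
        # A source dies at its last use, so its register can be reused
        # by the destination of the same insn.
        for s in srcs:
            if last_use[s] == i:
                live.discard(s)
        # The destination becomes live only if something reads it later.
        if last_use.get(dest, -1) > i:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


# All four loads first: a, b, c and d are live at the same time.
original = [
    ("a", []), ("b", []), ("c", []), ("d", []),
    ("t1", ["a", "b"]),
    ("t2", ["c", "d"]),
    ("r", ["t1", "t2"]),
]

# Same insns, with each add moved next to its loads: a and b die
# before c and d are born.
reordered = [
    ("a", []), ("b", []),
    ("t1", ["a", "b"]),
    ("c", []), ("d", []),
    ("t2", ["c", "d"]),
    ("r", ["t1", "t2"]),
]

print(max_pressure(original, live_out=("r",)))
print(max_pressure(reordered, live_out=("r",)))
```

In this model the reordered sequence needs one register fewer at its
peak, which is the kind of effect the optimization is after.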
The code looks pretty simple but there is a lot of work behind
this patch. I've tried about ten different versions of this code
(different heuristics for the two currently existing register-pressure
algorithms).
I think it is *up to target maintainers* to decide whether or not to
use this optimization for their targets. I'd recommend using it at
least for x86/x86-64. I think any OOO processor with a small or
moderate register file which does not use the 1st insn scheduling
might benefit from this too.
On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with
general tuning), using the optimization results in smaller code size
on average (for floating point and integer benchmarks in 32- and
64-bit mode). The improvement is more visible for SPECFP2000 (although
I see the same improvement on x86-64 SPECInt2000, it might be
attributed mostly to the instability of the mcf benchmark). It is
about 0.5% for 32-bit and 64-bit mode. It is understandable, as the
optimization has more opportunities to improve the code on longer BBs.
Unlike other heuristic optimizations, I don't see any significantly
worse performance. It gives practically the same or better performance
(a few benchmarks improved by 1% or more, up to 3%).
The single but significant drawback is additional compilation time
(4%-6%), as the 1st insn scheduling pass is quite expensive. So I'd
recommend that target maintainers switch it on only for -Ofast.
Generally I'd not recommend viewing -Ofast as -O4 but as -O3
plus generally "unsafe" optimizations. So I'd not enable it for -Ofast
but for -O3 - possibly also with -Os if indeed the main motivation is
also code-size improvements (-Os is a similar beast to -O3: spend
as much time as you can on optimizing size).
Ok. Probably my recommendation is wrong. It is actually up to target
maintainers to decide when to use the optimization, or whether to use
it at all by default (maybe they will decide to use it only for SPEC
reporting). I guess that at some point we will need something like -O4
for greedy algorithms (there is a lot of research in this area, e.g. I
am reading an article about optimal register-pressure sensitive insn
scheduling where the optimization can be constrained in time, for
example 1ms for each insn, and still produce better results than the
current heuristics). I am sure such algorithms will be coming.
Btw, thanks for working on this. How does it relate to
-fsched-pressure?
It is based on the -fsched-pressure infrastructure but has different
heuristics and goals. GCC with 1st insn scheduling, even with
-fsched-pressure, still produces worse results on mainstream x86/x86-64
processors than GCC without it. I've also tried -flive-range-shrinkage
-fschedule-insns -fsched-pressure, but just -flive-range-shrinkage is
better for x86/x86-64.
By the way, LLVM uses insn scheduling for x86/x86-64 before RA, but its
goal is only register-pressure decrease (for x86; for x86-64 it is a bit
more complicated). So with this optimization we are just catching up
with LLVM (which is unusual for us in the optimization area).
Does it treat all register classes the same?
On x86 mostly the few fixed registers for some of the integer pipeline
instructions hurt, x86_64 has enough general and FP registers?
It treats them the same (although it is different for different classes
as they have different numbers of available regs). It is always some
kind of approximation as we use register pressure classes here, not the
classes which will actually be used for RA. It is even more complicated
as IRA actually uses dynamic classes (only sets of regs which are
profitable, e.g. they can be different from the classes defined in the
target file as regs in classes are caller-saved or some specific hard
regs are used for arg passing). It makes graph coloring better for
irregular register file architectures. Overall, as I remember, dynamic
classes gave about 1% improvement even for ppc.
I should say that the presence of hard regs in RTL (e.g. for parameter
passing) is still a challenge for live-range shrinkage and
register-pressure scheduling. It should be addressed somehow.