On 02/14/2013 03:46 AM, Michael Veksler wrote:
On 02/14/2013 03:28 AM, Vladimir Makarov wrote:
On 13-02-13 6:36 PM, Michael Eager wrote:
[snip]
I thought about register pressure causing this, but I think that
should cause
spilling of one of the registers which were not used in this long
sequence,
rather than causing a large number of additional loads.
Longer living pseudos has more probability to be spilled as they
usually conflicts with more pseudos during their lives and spilling
them frees a hard reg for many conflicting pseudos. That how RA
heuristics work (in the old RA log (live range span) was used. The
bigger the value, the more probability for spilling).
Perhaps the cost analysis has a problem.
I've checked it and it looks ok to me.
RA focused on generation of faster code. Looking at the fragment
you provided it, it is hard to say
something about it. I tried -Os for gcc4.8 and it generates
desirable code for the fragment in
question (by the way the peak register pressure decreased to 66 in
this case).
It's both larger and slower, since the additional loads take much
longer. I'll take a
look at -Os.
It looks like the values of p++ are being pre-calculated and stored
on the stack. This results in
a load, rather than an increment of a register.
If it is so. It might be another optimization which moves p++
calculations higher. IRA does not do it (more correctly a new IRA
feature implemented by Bernd Schmidt in gcc4.7 can move insns
downwards in CFG graph to decrease reg pressure).
I checked all rtl passes these calcualtions are not created by any
RTL pass. So it is probably some tree-ssa optimization.
Any industrial RA uses heuristic algorithms, in some cases better
heuristics can work worse than
worse heuristics. So you should probably check is there any
progress moving from gcc4.1 to gcc4.6
with performance point of view for variety benchmarks. Introducing
IRA improves code for x86 4% on
SPEC2000. Subsequent improving (like using dynamic register
classes) made further performance
improvements.
My impression is that the performance is worse. Reports I've seen
are that the code is
substantially larger, which means more instructions.
I'm skeptical about comparisons between x86 and RISC processors.
What works well
for one may not work well for the other.
IRA improved code for many RISC processors. Although tetter RA has
smaller effect for these processors as they have more registers.
Looking at the test code, I can make some conclusions for myself:
o We need a common pass decreasing reg pressure (I already
expressed this in the past) as
optimizations become more aggressive. Some progress was made to
make few optimizations aware about
RA (reg-pressure scheduling, loop-invariant motions, and code
hoisting) but there are too many
passes and it is wrong and impossible to make them all aware of
RA. Some register pressure
decreasing heuristics are difficult to implement in RA (like insn
rearrangements or complex
rematerialization) and this pass could focus on them.
That might be useful.
o Implement RA live range splitting in regions different from loops
or BB (now IRA makes splitting
only on loop bounds and LRA in BB, the old RA had no live range
splitting at all).
Each of the blocks of code is in it's own BB. I haven't checked,
but I'd guess
that most of the registers are in use on entry and still live on
exit, so the
block has no registers to allocate.
Splitting in BB scope this case is not profitable.
I'd also recommend to try the following options concerning RA:
-fira-loop-pressure,
-fsched-pressure, -fira-algorithm=CB|priority,
-fira-region=one,all,mixed. Actually
-fira-algorithm=priority + -fira-region=one is analog of what the
old RA did.
I am reading this thread and getting more and more puzzled. The RA
stuff is very complicated,
having many constraints and many dependencies with other passes.
Taking this into
account, it seems that no heuristic algorithm can even get close to an
optimal register
allocation. A heuristic algorithm can't take all effects and
side-effects into account
simultaneously.
Considering all that, why GCC does not use generic optimization
algorithms for RA?
A generic optimization can take all issues into account,
simultaneously. I am talking
about ILP/MIP (Integer Linear Programming/Mixed IP), SAT and CSP [1]
(Constraint
Satisfaction Problem). There has been a lot of progress in these
areas, and the
solvers are much faster (orders of magnitude) than they were 10 years
ago.
Actually, I tried ILP solver first time in 2003. I am returning to this
about every 3 years and every time I fail to produce something useful
(at least an acceptable in some cases solution which makes compiler only
ten times slower). I've tried a lot of models. From more sophisticated
ones to simpler ones, pure solvers or some hybrids of heuristics and ILP
solutions. All solutions were too slow or generated not better code.
Last time, I had hope to use massive parallelism of GPUs for this. After
investing of much my time to learning simplex algorithm theory (a base
for ILP solvers), I found that simple tableu-based algorithm is
parallelized very well but for RA tasks even using 1000 GPU elements is
still behind revised simplex algorithm which is not parallelized. GPUs
are not good for very sparse matrix problems which is the case of RA
problems.
Still some important RA problems (like rematerialization) is very hard
to describe by ILP.
I don't see any progress, ILP solver may be ten times faster but they
still have exponential complexity. I don't know any industrial compiler
which uses exact solver for RA (RA in GCC is even more complicated as it
requires code selection too).
So why ILP/SAT/CSP aren't used in RA? I don't think they will work
much slower than
what RA does today ( they will be somewhat slower but not much). I do
believe that
there is a good chance to get much better results from RA with these
technologies.
You know, some LLVM guys had the same thoughts and as remember they had
PBQP (I tried this too) based RA (they had too many different RAs) and
now they have finished with just a good heursitics RA which works best.
I'd like to invest some time into a feasibility check, if someone is
willing to work
with me on modeling the RA problem (or at least several examples, such
as the above).
Good luck with that. I am serious.