Last fall I produced the RABLE document which described the approach I
thought should be taken to write a new register allocator for GCC.

A new register allocator written from scratch is a very long term
project (measured in years), and there is no guarantee after all that
work that we'd end up with something which is remarkably better. One
would hope that it is a lot more maintainable, but the generated code is
a crapshot. It will surely look better but will it really run faster?
The current plate of spaghetti we call the register allocator has had a
lot of fine tuning go into it over the years, and it generally generates
pretty darn good code IF it doesn't have to spill much, which is much of
the time.

This describes my current work-in-progress, RABLET, which stands for
RABLE-Themes, and conveniently implies something smaller. Rather than
write a new allocator, I think there are things we can do that are a lot
less work which will reap many of the benefits. This works from the
premise that we generate good code when we don't have to spill, so try
to detect early that we are going to spill and do something about it at
a point where we have good analysis.

THEMES
------

1 - One of the core themes in RABLE was very early selection of
instructions from patterns.  RTL patterns are initially chosen by the
EXPAND pass. EXPAND tends to generates better rtl patterns by being
handed complex trees which it can process and get better combinations.

  When TREE-SSA was first implemented, we got very poor RTL because
expand was seeing very small trees.  TER (Temporary Expression
Replacement) was invented, which mashed any single-def/single-use
ssa_names together into more complex trees. This gave expand a better
chance of selecting better instructions, and made a huge difference.

2 - Rematerialization/register pressure reduction was another major
component.  The tree-ssa optimizers pay no attention to potential
register pressure (which is as it should be). This sometimes results in
very high register pressure entering the back end.  RABLE defined passes
performing expression rematerialization and other register pressure
reducing techniques to reduce excessive register pressure before
actually trying to do allocation.  If you have 120 register live at some
point, and only 16 physical registers, the allocator is clearly going to
have a difficult time. Try to present it with something more manageable.

RABLET CORE
-----------

On to the meat of the subject. RABLET involves reworking the out-of-ssa
pass, rewriting expand such that it is tightly integrated with the
out-of-ssa pass, and adding some new work to reduce register pressure.
Thats a lot less work than a whole new register allocator.

If we look at the RABLE architecture, the passes before global
allocation are as follows:

Live-range-disjointer
instruction-selection
Register coalescing
register pressure reduction
<.. then regular register allocation activity ...>

In the context of RABLET:

-Out of SSA naturally performs live range disjointing.
-Initial RTL pattern selection is performed by expand.
-Register coalescing -  ssa_name coalescing is done by out-of-ssa, and
ssa_name == register.
-Register pressure reduction – This is new work which would be
implemented.

If we create a new black box super pass called
out-of-ssa-expand-register-pressure-reducer,  (ewww,  lets just refer to
it as ssa-to-rtl :-),  then we'd have something close to the first part
of RABLE. Not exactly of course, because 'instruction selection' means
something slightly different.  In RABLE it means choosing an instruction
alternative from an RTL pattern , in RABLET it simple means "choose good
RTL patterns".

“But wait...” I hear some clever person saying... “A lot of things
happen between ssa-to-rtl and global register allocation.”.  Hold that
thought until I get through describing ssa-to-rtl since there are some
important considerations which affect that discussion

SSA-TO-RTL
----------

When out-of-ssa was originally written, tree-ssa was still evolving. We
didnt know exactly what was going to be expected of it, so it was
written to be flexible and had a lot of gorp added after the fact.
TREE-SSA is now mature and we understand much more fully what is
expected of translation out of ssa. I have begun rewriting it to be
smaller and faster and to eliminate all the unused features which were
initially provided.  It is also being rewritten with RABLET in mind.  

The new generation out-of-ssa will eliminate PHIs, much as it does
today. This is done by coalescing together ssa_names connected by copies
(and PHIs are just copies), and issuing copies on edges required by the
PHIs.  Instead of  then mapping these back to VAR_DECLs and writing this
all back to the trees, it will simply be maintained in a partition list
map.  This is *key*. At this point, we have a mapping of what ssa_names
have been coalesced together, and the copies required to perform this.
The trees themselves are unaltered, which means we are STILL IN SSA
FORM.

OK, why is that so important. That means we still have all the SSA live
range and data flow information which we can use to generate RTL insns.
Today,  expand produces better instructions selections by seeing larger
hunks of trees at one. If expand is rewritten to be ssa-aware, then it
can paw back through the IL to see what stmts and instructions feed this
stmt. It can look at as large a hunk  of the program as it sees fit.
When it actually generates RTL, it will use the partition map to
determine what ssa_names variables are actually substituted into the RTL
stream.  

“Interesting, but why is that so grand?”.  A couple of reasons.  One is
that RTL doesn't have to be generated linearly in complete absence of
flow information.  We can generate code for a stmt, attach it to the
stmt, and at a later stmt, determine that it would be better to combine
it some other way and remove/modify the RTL that was first generated. It
can also look forward to see how this stmt is going to be used, and how
often in making decisions.  Another nice feature is that it still has
flow/use info for a variable. So if 'A' requires a load from a base
register, we can tell if that base register has already been loaded, and
where.  If register pressure is high, it can force a new load of the
base register thus splitting the live range.  Which then brings home the
point that we can look at the rtl being generated, and overlay what we
know about the SSA_NAMES over the register being generated. Its not
truly RTL-SSA, but it gives us some information to work with as
addressing modes are exposed.

Furthermore, we can calculate more accurate register pressure at this
point, which can help to determine the code we generate. For instance,
local variables are all turned into registers today. If the register
pressure is high, we could elect instead to load the local from its
stack location instead of leaving it in a register all the time. This
would prevent the register allocator fro having to perform this spill.
Instead, we get a load from the home address.

That's a flavor of what combining out-of-ssa and expand can do. I'm not
terribly familiar with expand (yet), so there will be other things come
up Im sure.  As well as issues. This is just the initial plan.

REGISTER PRESSURE REDUCTION
---------------------------

On to the register pressure reduction component. As mentioned, some
small level of register pressure reduction can be done during the rtl
expansion phase.  We can do some higher level work just before expansion
to rtl. Once we have the ssa_name mappings, we can do rough register
pressure calculations. Integer ssa_names are 1 int register, double ints
are two, etc. This would not take into account any extra registers
exposed by addressing modes and expansion, but it can give you a rough
idea if pressure is excessive or not. 25 lives values and 6 hardware
registers could use some register pressure reduction!

So register pressure calculations atr implemented, and then the same
techniques RABLE would have used, rematerialization, live range
splitting, pre-spilling, etc., ON SSA FORM. This makes the
implementation much easier since it becomes simply another ssa pass,
with some minor caveats.  There are some minor issues to resolve, but in
general, any of these types of operations will result in new local
temporary registers which don't coalesce with any other partition, and
results in exactly what RABLE would have done. The bonus is that it is
performed *before* RTL expansion , so we should get better instruction
selections.


And that forms the new ssa-to-rtl pass.  A side effect of this is that I
am suddenly very interested in the work done earlier (Honza was it?) to
make opt0 go into ssa and do a few basic optimizations before generating
rtl.  If expand is going to be swallowed by ssa-to-rtl, we wont want to
keep the old expand around for non-optimized code.
 
RTL OPTIMIZATIONS
-----------------

Back to that earlier observation, “But wait, A lot of things happen
between ssa-to-rtl and global register allocation.”

If we substitute ssa-to-rtl into the RABLE architecture flow, it looks
something like:

ssa-to-rtl
spill cost analysis
global allocation
spiller
spill location optimizer
instruction rewriter.

Since RABLET doesn't propose writing RABLE as such, we'll just have to
substitute in the current register allocator passes for the RABLE ones.

What about all the stuff in between? CSE, GCSE, scheduling, and all
that?

If expand is made much smarter, I would argue that much of GCSE and CSE
isn't needed.  We've already performed those optimizations at  a high
level, and we can hopefully do a lot of the factoring and things on
addressing registers exposed during expand.  I'm sure there are other
things to do, but I would argue that they are significantly less than a
"general purpose" CSE and GCSE pass. And in the cases of high register
pressure, how much would you want them to do anyway?  Its really these
high register pressure areas that RABLET is attacking anyway.

If I recall, scheduling is register pressure aware and normally doesn't
increase register pressure dramatically. If it does increase pressure,
well, this won't solve every problem after all.

I don't see the need for anything between RTl generation and
scheduling/register allocation that would undo all the work done in
ssa-to-rtl.  We may still have need for some RTL optimizations, sure,
but their impact on register pressure is likely to be minimal, and
should be so. Those that can increase pressure, like CSE and GCSE either
arent needed, or can be handcuffed. (ie, don't work on registers which
have been handled by ssa already, only look at registers that are not
flagged as   'optimized' by the ssa-to-rtl phase, or dont run on areas
where register pressure is above a threshold value, etc.).  There are
multiple options which can be considered.

Based on that, I would argue that all those things between ssa-to-rtl
and the register allocator are irrelevant as far as undoing the register
pressure reductions done in RABLET.  This makes the point of view on RTL
optimizations to be more like “do good things only to registers which
earlier optimizations couldn't see”, noting that ssa-to-rtl does more
optimizations to rtl registers than expand did by itself.

SUMMARY
-------

Thats the project in a nutshell.  There are quite a few more details of
course, but that's the gist. Its not a perfect implementation of RABLE
of course, but it takes the major themes, and with a small shoe-horn,
you can fit the description of RABLET into the architectural diagram of
RABLE. The hope is that although we certainly wont see 100% of the
benefit of RABLE, we will see enough benefit to justify this work and
perhaps make RABLE unnecessary.  If work on RABLE begins (or some other
new allocator arrives...), it will still be presented with something
better to work with than we have today, making its job easier.  Well, at
least making the job of allocating registers easier, although it might
be a more difficult job to get performance gains :-P. 

Unlike RABLE, I am currently working on RABLET, with the out-of-ssa
rewrite underway.  The new out-of-ssa will be ready for GCC 4.3. It
should be faster and not suffer from slowdowns in some of the
pathological cases we've found with the current implementation. 

Once that is complete, the next step is to do some of the higher level
register pressure reduction work in SSA.  Combined with TER as it exists
today (with perhaps a few modifications), I would expect to see some
benefit in the generated code simply due to RA not having as much
pressure to deal with. There is some possibility this will make it into
4.3 as well, but that's not a sure thing. There is also the impact
MEM-SSA will have on this pass, which is not clear at this point.

Then its time for the integration of expand.  I really don't know how
long that would take, but I am pretty sure that's not in the 4.3 time
frame! :-)  When tree-ssa was first implemented, a rewrite of expand was
on the table as eventually being needed (as well as a new allocator).
TER was  stopgap solution until it could be tackled, and I doubt anyone
will be sorry to see it go! :-)

I'd be more than happy to respond to any useful feedback anyone might
have. As I said, I didn't delve deeply into some of the details,  and I
can think of half a dozen obvious questions/concerns people might have.
With a much shorter time frame, I think there will be enough benefit
from this project to reduce the urgency of a new register allocator, at
least from a performance point of view.  As always with compilers, you
can never be 100% sure :-)

I am not actually attending the conference in Ottawa next week, but I
will be in Ottawa late Tuesday afternoon through Wednesday night if
anyone wants to track me down, (or arrange to meet up with me), to
discuss any of this. (Or even RABLE for that matter).

Andrew


Reply via email to