Last fall I produced the RABLE document which described the approach I thought should be taken to write a new register allocator for GCC.
A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshot. It will surely look better but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time. This describes my current work-in-progress, RABLET, which stands for RABLE-Themes, and conveniently implies something smaller. Rather than write a new allocator, I think there are things we can do that are a lot less work which will reap many of the benefits. This works from the premise that we generate good code when we don't have to spill, so try to detect early that we are going to spill and do something about it at a point where we have good analysis. THEMES ------ 1 - One of the core themes in RABLE was very early selection of instructions from patterns. RTL patterns are initially chosen by the EXPAND pass. EXPAND tends to generates better rtl patterns by being handed complex trees which it can process and get better combinations. When TREE-SSA was first implemented, we got very poor RTL because expand was seeing very small trees. TER (Temporary Expression Replacement) was invented, which mashed any single-def/single-use ssa_names together into more complex trees. This gave expand a better chance of selecting better instructions, and made a huge difference. 2 - Rematerialization/register pressure reduction was another major component. The tree-ssa optimizers pay no attention to potential register pressure (which is as it should be). This sometimes results in very high register pressure entering the back end. RABLE defined passes performing expression rematerialization and other register pressure reducing techniques to reduce excessive register pressure before actually trying to do allocation. If you have 120 register live at some point, and only 16 physical registers, the allocator is clearly going to have a difficult time. Try to present it with something more manageable. RABLET CORE ----------- On to the meat of the subject. RABLET involves reworking the out-of-ssa pass, rewriting expand such that it is tightly integrated with the out-of-ssa pass, and adding some new work to reduce register pressure. Thats a lot less work than a whole new register allocator. If we look at the RABLE architecture, the passes before global allocation are as follows: Live-range-disjointer instruction-selection Register coalescing register pressure reduction <.. then regular register allocation activity ...> In the context of RABLET: -Out of SSA naturally performs live range disjointing. -Initial RTL pattern selection is performed by expand. -Register coalescing - ssa_name coalescing is done by out-of-ssa, and ssa_name == register. -Register pressure reduction – This is new work which would be implemented. If we create a new black box super pass called out-of-ssa-expand-register-pressure-reducer, (ewww, lets just refer to it as ssa-to-rtl :-), then we'd have something close to the first part of RABLE. Not exactly of course, because 'instruction selection' means something slightly different. In RABLE it means choosing an instruction alternative from an RTL pattern , in RABLET it simple means "choose good RTL patterns". “But wait...” I hear some clever person saying... “A lot of things happen between ssa-to-rtl and global register allocation.”. Hold that thought until I get through describing ssa-to-rtl since there are some important considerations which affect that discussion SSA-TO-RTL ---------- When out-of-ssa was originally written, tree-ssa was still evolving. We didnt know exactly what was going to be expected of it, so it was written to be flexible and had a lot of gorp added after the fact. TREE-SSA is now mature and we understand much more fully what is expected of translation out of ssa. I have begun rewriting it to be smaller and faster and to eliminate all the unused features which were initially provided. It is also being rewritten with RABLET in mind. The new generation out-of-ssa will eliminate PHIs, much as it does today. This is done by coalescing together ssa_names connected by copies (and PHIs are just copies), and issuing copies on edges required by the PHIs. Instead of then mapping these back to VAR_DECLs and writing this all back to the trees, it will simply be maintained in a partition list map. This is *key*. At this point, we have a mapping of what ssa_names have been coalesced together, and the copies required to perform this. The trees themselves are unaltered, which means we are STILL IN SSA FORM. OK, why is that so important. That means we still have all the SSA live range and data flow information which we can use to generate RTL insns. Today, expand produces better instructions selections by seeing larger hunks of trees at one. If expand is rewritten to be ssa-aware, then it can paw back through the IL to see what stmts and instructions feed this stmt. It can look at as large a hunk of the program as it sees fit. When it actually generates RTL, it will use the partition map to determine what ssa_names variables are actually substituted into the RTL stream. “Interesting, but why is that so grand?”. A couple of reasons. One is that RTL doesn't have to be generated linearly in complete absence of flow information. We can generate code for a stmt, attach it to the stmt, and at a later stmt, determine that it would be better to combine it some other way and remove/modify the RTL that was first generated. It can also look forward to see how this stmt is going to be used, and how often in making decisions. Another nice feature is that it still has flow/use info for a variable. So if 'A' requires a load from a base register, we can tell if that base register has already been loaded, and where. If register pressure is high, it can force a new load of the base register thus splitting the live range. Which then brings home the point that we can look at the rtl being generated, and overlay what we know about the SSA_NAMES over the register being generated. Its not truly RTL-SSA, but it gives us some information to work with as addressing modes are exposed. Furthermore, we can calculate more accurate register pressure at this point, which can help to determine the code we generate. For instance, local variables are all turned into registers today. If the register pressure is high, we could elect instead to load the local from its stack location instead of leaving it in a register all the time. This would prevent the register allocator fro having to perform this spill. Instead, we get a load from the home address. That's a flavor of what combining out-of-ssa and expand can do. I'm not terribly familiar with expand (yet), so there will be other things come up Im sure. As well as issues. This is just the initial plan. REGISTER PRESSURE REDUCTION --------------------------- On to the register pressure reduction component. As mentioned, some small level of register pressure reduction can be done during the rtl expansion phase. We can do some higher level work just before expansion to rtl. Once we have the ssa_name mappings, we can do rough register pressure calculations. Integer ssa_names are 1 int register, double ints are two, etc. This would not take into account any extra registers exposed by addressing modes and expansion, but it can give you a rough idea if pressure is excessive or not. 25 lives values and 6 hardware registers could use some register pressure reduction! So register pressure calculations atr implemented, and then the same techniques RABLE would have used, rematerialization, live range splitting, pre-spilling, etc., ON SSA FORM. This makes the implementation much easier since it becomes simply another ssa pass, with some minor caveats. There are some minor issues to resolve, but in general, any of these types of operations will result in new local temporary registers which don't coalesce with any other partition, and results in exactly what RABLE would have done. The bonus is that it is performed *before* RTL expansion , so we should get better instruction selections. And that forms the new ssa-to-rtl pass. A side effect of this is that I am suddenly very interested in the work done earlier (Honza was it?) to make opt0 go into ssa and do a few basic optimizations before generating rtl. If expand is going to be swallowed by ssa-to-rtl, we wont want to keep the old expand around for non-optimized code. RTL OPTIMIZATIONS ----------------- Back to that earlier observation, “But wait, A lot of things happen between ssa-to-rtl and global register allocation.” If we substitute ssa-to-rtl into the RABLE architecture flow, it looks something like: ssa-to-rtl spill cost analysis global allocation spiller spill location optimizer instruction rewriter. Since RABLET doesn't propose writing RABLE as such, we'll just have to substitute in the current register allocator passes for the RABLE ones. What about all the stuff in between? CSE, GCSE, scheduling, and all that? If expand is made much smarter, I would argue that much of GCSE and CSE isn't needed. We've already performed those optimizations at a high level, and we can hopefully do a lot of the factoring and things on addressing registers exposed during expand. I'm sure there are other things to do, but I would argue that they are significantly less than a "general purpose" CSE and GCSE pass. And in the cases of high register pressure, how much would you want them to do anyway? Its really these high register pressure areas that RABLET is attacking anyway. If I recall, scheduling is register pressure aware and normally doesn't increase register pressure dramatically. If it does increase pressure, well, this won't solve every problem after all. I don't see the need for anything between RTl generation and scheduling/register allocation that would undo all the work done in ssa-to-rtl. We may still have need for some RTL optimizations, sure, but their impact on register pressure is likely to be minimal, and should be so. Those that can increase pressure, like CSE and GCSE either arent needed, or can be handcuffed. (ie, don't work on registers which have been handled by ssa already, only look at registers that are not flagged as 'optimized' by the ssa-to-rtl phase, or dont run on areas where register pressure is above a threshold value, etc.). There are multiple options which can be considered. Based on that, I would argue that all those things between ssa-to-rtl and the register allocator are irrelevant as far as undoing the register pressure reductions done in RABLET. This makes the point of view on RTL optimizations to be more like “do good things only to registers which earlier optimizations couldn't see”, noting that ssa-to-rtl does more optimizations to rtl registers than expand did by itself. SUMMARY ------- Thats the project in a nutshell. There are quite a few more details of course, but that's the gist. Its not a perfect implementation of RABLE of course, but it takes the major themes, and with a small shoe-horn, you can fit the description of RABLET into the architectural diagram of RABLE. The hope is that although we certainly wont see 100% of the benefit of RABLE, we will see enough benefit to justify this work and perhaps make RABLE unnecessary. If work on RABLE begins (or some other new allocator arrives...), it will still be presented with something better to work with than we have today, making its job easier. Well, at least making the job of allocating registers easier, although it might be a more difficult job to get performance gains :-P. Unlike RABLE, I am currently working on RABLET, with the out-of-ssa rewrite underway. The new out-of-ssa will be ready for GCC 4.3. It should be faster and not suffer from slowdowns in some of the pathological cases we've found with the current implementation. Once that is complete, the next step is to do some of the higher level register pressure reduction work in SSA. Combined with TER as it exists today (with perhaps a few modifications), I would expect to see some benefit in the generated code simply due to RA not having as much pressure to deal with. There is some possibility this will make it into 4.3 as well, but that's not a sure thing. There is also the impact MEM-SSA will have on this pass, which is not clear at this point. Then its time for the integration of expand. I really don't know how long that would take, but I am pretty sure that's not in the 4.3 time frame! :-) When tree-ssa was first implemented, a rewrite of expand was on the table as eventually being needed (as well as a new allocator). TER was stopgap solution until it could be tackled, and I doubt anyone will be sorry to see it go! :-) I'd be more than happy to respond to any useful feedback anyone might have. As I said, I didn't delve deeply into some of the details, and I can think of half a dozen obvious questions/concerns people might have. With a much shorter time frame, I think there will be enough benefit from this project to reduce the urgency of a new register allocator, at least from a performance point of view. As always with compilers, you can never be 100% sure :-) I am not actually attending the conference in Ottawa next week, but I will be in Ottawa late Tuesday afternoon through Wednesday night if anyone wants to track me down, (or arrange to meet up with me), to discuss any of this. (Or even RABLE for that matter). Andrew