Re: Project RABLET
On Fri, Jun 23, 2006 at 03:23:04PM -0400, Andrew MacLeod wrote: A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshot. It will surely look better but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time. One area where the current register allocator is really lousy is when faced with pseudos spanning multiple registers. A recent example: URL:http://gcc.gnu.org/ml/gcc-help/2006-04/msg00064.html. And 16-bit targets suffer from this much of the time. I can only imagine that the AVR camp must be tearing their hair out in frustration. The register allocater really needs to be able to allocate subregs independently. -- Rask Ingemann Lambertsen
Re: Project RABLET
On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote: Andrew MacLeod wrote: Having no information about the final register allocator decision, the partial register pressure reducing through rematerialization is not working in many cases. For example, making rematerialization of a - b + c when you reduce the pressure from 100 to 50 for x86 there is a big chance that b and c will be not placed in hard registers. Instead of one load (of a), two loads (b and c) will be needed. This result code is even worse than before reducing pressure. Having implemented a complex rematerialization pass before, I understand it well enough to know that replacing 'a' with 'b + c' is the wrong thing to do, unless at least one of b or c is used again right next to the use of 'a'. That can actually increase register pressure, or have nothing more than a neutral effect. Its *way* more complicated than simply moving expressions downward. Its tracking all the things used in expressions and determining that at a given location, it *is* worthwhile to do it because the correct values are already trivially available, or can alternatively be recomputed in a worthwhile way. Often, there are only a few worthwhile remats out of all the possible ones. In general, I have never found the primary benefit of rematerialization to be register pressure reduction, but rather one of avoiding placing stores to spill in the instruction stream. When there is a lot of spilling, those stores can really bog down a pipeline. So rematerialization in out-of-ssa pass will work well only for full pressure relief (to the level equal to the number of hard registers) or close to the full relief. That's a pretty broad assertion to make, and I disagree with it :-) Especially when you only talk about a single component of the whole. RABLET is a group of things which tend to enable each other. Any one of them by themselves would actually accomplish less. Some of the required components have a benefit of some sort unrelated to the others (such as faster out of ssa translation). Others require interaction with the whole to see any benefit. Remember that one of the components is a new integrated expand... RABLET will have an understanding of the instructions that can be generated, and will try to make use of those when making decisions. An additional benefit likely to be seen from this is that code tends to utilize more of the variable names the user will recognize, rather than something like 'SPILL_678'. And maybe, just maybe, it'll be easier to get the debug info correct (TER can muck it up pretty badly :-) The SSA pressure relief through rematerialization described in Simpson's theses is oriented for such architectures (with a big regular register file size of 32 as I remember). So it can work for ppc but it will be less successful for major interest platforms x86 and x86_64. I haven't read the thesis, but I would be surprised if it describes what I am planning to do. Without the integration with expand to do instruction selection, a few tweaks in some RTL optimizations, and a very specific gcc-oriented union of all these components, what I am planning to do would be completely worthless. Remat is really a small part of it. What benefit I will see? Well, time will tell :-) Andrew
Re: Project RABLET
On 6/26/06, Andrew MacLeod [EMAIL PROTECTED] wrote: On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote: Andrew MacLeod wrote: Having no information about the final register allocator decision, the partial register pressure reducing through rematerialization is not working in many cases. For example, making rematerialization of a - b + c when you reduce the pressure from 100 to 50 for x86 there is a big chance that b and c will be not placed in hard registers. Instead of one load (of a), two loads (b and c) will be needed. This result code is even worse than before reducing pressure. Having implemented a complex rematerialization pass before, I understand it well enough to know that replacing 'a' with 'b + c' is the wrong thing to do, unless at least one of b or c is used again right next to the use of 'a'. That can actually increase register pressure, or have nothing more than a neutral effect. Its *way* more complicated than simply moving expressions downward. Its tracking all the things used in expressions and determining that at a given location, it *is* worthwhile to do it because the correct values are already trivially available, or can alternatively be recomputed in a worthwhile way. Often, there are only a few worthwhile remats out of all the possible ones. In general, I have never found the primary benefit of rematerialization to be register pressure reduction, but rather one of avoiding placing stores to spill in the instruction stream. When there is a lot of spilling, those stores can really bog down a pipeline. So rematerialization in out-of-ssa pass will work well only for full pressure relief (to the level equal to the number of hard registers) or close to the full relief. That's a pretty broad assertion to make, and I disagree with it :-) Especially when you only talk about a single component of the whole. RABLET is a group of things which tend to enable each other. Any one of them by themselves would actually accomplish less. Some of the required components have a benefit of some sort unrelated to the others (such as faster out of ssa translation). Others require interaction with the whole to see any benefit. Remember that one of the components is a new integrated expand... RABLET will have an understanding of the instructions that can be generated, and will try to make use of those when making decisions. I think the overall idea of RABLET is quite sound. Especially if integrating expand with out-of-ssa this might provide early instruction selection (now, that would possibly require moving all of RTL loop optimizations before this point). Of course re-doing expand is the hard part here... Richard.
Re: Project RABLET
Andrew MacLeod wrote: On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote: Andrew MacLeod wrote: The SSA pressure relief through rematerialization described in Simpson's theses is oriented for such architectures (with a big regular register file size of 32 as I remember). So it can work for ppc but it will be less successful for major interest platforms x86 and x86_64. I haven't read the thesis, but I would be surprised if it describes what I am planning to do. Without the integration with expand to do instruction selection, a few tweaks in some RTL optimizations, and a very specific gcc-oriented union of all these components, what I am planning to do would be completely worthless. Remat is really a small part of it. The thesis contains a small part about different heuristics of reducing register pressure on SSA through rematerializaion (what and where to rematerialize). I see your plan is much bigger, not only reducing through rematerialization but other things including better RTL expansion and, more imprortant, doing it in an integrated way. In any case, it looks as an interesting project. Good luck with that. What benefit I will see? Well, time will tell :-)
Re: Project RABLET
On 6/24/06, Andrew MacLeod [EMAIL PROTECTED] wrote: On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote: You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. Which brings up a good discussion. I presume the rtl loop optimizers see things exposed by addressing modes which aren't seen in the higher level code. I wonder what the big gains are here... Knowning which address computations are loop invariant. Knowing the number of instructions (sadly not the exact size because instructions haven't been selected) to determine whether it is worth unrolling/peeling/unswitching a loop. Finding loops that can use a doloop pattern. and if they are detectable at expansion time... For most of them, I don't think so. In general, I didnt mention anything that tends not to increase register pressure, at least not in any significant manner as far as RABLET is concerned. So do you have hard data showing that CSE increases register pressure? Given the thinks CSE does, it would probably be much more useful, then, to make it possible to have liveness information in CSE so that it can take register pressure into account in its cost considerations ;-) No magic new expand is going to make CSE obsolete, and it simply does too much to just throw it out. (FWIW I'm still working on simplifying cse.c...) Clearly there will be a lot of further investigation required once implementation reaches this point. Ultimately CSE and all RTL optimizations can be re-evaluated to see if things can be simplified. *laughs* Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns out we lose without it. Good luck to you, but I think you're seriously underestimating the complexity of things here. Modulo the above comments, I don't see anything wrong with your basic idea. But I also wonder whether you couldn't get a similar effect by forcing instruction selection to occur before register allocation. If that is done well, reload will have much less work to do. Hurray. This is what new-ra did. It was probably the only thing there that worked well, but it was a great idea. (Sadly it was just reload rewritten so pre-reload.c was ugly, but the idea was good). Its clearly not as good as a new register allocator would be, but the effort to benefit ratio ought to be a lot higher for RABLET than for a register allocator rewrite. There is a register allocator rewrite under way, from one of your co-workers even. Is there any relation between Vlad's project and yours, or are you going different ways with the same goal in mind? :-D Gr. Steven
Re: Project RABLET
Steven Bosscher wrote: Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns out we lose without it. Good luck to you, but I think you're seriously underestimating the complexity of things here. Its clearly not as good as a new register allocator would be, but the effort to benefit ratio ought to be a lot higher for RABLET than for a register allocator rewrite. There is a register allocator rewrite under way, from one of your co-workers even. Is there any relation between Vlad's project and yours, or are you going different ways with the same goal in mind? :-D Working on register allocation issues last three years and looking at the new-ra project I can say that any project in this area has a big chance to fail despite how good design looked at the first glance. The problem is in complexity of RTL and lot of ports with very specific issues which are described by gazillion of macros. Redesign and simplification of RTL could solve a lot of problems like code selection, register allocation etc (although might create others). But this task is much bigger than introducing tree-SSA because it means rewriting all machine description files and practically equal to redoing all ports. Do we have resources for this? I don't think so. So saying this, my point of view that the more projects we have in this area, the better chance we will have to solve the problem. Therefore I really appreciate what Andrew and Bernd Schmidt do. It might look as a waste of resources but we can not people force not to do what they believe and want to do (e.g. we can not force Bernd not to improve reload because he can work on a new register allocator. He improves reload because he knows it best than others). As for Andrew's proposal, my opinion is that all this transformations are done too early and we need them to do again on rtl sometime. o coalescing. CSE can create more moves but more important thing is the extended coalescing can not be done here (or I don't know how it can be done here). It is about moves generated because of two-address architecture constraints (regmove and global tries to solve this problem in ad hoc way e.g. through hard register preference by global). It should be part of coalescing pass, because removing a move can prevent removing a higher priority move generated by reload because of the two address constraints. o register pressure relief through live range splitting and/or rematerialization. We have no accurate information here, because after that there are passes which change the pressure like insn scheduling and CSE. Although insn scheduling has heuristic not to increase register pressure, it has very small priority (third or fourth). Therefore insn scheduling can increase the pressure a lot (but sometimes decrease it too). Insn scheduler with register renaming being implemented by ISP RAS might solve this problem, if it works only after the register allocator. But this insn scheduler can work before the register allocator too and only its usage will show will it work only after the register allocator or in traditional way (before and the after the register allocator). Even without changing the register pressure by subsequent passes, there is another problem which is difficulty to calculate the register pressure excess. We don't know what register class will be used for a pseudo-register (e.g. AREG or GENERAL_REGS for x86 which creates difference 6 in the register pressure). Although reducing register pressure from 100 to 6 will be very helpful, my experience shows that the most frequent and interesting cases are on the border. o register renaming is already done and effectively (because it uses the data flow analysis framework) by -fweb. But I think it can be done more effectively by out-of-ssa pass. Actually what Andrew proposes (and more) I did two years ago on RTL level close to the register allocator (see gcc summit article fighting register pressure in gcc). The result was not satisfactory for me and I moved on rewriting the register allocator. Probably, I should have committed more what I've done into the mainline. Probably what Andrew proposes can be done faster on tree-SSA although doing it on RTL we would have more accurate information. In any case it will improve code in some cases and can be used as a temporary solution (until new register allocator projects will be done or forever if they failed). Andrew's proposal has a sense too with the code reuse point of view if he wants to move on with RABLE project. As for my project YARA, I don't know when it will be ready for the mainline (at least one more year) because it includes removing reload (the biggest and most complicated part of the RA). It works now only for x86 and x86_64, generates better code for SPECint2000 and SPECFp2000 (at least for pentium4, nocona and coming woodcrest. I have no free AMD machine to make benchmarking). I've just started
Re: Project RABLET
On Sat, 2006-06-24 at 13:04 +0200, Steven Bosscher wrote: On 6/24/06, Andrew MacLeod [EMAIL PROTECTED] wrote: On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote: = In general, I didnt mention anything that tends not to increase register pressure, at least not in any significant manner as far as RABLET is concerned. So do you have hard data showing that CSE increases register pressure? Given the thinks CSE does, it would probably be much more useful, then, to make it possible to have liveness information in CSE so that it can take register pressure into account in its cost considerations ;-) No magic new expand is going to make CSE obsolete, and it simply does too much to just throw it out. (FWIW I'm still working on simplifying cse.c...) no, no hard data, its just the kind of activity which can undo what RABLET proposes doing. Ie, an expression which I go to the effort to rematerializing in 3 places is likely to be commoned back to the original live range without stepping in and telling CSE not to do that. What I really had in mind was marking these register values/expression/whatever with a flag such that when CSE or GCSE or whoever asks, they get returned as not-to-be-looked-at. That way they don't undo the opportunities which RABLET has so kindly exposed to these optimizations :-) Clearly there will be a lot of further investigation required once implementation reaches this point. Ultimately CSE and all RTL optimizations can be re-evaluated to see if things can be simplified. *laughs* Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns out we lose without it. Good luck to you, but I think you're seriously underestimating the complexity of things here. I was not really looking to rewrite the passes so much as tell them not to work on certain registers. This could potentially be extended to ssa_names which the tree optimizers have processed and which get expanded into single RTL registers. A new expand would know that the translation into RTL was simple enough that nothing new has really been exposed to the RTL optimizers. That was my thought anyway. Then CSE et al work on whatever is left. Perhaps there are then some hunks that can be remoived as redundant, or maybe not. Its clearly not as good as a new register allocator would be, but the effort to benefit ratio ought to be a lot higher for RABLET than for a register allocator rewrite. There is a register allocator rewrite under way, from one of your co-workers even. Is there any relation between Vlad's project and yours, or are you going different ways with the same goal in mind? :-D Totally different scale of project with completely different goal. Had I set out and actually started writing RABLE, that would be going in different directions with the same goal. In theory, RABLET should make the job of any register allocator a bit easier. These days, I think any register allocator's goal ought to be to assign registers and be the final authority. No reload undoing any of the work. That is well beyond the scope of what Im doing. It is within the scope of what Vlad is doing. Andrew
Re: Project RABLET
On Sat, 2006-06-24 at 12:36 -0400, Vladimir N. Makarov wrote: Steven Bosscher wrote: As for Andrew's proposal, my opinion is that all this transformations are done too early and we need them to do again on rtl sometime. o coalescing. CSE can create more moves but more important thing is RABLET will do nothing different than is done today.. out of ssa coalesces ssa_names out the wazoo. In the interest of register pressure reduction, it may actually coalesce less to split up live ranges, and leave loads/stores from/to the stack. o register pressure relief through live range splitting and/or rematerialization. We have no accurate information here, because after that there are passes which change the pressure like insn Sure, Im not suggesting that RABLET will reduce the register pressure to something that isn't going to spill. Far from it. I am saying that RABLET can reduce something completely unmanageable to something more manageable. instead of handing the RTL passes a basic block that contains a peak register pressure of 120 when there are 16 hardware registers, perhaps it will be a basic block that has been reduced down to a peak of 25 or something. The calculations at the tree level are only going to be rough, enough to use as a guideline like that. If RA doesnt have to spill its guts, it has a chance to do a better job I think. scheduling and CSE. Although insn scheduling has heuristic not to increase register pressure, it has very small priority (third or fourth). Therefore insn scheduling can increase the pressure a lot sure, but it wont increase it from 25 back up to 140, so there should still be benefit. Actually what Andrew proposes (and more) I did two years ago on RTL level close to the register allocator (see gcc summit article fighting register pressure in gcc). The result was not satisfactory for me and I moved on rewriting the register allocator. Probably, I should have committed more what I've done into the mainline. I think its hard to do what I am going to do at the RTL level. I have all the information from tree-ssa available to make quite a few interesting decisions. and with a rewritten expand, decisions about whether things are regisrer or memory based can be made more fine grain. It just seems like the last good place to do some of this work. ANd perhaps we can get better instructions selected by seeing more that expand currently sees. Probably what Andrew proposes can be done faster on tree-SSA although doing it on RTL we would have more accurate information. In any case it will improve code in some cases and can be used as a temporary solution (until new register allocator projects will be done or forever if they failed). Andrew's proposal has a sense too with the code reuse point of view if he wants to move on with RABLE project. we'll see if that ever happens :-) Im hoping RABLET makes it less urgent. If not, then its time to consider something different, perhaps some of the key individual compents such as forced instructoin selection can be done, or maybe YARA will be a success and you'll take care of it! Andrew
Re: Project RABLET
Andrew MacLeod wrote: o register pressure relief through live range splitting and/or rematerialization. We have no accurate information here, because after that there are passes which change the pressure like insn Sure, Im not suggesting that RABLET will reduce the register pressure to something that isn't going to spill. Far from it. I am saying that RABLET can reduce something completely unmanageable to something more manageable. instead of handing the RTL passes a basic block that contains a peak register pressure of 120 when there are 16 hardware registers, perhaps it will be a basic block that has been reduced down to a peak of 25 or something. The calculations at the tree level are only going to be rough, enough to use as a guideline like that. Having no information about the final register allocator decision, the partial register pressure reducing through rematerialization is not working in many cases. For example, making rematerialization of a - b + c when you reduce the pressure from 100 to 50 for x86 there is a big chance that b and c will be not placed in hard registers. Instead of one load (of a), two loads (b and c) will be needed. This result code is even worse than before reducing pressure. So rematerialization in out-of-ssa pass will work well only for full pressure relief (to the level equal to the number of hard registers) or close to the full relief. But even if you can decrease register pressure relief to the level of the hard register number, it is hard to know have you achieved the full register pressure relief because you can not be sure what register class will be used (e.g. AREG or GENERAL_REGS for x86). Although it can work for architectures with big regular register files (e.g. classic RISC processors). The SSA pressure relief through rematerialization described in Simpson's theses is oriented for such architectures (with a big regular register file size of 32 as I remember). So it can work for ppc but it will be less successful for major interest platforms x86 and x86_64.
Project RABLET
Last fall I produced the RABLE document which described the approach I thought should be taken to write a new register allocator for GCC. A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshot. It will surely look better but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time. This describes my current work-in-progress, RABLET, which stands for RABLE-Themes, and conveniently implies something smaller. Rather than write a new allocator, I think there are things we can do that are a lot less work which will reap many of the benefits. This works from the premise that we generate good code when we don't have to spill, so try to detect early that we are going to spill and do something about it at a point where we have good analysis. THEMES -- 1 - One of the core themes in RABLE was very early selection of instructions from patterns. RTL patterns are initially chosen by the EXPAND pass. EXPAND tends to generates better rtl patterns by being handed complex trees which it can process and get better combinations. When TREE-SSA was first implemented, we got very poor RTL because expand was seeing very small trees. TER (Temporary Expression Replacement) was invented, which mashed any single-def/single-use ssa_names together into more complex trees. This gave expand a better chance of selecting better instructions, and made a huge difference. 2 - Rematerialization/register pressure reduction was another major component. The tree-ssa optimizers pay no attention to potential register pressure (which is as it should be). This sometimes results in very high register pressure entering the back end. RABLE defined passes performing expression rematerialization and other register pressure reducing techniques to reduce excessive register pressure before actually trying to do allocation. If you have 120 register live at some point, and only 16 physical registers, the allocator is clearly going to have a difficult time. Try to present it with something more manageable. RABLET CORE --- On to the meat of the subject. RABLET involves reworking the out-of-ssa pass, rewriting expand such that it is tightly integrated with the out-of-ssa pass, and adding some new work to reduce register pressure. Thats a lot less work than a whole new register allocator. If we look at the RABLE architecture, the passes before global allocation are as follows: Live-range-disjointer instruction-selection Register coalescing register pressure reduction .. then regular register allocation activity ... In the context of RABLET: -Out of SSA naturally performs live range disjointing. -Initial RTL pattern selection is performed by expand. -Register coalescing - ssa_name coalescing is done by out-of-ssa, and ssa_name == register. -Register pressure reduction – This is new work which would be implemented. If we create a new black box super pass called out-of-ssa-expand-register-pressure-reducer, (ewww, lets just refer to it as ssa-to-rtl :-), then we'd have something close to the first part of RABLE. Not exactly of course, because 'instruction selection' means something slightly different. In RABLE it means choosing an instruction alternative from an RTL pattern , in RABLET it simple means choose good RTL patterns. “But wait...” I hear some clever person saying... “A lot of things happen between ssa-to-rtl and global register allocation.”. Hold that thought until I get through describing ssa-to-rtl since there are some important considerations which affect that discussion SSA-TO-RTL -- When out-of-ssa was originally written, tree-ssa was still evolving. We didnt know exactly what was going to be expected of it, so it was written to be flexible and had a lot of gorp added after the fact. TREE-SSA is now mature and we understand much more fully what is expected of translation out of ssa. I have begun rewriting it to be smaller and faster and to eliminate all the unused features which were initially provided. It is also being rewritten with RABLET in mind. The new generation out-of-ssa will eliminate PHIs, much as it does today. This is done by coalescing together ssa_names connected by copies (and PHIs are just copies), and issuing copies on edges required by the PHIs. Instead of then mapping these back to VAR_DECLs and writing this all back to the trees, it will simply be maintained in a partition list map. This is *key*. At this point, we have a mapping of what ssa_names have been coalesced together, and the copies required to perform this. The trees themselves are unaltered,
Re: Project RABLET
Andrew MacLeod wrote: A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshot. It will surely look better but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time. If you are starting from scratch would it not be better to adopt the approach of combining register allocation and scheduling. Significant progress has been made in this area in recent years.
Re: Project RABLET
On Fri, 2006-06-23 at 15:29 -0400, Robert Dewar wrote: Andrew MacLeod wrote: A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshot. It will surely look better but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time. If you are starting from scratch would it not be better to adopt the approach of combining register allocation and scheduling. Significant progress has been made in this area in recent years. I am personally not a believer in combining register allocation and scheduling. They are two different problems, and although there is some interaction, I am still in the keep them seperate camp. However, RABLET is not writing a register allocator so its moot anyway :-). Andrew
Re: Project RABLET
Andrew MacLeod wrote: I am personally not a believer in combining register allocation and scheduling. They are two different problems, and although there is some interaction, I am still in the keep them seperate camp. I disagree, there is in fact much more than some interaction, there is a very strong interaction between scheduling and register allocation, particularly on modern machines like the typical x86 chips (which have only a fraction of their registers directly nameable). The research results on combining the two steps looks very promising to me. However, RABLET is not writing a register allocator so its moot anyway :-). indeed, moot = disussable, undecided, so here we are discussing (or if you like to use the verb, mooting) the issue. Andrew
Re: Project RABLET
On Fri, Jun 23, 2006 at 04:30:01PM -0400, Robert Dewar wrote: However, RABLET is not writing a register allocator so its moot anyway :-). indeed, moot = disussable, undecided, so here we are discussing (or if you like to use the verb, mooting) the issue. Please try the other definition, which he clearly meant: 2. Of purely theoretical or academic interest; having no practical consequence; as, the team won in spite of the bad call, and whether the ruling was correct is a moot question. -- Daniel Jacobowitz CodeSourcery
Re: Project RABLET
Daniel Jacobowitz wrote: Please try the other definition, which he clearly meant: 2. Of purely theoretical or academic interest; having no practical consequence; as, the team won in spite of the bad call, and whether the ruling was correct is a moot question. Well I am not sure what he meant, but for sure it is not the case that optimal register allocation and scheduling is of only theoretical or academic interest with no practical consequences!
Re: Project RABLET
On 6/23/06, Robert Dewar [EMAIL PROTECTED] wrote: Well I am not sure what he meant, but for sure it is not the case that optimal register allocation and scheduling is of only theoretical or academic interest with no practical consequences! Thanks for making that point. Now, what do you think about this RABLET idea, which has nothing to do with either register allocation or scheduling? ;-) Gr. Steven
Re: Project RABLET
Steven Bosscher wrote: Now, what do you think about this RABLET idea, which has nothing to do with either register allocation or scheduling? ;-) Well I would not say that it has nothing to do with register allocation! But indeed this seems a promising approach. The real question in my mind is whether it can be done in a way that simplifies and clarifies rather than adding to what is now very complex code to follow. I think the answer to that is probably yes.
Re: Project RABLET
Andrew MacLeod [EMAIL PROTECTED] writes: This describes my current work-in-progress, RABLET, which stands for RABLE-Themes, and conveniently implies something smaller. Thanks for this proposal. ssa-to-rtl spill cost analysis global allocation spiller spill location optimizer instruction rewriter. You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. If expand is made much smarter, I would argue that much of GCSE and CSE isn't needed. We've already performed those optimizations at a high level, and we can hopefully do a lot of the factoring and things on addressing registers exposed during expand. I'm sure there are other things to do, but I would argue that they are significantly less than a general purpose CSE and GCSE pass. And in the cases of high register pressure, how much would you want them to do anyway? Its really these high register pressure areas that RABLET is attacking anyway. Here I think you are waving your hands a little too hard. RTL level CSE is significant for handling common expressions exposed by address calculations and by DImode (and larger) computations. On some processors giving up CSE on address calculations would be very painful. There needs to be a plan to handle that. Also at present may vector calculations are not exposed at the tree level--they are hidden inside builtin functions until they are expanded--and vector heavy code can also have a lot of common subexpressions. If I recall, scheduling is register pressure aware and normally doesn't increase register pressure dramatically. If it does increase pressure, well, this won't solve every problem after all. Unfortunately, scheduling is currently not register pressure aware at all. The scheduler will gleefully increase register pressure. That's why we don't even run the scheduler before register allocation on x86. Modulo the above comments, I don't see anything wrong with your basic idea. But I also wonder whether you couldn't get a similar effect by forcing instruction selection to occur before register allocation. If that is done well, reload will have much less work to do. One of the basic issues with the current code is not that we do register allocation well or poorly, but that reload takes the output of the register allocator and munges it unpredictably. That's going to happen with your proposal as well. It doesn't mean that your proposal won't improve things. But no register allocator can do a good job when it can't make the final decisions. Ian
Re: Project RABLET
On 6/23/06, Andrew MacLeod [EMAIL PROTECTED] wrote: ... 1 - One of the core themes in RABLE was very early selection of instructions from patterns. RTL patterns are initially chosen by the EXPAND pass. EXPAND tends to generates better rtl patterns by being handed complex trees which it can process and get better combinations. When TREE-SSA was first implemented, we got very poor RTL because expand was seeing very small trees. TER (Temporary Expression Replacement) was invented, which mashed any single-def/single-use ssa_names together into more complex trees. This gave expand a better chance of selecting better instructions, and made a huge difference. Have you considered using BURG/IBURG style tree pattern matching instruction selection ? http://www.cs.princeton.edu/software/iburg/ That approach can certainly provide a low register pressure high quality instruction selection. -- #pragma ident Seongbae Park, compiler, http://seongbae.blogspot.com;
Re: Project RABLET
Ian Lance Taylor wrote: Andrew MacLeod [EMAIL PROTECTED] writes: This describes my current work-in-progress, RABLET, which stands for RABLE-Themes, and conveniently implies something smaller. Thanks for this proposal. ssa-to-rtl spill cost analysis global allocation spiller spill location optimizer instruction rewriter. You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. If expand is made much smarter, I would argue that much of GCSE and CSE isn't needed. We've already performed those optimizations at a high level, and we can hopefully do a lot of the factoring and things on addressing registers exposed during expand. I'm sure there are other things to do, but I would argue that they are significantly less than a general purpose CSE and GCSE pass. And in the cases of high register pressure, how much would you want them to do anyway? Its really these high register pressure areas that RABLET is attacking anyway. Here I think you are waving your hands a little too hard. RTL level CSE is significant for handling common expressions exposed by address calculations and by DImode (and larger) computations. On some processors giving up CSE on address calculations would be very painful. There needs to be a plan to handle that. I agree with Ian completely. Also, after having stared and worked on df in the backend with Kenny and watched the amount of work that has had to be done, i think you may be underestimating the complexity of what is really going on in the backend right now. Not that i wouldn't love to see our backend become simpler and have a bunch of relatively non-complex df based passes, because I would, but i *also* don't think RABLET is going to enable that (or the removal of CSE/GCSE) through smarter expand. It's possible you'd remove GCSE, but only because the last time i remember someone looking (stevenb, i think), it wasn't doing all *that* much. Again, like Ian, I'd argue you'd need to do real instruction selection before register allocation before that can happen. Luckily, these days, BURG based instruction selection has become production usable, so that task isn't as horrid as it used to be. --Dan
Re: Project RABLET
On Fri, 2006-06-23 at 23:08 -0400, Daniel Berlin wrote: Ian Lance Taylor wrote: Here I think you are waving your hands a little too hard. RTL level CSE is significant for handling common expressions exposed by address calculations and by DImode (and larger) computations. On some processors giving up CSE on address calculations would be very painful. There needs to be a plan to handle that. I agree with Ian completely. Also, after having stared and worked on df in the backend with Kenny and watched the amount of work that has had to be done, i think you may be underestimating the complexity of what is really going on in the backend right now. Not that i wouldn't love to see our backend become simpler and have a bunch of relatively non-complex df based passes, because I would, but i *also* don't think RABLET is going to enable that (or the removal of CSE/GCSE) through smarter expand. It's possible you'd remove GCSE, but only because the last time i remember someone looking (stevenb, i think), it wasn't doing all *that* much. Again, like Ian, I'd argue you'd need to do real instruction selection before register allocation before that can happen. Luckily, these days, BURG based instruction selection has become production usable, so that task isn't as horrid as it used to be. It occurs to me I think there is a misunderstanding here. Perhaps I didn't communicate this well enough, or perhaps I got a little carried away trying to make RABLET look like RABLE. Im not actually proposing that RABLET will enable the backend to suddenly become simple... The initial impact of RABLET is to simply remove some of the onus of dealing with excessive register pressure from the register allocator. RABLET will really do nothing when register pressure is not high, things would be pretty much exactly as they are today. When register pressure is high, many of the things the RTL optimizations I mentioned do really become irrelevant (I think), since they increase register pressure more, and cause more spilling. This generally offsets whatever good they do. I was trying to claim that some level of this work can be done in expand and in *this* circumstance, thats all that needs to be done. Making the resulting model look somewhat like RABLE, and simplifying the view of the RTL optimizations. Its possible that some of this work can simplify the RTL optimizations in other cases, perhaps not. If we can simplify anything, that great. I'd love to see it, and I hope some of it is possible. I do see possibilities that will hopefully pan out. Ultimately, RABLET simply tries to present the backend with code that looks more like low register pressure code which the current backend is pretty good at handling. Anything else we can get from it is a bonus. For the work involved in RABLET vs. the work involved in a new allocator like RABLE, I think RABLET is well worth doing. (months vs. years). I think RABLET will show a significant benefit. (famous last words, ho ho :-) Andrew
Re: Project RABLET
On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote: You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. Which brings up a good discussion. I presume the rtl loop optimizers see things exposed by addressing modes which aren't seen in the higher level code. I wonder what the big gains are here... and if they are detectable at expansion time... In general, I didnt mention anything that tends not to increase register pressure, at least not in any significant manner as far as RABLET is concerned. If expand is made much smarter, I would argue that much of GCSE and CSE isn't needed. We've already performed those optimizations at a high level, and we can hopefully do a lot of the factoring and things on addressing registers exposed during expand. I'm sure there are other things to do, but I would argue that they are significantly less than a general purpose CSE and GCSE pass. And in the cases of high register pressure, how much would you want them to do anyway? Its really these high register pressure areas that RABLET is attacking anyway. Here I think you are waving your hands a little too hard. RTL level CSE is significant for handling common expressions exposed by address calculations and by DImode (and larger) computations. On some processors giving up CSE on address calculations would be very painful. There needs to be a plan to handle that. Yes, there is some hand waving, mostly because I haven't gotten that far in details yet. I expect to be able to do some of this type of commoning at rtl generation time as things are generated. (much like RABLE's spiller reuses spill loads nearby). That may turn out to be more difficult than I anticipate however. Pain is in the implementation :-) I am not proposing that CSE necessarily be eliminated *all* the time, but in cases when register pressure is already excessively high, is further commoning of DImode values going to make things better? Its really this case I'm interested in evaluating since this is the case we already have problems. if we don't spill, RABLET would effectively do nothing. Clearly there will be a lot of further investigation required once implementation reaches this point. Ultimately CSE and all RTL optimizations can be re-evaluated to see if things can be simplified. Also at present may vector calculations are not exposed at the tree level--they are hidden inside builtin functions until they are expanded--and vector heavy code can also have a lot of common subexpressions. I have no plan at moment for vector operations :-). That could change, but for now we'll have to keep whatever we do today for those. If I recall, scheduling is register pressure aware and normally doesn't increase register pressure dramatically. If it does increase pressure, well, this won't solve every problem after all. Unfortunately, scheduling is currently not register pressure aware at all. The scheduler will gleefully increase register pressure. That's why we don't even run the scheduler before register allocation on x86. hum, too bad. for some reason I was under the impression that it at least tried not to increase register pressure when it was above a certain threshold value. Not running it at least means it wont increase register pressure, so that works :-) Modulo the above comments, I don't see anything wrong with your basic idea. But I also wonder whether you couldn't get a similar effect by forcing instruction selection to occur before register allocation. If that is done well, reload will have much less work to do. That was one of the premises of RABLE. Since out of ssa needs some TLC and TER has been a wart for years, this seems like a good way of dealing with those issues, and perhaps dealing with some significant RA issues at the same time. (Anything to avoid actually rewriting RA eh!) One of the basic issues with the current code is not that we do register allocation well or poorly, but that reload takes the output of the register allocator and munges it unpredictably. That's going to happen with your proposal as well. It doesn't mean that your proposal won't improve things. But no register allocator can do a good job when it can't make the final decisions. Truer words have never been spoken. RABLET makes no attempt to do anything about reload. It simply attempts to present the backend with code that isn't full of excessive register pressure. If it turns out to be something reload screws up today, it will continue to be screwed up. I suspect a lot of the time we do have excessive spill, RABLET will show benefit. Its clearly not as good as a new register allocator would be, but the effort to benefit ratio ought to be a lot higher for RABLET than for a register allocator rewrite. Andrew
Re: Project RABLET
On Jun 23, 2006, at 9:39 PM, Andrew MacLeod wrote: On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote: You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. Which brings up a good discussion. I presume the rtl loop optimizers see things exposed by addressing modes which aren't seen in the higher level code. I wonder what the big gains are here... and if they are detectable at expansion time... The one rtl loop optimizer which has nothing to do with addressing modes and loops is the doloop optimizer which is most likely possible to do expansion time and is one of the few loop optimizer which lowers register pressure. The reason why it lowers register pressure is because it makes use of a special register for loops (at least on PowerPC). Thanks, Andrew Pinski
Re: Project RABLET
Andrew MacLeod [EMAIL PROTECTED] writes: On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote: You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant. Which brings up a good discussion. I presume the rtl loop optimizers see things exposed by addressing modes which aren't seen in the higher level code. I wonder what the big gains are here... and if they are detectable at expansion time... One obvious gain is hoisting constants exposed by address expansion out of loops. Also once addressing modes are expanded, there are new IVs. I am not proposing that CSE necessarily be eliminated *all* the time, but in cases when register pressure is already excessively high, is further commoning of DImode values going to make things better? Its really this case I'm interested in evaluating since this is the case we already have problems. if we don't spill, RABLET would effectively do nothing. I think that even when pressure is high, it helps a lot to do CSE after DImode values have been split up, as will be the case even today for, e.g., DImode bitwise operations. It tends to reduce register pressure if anything. As you say, none of these arguments that RABLET is a bad idea, they are just arguments that we can't expect to remove the RTL passes without a lot more work, whether or not they increase register pressure. One thing we could perhaps consider would be expanding addressing mode calculations at the tree level. Ian