Re: Project RABLET

2006-07-19 Thread Rask Ingemann Lambertsen
On Fri, Jun 23, 2006 at 03:23:04PM -0400, Andrew MacLeod wrote:
 A new register allocator written from scratch is a very long term
 project (measured in years), and there is no guarantee after all that
 work that we'd end up with something which is remarkably better. One
 would hope that it is a lot more maintainable, but the generated code is
 a crapshot. It will surely look better but will it really run faster?
 The current plate of spaghetti we call the register allocator has had a
 lot of fine tuning go into it over the years, and it generally generates
 pretty darn good code IF it doesn't have to spill much, which is much of
 the time.

One area where the current register allocator is really lousy is when faced
with pseudos spanning multiple registers. A recent example:
URL:http://gcc.gnu.org/ml/gcc-help/2006-04/msg00064.html. And 16-bit
targets suffer from this much of the time. I can only imagine that the AVR
camp must be tearing their hair out in frustration. The register allocater
really needs to be able to allocate subregs independently.

-- 
Rask Ingemann Lambertsen


Re: Project RABLET

2006-06-26 Thread Andrew MacLeod
On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote:
 Andrew MacLeod wrote:

   Having no information about the final register allocator decision,
 the partial register pressure reducing through rematerialization is
 not working in many cases.  For example, making rematerialization of
 
 a - b + c
 
 when you reduce the pressure from 100 to 50 for x86 there is a big
 chance that b and c will be not placed in hard registers.  Instead of
 one load (of a), two loads (b and c) will be needed.  This result code
 is even worse than before reducing pressure.

Having implemented a complex rematerialization pass before, I understand
it well enough to know that replacing 'a' with 'b + c' is the wrong
thing to do, unless at least one of b or c is used again right next to
the use of 'a'. That can actually increase register pressure, or have
nothing more than a neutral effect. Its *way* more complicated than
simply moving expressions downward. Its tracking all the things used in
expressions and determining that at a given location, it *is* worthwhile
to do it because the correct values are already trivially available, or
can alternatively be recomputed in a worthwhile way. Often, there are
only a few worthwhile remats out of all the possible ones.

In general, I have never found the primary benefit of rematerialization
to be register pressure reduction, but rather one of avoiding placing
stores to spill in the instruction stream.  When there is a lot of
spilling, those stores can really bog down a pipeline. 

 
   So rematerialization in out-of-ssa pass will work well only for full
 pressure relief (to the level equal to the number of hard registers)
 or close to the full relief.

That's a pretty broad assertion to make, and I disagree with it :-)
Especially when you only talk about a single component of the whole. 

 RABLET is a group of things which tend to enable each other. Any one of
them by themselves would actually accomplish less. Some of the required
components have a benefit of some sort unrelated to the others (such as
faster out of ssa translation). Others require interaction with the
whole to see any benefit. 

Remember that one of the components is a new integrated expand... RABLET
will have an understanding of the instructions that can be generated,
and will try to make use of those when making decisions. 

An additional benefit likely to be seen from this is that code tends to
utilize more of the variable names the user will recognize, rather than
something like 'SPILL_678'. And maybe, just maybe, it'll be easier to
get the debug info correct (TER can muck it up pretty badly :-)

 
 The SSA pressure relief through rematerialization described in
 Simpson's theses is oriented for such architectures (with a big
 regular register file size of 32 as I remember).  So it can work for
 ppc but it will be less successful for major interest platforms x86 and
 x86_64.

I haven't read the thesis, but I would be surprised if it describes what
I am planning to do. Without the integration with expand to do
instruction selection, a few tweaks in some RTL optimizations, and a
very specific gcc-oriented union of all these components, what I am
planning to do would be completely worthless. Remat is really a small
part of it.

What benefit I will see?  Well, time will tell :-)

Andrew



Re: Project RABLET

2006-06-26 Thread Richard Guenther

On 6/26/06, Andrew MacLeod [EMAIL PROTECTED] wrote:

On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote:
 Andrew MacLeod wrote:

   Having no information about the final register allocator decision,
 the partial register pressure reducing through rematerialization is
 not working in many cases.  For example, making rematerialization of

 a - b + c

 when you reduce the pressure from 100 to 50 for x86 there is a big
 chance that b and c will be not placed in hard registers.  Instead of
 one load (of a), two loads (b and c) will be needed.  This result code
 is even worse than before reducing pressure.

Having implemented a complex rematerialization pass before, I understand
it well enough to know that replacing 'a' with 'b + c' is the wrong
thing to do, unless at least one of b or c is used again right next to
the use of 'a'. That can actually increase register pressure, or have
nothing more than a neutral effect. Its *way* more complicated than
simply moving expressions downward. Its tracking all the things used in
expressions and determining that at a given location, it *is* worthwhile
to do it because the correct values are already trivially available, or
can alternatively be recomputed in a worthwhile way. Often, there are
only a few worthwhile remats out of all the possible ones.

In general, I have never found the primary benefit of rematerialization
to be register pressure reduction, but rather one of avoiding placing
stores to spill in the instruction stream.  When there is a lot of
spilling, those stores can really bog down a pipeline.


   So rematerialization in out-of-ssa pass will work well only for full
 pressure relief (to the level equal to the number of hard registers)
 or close to the full relief.

That's a pretty broad assertion to make, and I disagree with it :-)
Especially when you only talk about a single component of the whole.

 RABLET is a group of things which tend to enable each other. Any one of
them by themselves would actually accomplish less. Some of the required
components have a benefit of some sort unrelated to the others (such as
faster out of ssa translation). Others require interaction with the
whole to see any benefit.

Remember that one of the components is a new integrated expand... RABLET
will have an understanding of the instructions that can be generated,
and will try to make use of those when making decisions.


I think the overall idea of RABLET is quite sound.  Especially if integrating
expand with out-of-ssa this might provide early instruction selection (now, that
would possibly require moving all of RTL loop optimizations before this point).

Of course re-doing expand is the hard part here...

Richard.


Re: Project RABLET

2006-06-26 Thread Vladimir Makarov

Andrew MacLeod wrote:


On Sun, 2006-06-25 at 01:04 -0400, Vladimir N. Makarov wrote:
 


Andrew MacLeod wrote:
   


The SSA pressure relief through rematerialization described in
Simpson's theses is oriented for such architectures (with a big
regular register file size of 32 as I remember).  So it can work for
ppc but it will be less successful for major interest platforms x86 and
x86_64.
   



I haven't read the thesis, but I would be surprised if it describes what
I am planning to do. Without the integration with expand to do
instruction selection, a few tweaks in some RTL optimizations, and a
very specific gcc-oriented union of all these components, what I am
planning to do would be completely worthless. Remat is really a small
part of it.
 

The thesis contains a small part about different heuristics of reducing 
register pressure on SSA through rematerializaion (what and where to 
rematerialize).


I see your plan is much bigger, not only reducing through 
rematerialization but other things including better RTL expansion and, 
more imprortant, doing it in an integrated way.


In any case, it looks as an interesting project.  Good luck with that.


What benefit I will see?  Well, time will tell :-)

 






Re: Project RABLET

2006-06-24 Thread Steven Bosscher

On 6/24/06, Andrew MacLeod [EMAIL PROTECTED] wrote:

On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:

 You omitted the RTL loop optimizer passes, which still do quite a bit
 of work despite the tree-ssa loop passes.  Also if-conversion and some
 minor passes, though they are less relevant.

Which brings up a good discussion. I presume the rtl loop optimizers see
things exposed by addressing modes which aren't seen in the higher level
code. I wonder what the big gains are here...


Knowning which address computations are loop invariant. Knowing the
number of instructions (sadly not the exact size because instructions
haven't been selected) to determine whether it is worth
unrolling/peeling/unswitching a loop. Finding loops that can use a
doloop pattern.


and if they are
detectable at expansion time...


For most of them, I don't think so.


In general, I didnt mention anything that tends not to increase register
pressure, at least not in any significant manner as far as RABLET is
concerned.


So do you have hard data showing that CSE increases register pressure?
Given the thinks CSE does, it would probably be much more useful,
then, to make it possible to have liveness information in CSE so that
it can take register pressure into account in its cost considerations
;-)  No magic new expand is going to make CSE obsolete, and it simply
does too much to just throw it out. (FWIW I'm still working on
simplifying cse.c...)



Clearly there will be a lot of further investigation required once
implementation reaches this point. Ultimately CSE and all RTL
optimizations can be re-evaluated to see if things can be simplified.


*laughs*

Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns
out we lose without it. Good luck to you, but I think you're seriously
underestimating the complexity of things here.



 Modulo the above comments, I don't see anything wrong with your basic
 idea.  But I also wonder whether you couldn't get a similar effect by
 forcing instruction selection to occur before register allocation.  If
 that is done well, reload will have much less work to do.


Hurray.
This is what new-ra did. It was probably the only thing there that
worked well, but it was a great idea. (Sadly it was just reload
rewritten so pre-reload.c was ugly, but the idea was good).


Its clearly not as good as a new register allocator would be, but the
effort to benefit ratio ought to be a lot higher for RABLET than for a
register allocator rewrite.


There is a register allocator rewrite under way, from one of your
co-workers even. Is there any relation between Vlad's project and
yours, or are you going different ways with the same goal in mind? :-D

Gr.
Steven


Re: Project RABLET

2006-06-24 Thread Vladimir N. Makarov

Steven Bosscher wrote:



Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns
out we lose without it. Good luck to you, but I think you're seriously
underestimating the complexity of things here.




Its clearly not as good as a new register allocator would be, but the
effort to benefit ratio ought to be a lot higher for RABLET than for a
register allocator rewrite.



There is a register allocator rewrite under way, from one of your
co-workers even. Is there any relation between Vlad's project and
yours, or are you going different ways with the same goal in mind? :-D



 Working on register allocation issues last three years and looking
at the new-ra project I can say that any project in this area has a
big chance to fail despite how good design looked at the first glance.
The problem is in complexity of RTL and lot of ports with very
specific issues which are described by gazillion of macros.  Redesign
and simplification of RTL could solve a lot of problems like code
selection, register allocation etc (although might create others).
But this task is much bigger than introducing tree-SSA because it
means rewriting all machine description files and practically equal to
redoing all ports.  Do we have resources for this?  I don't think so.

 So saying this, my point of view that the more projects we have in
this area, the better chance we will have to solve the problem.
Therefore I really appreciate what Andrew and Bernd Schmidt do.  It
might look as a waste of resources but we can not people force not to
do what they believe and want to do (e.g. we can not force Bernd not
to improve reload because he can work on a new register allocator. He
improves reload because he knows it best than others).

 As for Andrew's proposal, my opinion is that all this
transformations are done too early and we need them to do again on
rtl sometime.

o coalescing.  CSE can create more moves but more important thing is
 the extended coalescing can not be done here (or I don't know how it
 can be done here).  It is about moves generated because of
 two-address architecture constraints (regmove and global tries to
 solve this problem in ad hoc way e.g. through hard register
 preference by global).  It should be part of coalescing pass,
 because removing a move can prevent removing a higher priority move
 generated by reload because of the two address constraints.

o register pressure relief through live range splitting and/or
 rematerialization.  We have no accurate information here, because
 after that there are passes which change the pressure like insn
 scheduling and CSE.  Although insn scheduling has heuristic not to
 increase register pressure, it has very small priority (third or
 fourth).  Therefore insn scheduling can increase the pressure a lot
 (but sometimes decrease it too).  Insn scheduler with register
 renaming being implemented by ISP RAS might solve this problem, if
 it works only after the register allocator.  But this insn scheduler
 can work before the register allocator too and only its usage will
 show will it work only after the register allocator or in
 traditional way (before and the after the register allocator).

 Even without changing the register pressure by subsequent passes,
 there is another problem which is difficulty to calculate the
 register pressure excess.  We don't know what register class will be
 used for a pseudo-register (e.g. AREG or GENERAL_REGS for x86 which
 creates difference 6 in the register pressure).  Although reducing
 register pressure from 100 to 6 will be very helpful, my experience
 shows that the most frequent and interesting cases are on the
 border.

o register renaming is already done and effectively (because it uses
 the data flow analysis framework) by -fweb.  But I think it can be
 done more effectively by out-of-ssa pass.

 Actually what Andrew proposes (and more) I did two years ago on RTL
level close to the register allocator (see gcc summit article
fighting register pressure in gcc).  The result was not satisfactory
for me and I moved on rewriting the register allocator.  Probably, I
should have committed more what I've done into the mainline.

 Probably what Andrew proposes can be done faster on tree-SSA
although doing it on RTL we would have more accurate information.  In
any case it will improve code in some cases and can be used as a
temporary solution (until new register allocator projects will be done
or forever if they failed).  Andrew's proposal has a sense too with
the code reuse point of view if he wants to move on with RABLE
project.

 As for my project YARA, I don't know when it will be ready for the
mainline (at least one more year) because it includes removing reload
(the biggest and most complicated part of the RA).  It works now only
for x86 and x86_64, generates better code for SPECint2000 and
SPECFp2000 (at least for pentium4, nocona and coming woodcrest.  I
have no free AMD machine to make benchmarking).  I've just started

Re: Project RABLET

2006-06-24 Thread Andrew MacLeod
On Sat, 2006-06-24 at 13:04 +0200, Steven Bosscher wrote:
 On 6/24/06, Andrew MacLeod [EMAIL PROTECTED] wrote:
  On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:
 =
  In general, I didnt mention anything that tends not to increase register
  pressure, at least not in any significant manner as far as RABLET is
  concerned.
 
 So do you have hard data showing that CSE increases register pressure?
 Given the thinks CSE does, it would probably be much more useful,
 then, to make it possible to have liveness information in CSE so that
 it can take register pressure into account in its cost considerations
 ;-)  No magic new expand is going to make CSE obsolete, and it simply
 does too much to just throw it out. (FWIW I'm still working on
 simplifying cse.c...)

no, no hard data, its just the kind of activity which can undo what
RABLET proposes doing. Ie, an expression which I go to the effort to
rematerializing in 3 places is likely to be commoned back to the
original live range without stepping in and telling CSE not to do that.

What I really had in mind was marking these register
values/expression/whatever with a flag such that when CSE or GCSE or
whoever asks, they get returned as not-to-be-looked-at. That way they
don't undo the opportunities which RABLET has so kindly exposed to
these optimizations :-) 


 
  Clearly there will be a lot of further investigation required once
  implementation reaches this point. Ultimately CSE and all RTL
  optimizations can be re-evaluated to see if things can be simplified.
 
 *laughs*
 
 Every time some RTL optimizer is re-re-re-re-re-evaluated, it turns
 out we lose without it. Good luck to you, but I think you're seriously
 underestimating the complexity of things here.

I was not really looking to rewrite the passes so much as tell them not
to work on certain registers. This could potentially be extended to
ssa_names which the tree optimizers have processed and which get
expanded into single RTL registers. A new expand would know that the
translation into RTL was simple enough that nothing new has really been
exposed to the RTL optimizers.  That was my thought anyway. Then CSE et
al work on whatever is left. Perhaps there are then some hunks that can
be remoived as redundant, or maybe not.


  Its clearly not as good as a new register allocator would be, but the
  effort to benefit ratio ought to be a lot higher for RABLET than for a
  register allocator rewrite.
 
 There is a register allocator rewrite under way, from one of your
 co-workers even. Is there any relation between Vlad's project and
 yours, or are you going different ways with the same goal in mind? :-D

Totally different scale of project with completely different goal. Had I
set out and actually started writing RABLE, that would be going in
different directions with the same goal. In theory, RABLET should make
the job of any register allocator a bit easier.

These days, I think any register allocator's goal ought to be to assign
registers and be the final authority. No reload undoing any of the work.
That is well beyond the scope of what Im doing. It is within the scope
of what Vlad is doing.

Andrew




Re: Project RABLET

2006-06-24 Thread Andrew MacLeod
On Sat, 2006-06-24 at 12:36 -0400, Vladimir N. Makarov wrote:
 Steven Bosscher wrote:

   As for Andrew's proposal, my opinion is that all this
 transformations are done too early and we need them to do again on
 rtl sometime.
 
 o coalescing.  CSE can create more moves but more important thing is

RABLET will do nothing different than is done today.. out of ssa
coalesces ssa_names out the wazoo. In the interest of register pressure
reduction, it may actually coalesce less to split up live ranges, and
leave loads/stores from/to the stack.
 
 
 o register pressure relief through live range splitting and/or
   rematerialization.  We have no accurate information here, because
   after that there are passes which change the pressure like insn

Sure, Im not suggesting that RABLET will reduce the register pressure to
something that isn't going to spill. Far from it. I am saying that
RABLET can reduce something completely unmanageable to something more
manageable. instead of handing the RTL passes a basic block that
contains a peak register pressure of 120 when there are 16 hardware
registers, perhaps it will be a basic block that has been reduced down
to a peak of 25 or something. The calculations at the tree level are
only going to be rough, enough to use as a guideline like that.  

If RA doesnt have to spill its guts, it has a chance to do a better job
I think.

 
   scheduling and CSE.  Although insn scheduling has heuristic not to
   increase register pressure, it has very small priority (third or
   fourth).  Therefore insn scheduling can increase the pressure a lot

sure, but it wont increase it from 25 back up to 140, so there should
still be benefit.

   Actually what Andrew proposes (and more) I did two years ago on RTL
 level close to the register allocator (see gcc summit article
 fighting register pressure in gcc).  The result was not satisfactory
 for me and I moved on rewriting the register allocator.  Probably, I
 should have committed more what I've done into the mainline.
 

I think its hard to do what I am going to do at the RTL level. I have
all the information from tree-ssa available to make quite a few
interesting decisions. and with a rewritten expand, decisions about
whether things are regisrer or memory based can be made more fine grain.
It just seems like the last good place to do some of this work. ANd
perhaps we can get better instructions selected by seeing more that
expand currently sees.
 
   Probably what Andrew proposes can be done faster on tree-SSA
 although doing it on RTL we would have more accurate information.  In
 any case it will improve code in some cases and can be used as a
 temporary solution (until new register allocator projects will be done
 or forever if they failed).  Andrew's proposal has a sense too with
 the code reuse point of view if he wants to move on with RABLE
 project.

we'll see if that ever happens :-) Im hoping RABLET makes it less
urgent. If not, then its time to consider something different, perhaps
some of the key individual compents such as forced instructoin selection
can be done, or maybe YARA will be a success and you'll take care of it!


Andrew



Re: Project RABLET

2006-06-24 Thread Vladimir N. Makarov

Andrew MacLeod wrote:




o register pressure relief through live range splitting and/or
 rematerialization.  We have no accurate information here, because
 after that there are passes which change the pressure like insn
   



Sure, Im not suggesting that RABLET will reduce the register pressure to
something that isn't going to spill. Far from it. I am saying that
RABLET can reduce something completely unmanageable to something more
manageable. instead of handing the RTL passes a basic block that
contains a peak register pressure of 120 when there are 16 hardware
registers, perhaps it will be a basic block that has been reduced down
to a peak of 25 or something. The calculations at the tree level are
only going to be rough, enough to use as a guideline like that.  

 


 Having no information about the final register allocator decision,
the partial register pressure reducing through rematerialization is
not working in many cases.  For example, making rematerialization of

a - b + c

when you reduce the pressure from 100 to 50 for x86 there is a big
chance that b and c will be not placed in hard registers.  Instead of
one load (of a), two loads (b and c) will be needed.  This result code
is even worse than before reducing pressure.

 So rematerialization in out-of-ssa pass will work well only for full
pressure relief (to the level equal to the number of hard registers)
or close to the full relief.

 But even if you can decrease register pressure relief to the level
of the hard register number, it is hard to know have you achieved the
full register pressure relief because you can not be sure what
register class will be used (e.g. AREG or GENERAL_REGS for x86).
Although it can work for architectures with big regular register
files (e.g. classic RISC processors).

The SSA pressure relief through rematerialization described in
Simpson's theses is oriented for such architectures (with a big
regular register file size of 32 as I remember).  So it can work for
ppc but it will be less successful for major interest platforms x86 and
x86_64.



Project RABLET

2006-06-23 Thread Andrew MacLeod

Last fall I produced the RABLE document which described the approach I
thought should be taken to write a new register allocator for GCC.

A new register allocator written from scratch is a very long term
project (measured in years), and there is no guarantee after all that
work that we'd end up with something which is remarkably better. One
would hope that it is a lot more maintainable, but the generated code is
a crapshot. It will surely look better but will it really run faster?
The current plate of spaghetti we call the register allocator has had a
lot of fine tuning go into it over the years, and it generally generates
pretty darn good code IF it doesn't have to spill much, which is much of
the time.

This describes my current work-in-progress, RABLET, which stands for
RABLE-Themes, and conveniently implies something smaller. Rather than
write a new allocator, I think there are things we can do that are a lot
less work which will reap many of the benefits. This works from the
premise that we generate good code when we don't have to spill, so try
to detect early that we are going to spill and do something about it at
a point where we have good analysis.

THEMES
--

1 - One of the core themes in RABLE was very early selection of
instructions from patterns.  RTL patterns are initially chosen by the
EXPAND pass. EXPAND tends to generates better rtl patterns by being
handed complex trees which it can process and get better combinations.

  When TREE-SSA was first implemented, we got very poor RTL because
expand was seeing very small trees.  TER (Temporary Expression
Replacement) was invented, which mashed any single-def/single-use
ssa_names together into more complex trees. This gave expand a better
chance of selecting better instructions, and made a huge difference.

2 - Rematerialization/register pressure reduction was another major
component.  The tree-ssa optimizers pay no attention to potential
register pressure (which is as it should be). This sometimes results in
very high register pressure entering the back end.  RABLE defined passes
performing expression rematerialization and other register pressure
reducing techniques to reduce excessive register pressure before
actually trying to do allocation.  If you have 120 register live at some
point, and only 16 physical registers, the allocator is clearly going to
have a difficult time. Try to present it with something more manageable.

RABLET CORE
---

On to the meat of the subject. RABLET involves reworking the out-of-ssa
pass, rewriting expand such that it is tightly integrated with the
out-of-ssa pass, and adding some new work to reduce register pressure.
Thats a lot less work than a whole new register allocator.

If we look at the RABLE architecture, the passes before global
allocation are as follows:

Live-range-disjointer
instruction-selection
Register coalescing
register pressure reduction
.. then regular register allocation activity ...

In the context of RABLET:

-Out of SSA naturally performs live range disjointing.
-Initial RTL pattern selection is performed by expand.
-Register coalescing -  ssa_name coalescing is done by out-of-ssa, and
ssa_name == register.
-Register pressure reduction – This is new work which would be
implemented.

If we create a new black box super pass called
out-of-ssa-expand-register-pressure-reducer,  (ewww,  lets just refer to
it as ssa-to-rtl :-),  then we'd have something close to the first part
of RABLE. Not exactly of course, because 'instruction selection' means
something slightly different.  In RABLE it means choosing an instruction
alternative from an RTL pattern , in RABLET it simple means choose good
RTL patterns.

“But wait...” I hear some clever person saying... “A lot of things
happen between ssa-to-rtl and global register allocation.”.  Hold that
thought until I get through describing ssa-to-rtl since there are some
important considerations which affect that discussion

SSA-TO-RTL
--

When out-of-ssa was originally written, tree-ssa was still evolving. We
didnt know exactly what was going to be expected of it, so it was
written to be flexible and had a lot of gorp added after the fact.
TREE-SSA is now mature and we understand much more fully what is
expected of translation out of ssa. I have begun rewriting it to be
smaller and faster and to eliminate all the unused features which were
initially provided.  It is also being rewritten with RABLET in mind.  

The new generation out-of-ssa will eliminate PHIs, much as it does
today. This is done by coalescing together ssa_names connected by copies
(and PHIs are just copies), and issuing copies on edges required by the
PHIs.  Instead of  then mapping these back to VAR_DECLs and writing this
all back to the trees, it will simply be maintained in a partition list
map.  This is *key*. At this point, we have a mapping of what ssa_names
have been coalesced together, and the copies required to perform this.
The trees themselves are unaltered, 

Re: Project RABLET

2006-06-23 Thread Robert Dewar

Andrew MacLeod wrote:


A new register allocator written from scratch is a very long term
project (measured in years), and there is no guarantee after all that
work that we'd end up with something which is remarkably better. One
would hope that it is a lot more maintainable, but the generated code is
a crapshot. It will surely look better but will it really run faster?
The current plate of spaghetti we call the register allocator has had a
lot of fine tuning go into it over the years, and it generally generates
pretty darn good code IF it doesn't have to spill much, which is much of
the time.


If you are starting from scratch would it not be better to adopt
the approach of combining register allocation and scheduling.
Significant progress has been made in this area in recent years.



Re: Project RABLET

2006-06-23 Thread Andrew MacLeod
On Fri, 2006-06-23 at 15:29 -0400, Robert Dewar wrote:
 Andrew MacLeod wrote:
 
  A new register allocator written from scratch is a very long term
  project (measured in years), and there is no guarantee after all that
  work that we'd end up with something which is remarkably better. One
  would hope that it is a lot more maintainable, but the generated code is
  a crapshot. It will surely look better but will it really run faster?
  The current plate of spaghetti we call the register allocator has had a
  lot of fine tuning go into it over the years, and it generally generates
  pretty darn good code IF it doesn't have to spill much, which is much of
  the time.
 
 If you are starting from scratch would it not be better to adopt
 the approach of combining register allocation and scheduling.
 Significant progress has been made in this area in recent years.
 

I am personally not a believer in combining register allocation and
scheduling. They are two different problems, and although there is some
interaction, I am still in the keep them seperate camp. 

However, RABLET is not writing a register allocator so its moot
anyway :-).

Andrew



Re: Project RABLET

2006-06-23 Thread Robert Dewar

Andrew MacLeod wrote:


I am personally not a believer in combining register allocation and
scheduling. They are two different problems, and although there is some
interaction, I am still in the keep them seperate camp. 


I disagree, there is in fact much more than some interaction, there
is a very strong interaction between scheduling and register allocation,
particularly on modern machines like the typical x86 chips (which have
only a fraction of their registers directly nameable). The research
results on combining the two steps looks very promising to me.


However, RABLET is not writing a register allocator so its moot
anyway :-).


indeed, moot = disussable, undecided, so here we are discussing
(or if you like to use the verb, mooting) the issue.


Andrew





Re: Project RABLET

2006-06-23 Thread Daniel Jacobowitz
On Fri, Jun 23, 2006 at 04:30:01PM -0400, Robert Dewar wrote:
 However, RABLET is not writing a register allocator so its moot
 anyway :-).
 
 indeed, moot = disussable, undecided, so here we are discussing
 (or if you like to use the verb, mooting) the issue.

Please try the other definition, which he clearly meant:

 2. Of purely theoretical or academic interest; having no
practical consequence; as, the team won in spite of the
bad call, and whether the ruling was correct is a moot
question.

-- 
Daniel Jacobowitz
CodeSourcery


Re: Project RABLET

2006-06-23 Thread Robert Dewar

Daniel Jacobowitz wrote:


Please try the other definition, which he clearly meant:

 2. Of purely theoretical or academic interest; having no
practical consequence; as, the team won in spite of the
bad call, and whether the ruling was correct is a moot
question.


Well I am not sure what he meant, but for sure it is not the
case that optimal register allocation and scheduling is of only
theoretical or academic interest with no practical consequences!



Re: Project RABLET

2006-06-23 Thread Steven Bosscher

On 6/23/06, Robert Dewar [EMAIL PROTECTED] wrote:

Well I am not sure what he meant, but for sure it is not the
case that optimal register allocation and scheduling is of only
theoretical or academic interest with no practical consequences!


Thanks for making that point.

Now, what do you think about this RABLET idea, which has nothing to do
with either register allocation or scheduling? ;-)
Gr.
Steven


Re: Project RABLET

2006-06-23 Thread Robert Dewar

Steven Bosscher wrote:


Now, what do you think about this RABLET idea, which has nothing to do
with either register allocation or scheduling? ;-)


Well I would not say that it has nothing to do with register allocation!
But indeed this seems a promising approach. The real question in my mind
is whether it can be done in a way that simplifies and clarifies rather
than adding to what is now very complex code to follow. I think the 
answer to that is probably yes.




Re: Project RABLET

2006-06-23 Thread Ian Lance Taylor
Andrew MacLeod [EMAIL PROTECTED] writes:

 This describes my current work-in-progress, RABLET, which stands for
 RABLE-Themes, and conveniently implies something smaller.

Thanks for this proposal.


 ssa-to-rtl
 spill cost analysis
 global allocation
 spiller
 spill location optimizer
 instruction rewriter.

You omitted the RTL loop optimizer passes, which still do quite a bit
of work despite the tree-ssa loop passes.  Also if-conversion and some
minor passes, though they are less relevant.


 If expand is made much smarter, I would argue that much of GCSE and CSE
 isn't needed.  We've already performed those optimizations at  a high
 level, and we can hopefully do a lot of the factoring and things on
 addressing registers exposed during expand.  I'm sure there are other
 things to do, but I would argue that they are significantly less than a
 general purpose CSE and GCSE pass. And in the cases of high register
 pressure, how much would you want them to do anyway?  Its really these
 high register pressure areas that RABLET is attacking anyway.

Here I think you are waving your hands a little too hard.  RTL level
CSE is significant for handling common expressions exposed by address
calculations and by DImode (and larger) computations.  On some
processors giving up CSE on address calculations would be very
painful.  There needs to be a plan to handle that.

Also at present may vector calculations are not exposed at the tree
level--they are hidden inside builtin functions until they are
expanded--and vector heavy code can also have a lot of common
subexpressions.


 If I recall, scheduling is register pressure aware and normally doesn't
 increase register pressure dramatically. If it does increase pressure,
 well, this won't solve every problem after all.

Unfortunately, scheduling is currently not register pressure aware at
all.  The scheduler will gleefully increase register pressure.  That's
why we don't even run the scheduler before register allocation on x86.


Modulo the above comments, I don't see anything wrong with your basic
idea.  But I also wonder whether you couldn't get a similar effect by
forcing instruction selection to occur before register allocation.  If
that is done well, reload will have much less work to do.

One of the basic issues with the current code is not that we do
register allocation well or poorly, but that reload takes the output
of the register allocator and munges it unpredictably.  That's going
to happen with your proposal as well.  It doesn't mean that your
proposal won't improve things.  But no register allocator can do a
good job when it can't make the final decisions.

Ian


Re: Project RABLET

2006-06-23 Thread Seongbae Park

On 6/23/06, Andrew MacLeod [EMAIL PROTECTED] wrote:
...

1 - One of the core themes in RABLE was very early selection of
instructions from patterns.  RTL patterns are initially chosen by the
EXPAND pass. EXPAND tends to generates better rtl patterns by being
handed complex trees which it can process and get better combinations.

  When TREE-SSA was first implemented, we got very poor RTL because
expand was seeing very small trees.  TER (Temporary Expression
Replacement) was invented, which mashed any single-def/single-use
ssa_names together into more complex trees. This gave expand a better
chance of selecting better instructions, and made a huge difference.


Have you considered using BURG/IBURG style tree pattern matching
instruction selection ?

http://www.cs.princeton.edu/software/iburg/

That approach can certainly provide a low register pressure
high quality instruction selection.
--
#pragma ident Seongbae Park, compiler, http://seongbae.blogspot.com;


Re: Project RABLET

2006-06-23 Thread Daniel Berlin
Ian Lance Taylor wrote:
 Andrew MacLeod [EMAIL PROTECTED] writes:
 
 This describes my current work-in-progress, RABLET, which stands for
 RABLE-Themes, and conveniently implies something smaller.
 
 Thanks for this proposal.
 
 
 ssa-to-rtl
 spill cost analysis
 global allocation
 spiller
 spill location optimizer
 instruction rewriter.
 
 You omitted the RTL loop optimizer passes, which still do quite a bit
 of work despite the tree-ssa loop passes.  Also if-conversion and some
 minor passes, though they are less relevant.
 
 
 If expand is made much smarter, I would argue that much of GCSE and CSE
 isn't needed.  We've already performed those optimizations at  a high
 level, and we can hopefully do a lot of the factoring and things on
 addressing registers exposed during expand.  I'm sure there are other
 things to do, but I would argue that they are significantly less than a
 general purpose CSE and GCSE pass. And in the cases of high register
 pressure, how much would you want them to do anyway?  Its really these
 high register pressure areas that RABLET is attacking anyway.
 
 Here I think you are waving your hands a little too hard.  RTL level
 CSE is significant for handling common expressions exposed by address
 calculations and by DImode (and larger) computations.  On some
 processors giving up CSE on address calculations would be very
 painful.  There needs to be a plan to handle that.

I agree with Ian completely.
Also, after having stared and worked on df in the backend with Kenny and
watched the amount of work that has had to be done, i think you may be
underestimating the complexity of what is really going on in the backend
right now.

Not that i wouldn't love to see our backend become simpler and have a
bunch of relatively non-complex df based passes, because I would, but i
*also* don't think RABLET is going to enable that (or the removal of
CSE/GCSE) through smarter expand.  It's possible you'd remove GCSE, but
only because the last time i remember someone looking (stevenb, i
think), it wasn't doing all *that* much.

Again, like Ian, I'd argue you'd need to do real instruction selection
before register allocation before that can happen.  Luckily, these days,
BURG based instruction selection has become production usable, so that
task isn't as horrid as it used to be.

--Dan


Re: Project RABLET

2006-06-23 Thread Andrew MacLeod
On Fri, 2006-06-23 at 23:08 -0400, Daniel Berlin wrote:
 Ian Lance Taylor wrote:

  
  Here I think you are waving your hands a little too hard.  RTL level
  CSE is significant for handling common expressions exposed by address
  calculations and by DImode (and larger) computations.  On some
  processors giving up CSE on address calculations would be very
  painful.  There needs to be a plan to handle that.
 
 I agree with Ian completely.
 Also, after having stared and worked on df in the backend with Kenny and
 watched the amount of work that has had to be done, i think you may be
 underestimating the complexity of what is really going on in the backend
 right now.
 
 Not that i wouldn't love to see our backend become simpler and have a
 bunch of relatively non-complex df based passes, because I would, but i
 *also* don't think RABLET is going to enable that (or the removal of
 CSE/GCSE) through smarter expand.  It's possible you'd remove GCSE, but
 only because the last time i remember someone looking (stevenb, i
 think), it wasn't doing all *that* much.
 
 Again, like Ian, I'd argue you'd need to do real instruction selection
 before register allocation before that can happen.  Luckily, these days,
 BURG based instruction selection has become production usable, so that
 task isn't as horrid as it used to be.

It occurs to me I think there is a misunderstanding here. Perhaps I
didn't communicate this well enough, or perhaps I got a little carried
away trying to make RABLET look like RABLE.

Im not actually proposing that RABLET will enable the backend to
suddenly become simple... The initial impact of RABLET is to simply
remove some of the onus of dealing with excessive register pressure from
the register allocator.

RABLET will really do nothing when register pressure is not high, things
would be pretty much exactly as they are today.

When register pressure is high, many of the things the RTL optimizations
I mentioned do really become irrelevant (I think), since they increase
register pressure more, and cause more spilling. This generally offsets
whatever good they do. I was trying to claim that some level of this
work can be done in expand and in *this* circumstance, thats all that
needs to be done. Making the resulting model look somewhat like RABLE,
and simplifying the view of the RTL optimizations. 

Its possible that some of this work can simplify the RTL optimizations
in other cases, perhaps not. If we can simplify anything, that great.
I'd love to see it, and I hope some of it is possible. I do see
possibilities that will hopefully pan out.

Ultimately, RABLET simply tries to present the backend with code that
looks more like low register pressure code which the current backend is
pretty good at handling. Anything else we can get from it is a bonus.

For the work involved in RABLET vs. the work involved in a new allocator
like RABLE, I think RABLET is well worth doing. (months vs. years). I
think RABLET will show a significant benefit. (famous last words, ho
ho :-)

Andrew



Re: Project RABLET

2006-06-23 Thread Andrew MacLeod
On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:

 You omitted the RTL loop optimizer passes, which still do quite a bit
 of work despite the tree-ssa loop passes.  Also if-conversion and some
 minor passes, though they are less relevant.

Which brings up a good discussion. I presume the rtl loop optimizers see
things exposed by addressing modes which aren't seen in the higher level
code. I wonder what the big gains are here... and if they are
detectable at expansion time...   

In general, I didnt mention anything that tends not to increase register
pressure, at least not in any significant manner as far as RABLET is
concerned.

 
  If expand is made much smarter, I would argue that much of GCSE and CSE
  isn't needed.  We've already performed those optimizations at  a high
  level, and we can hopefully do a lot of the factoring and things on
  addressing registers exposed during expand.  I'm sure there are other
  things to do, but I would argue that they are significantly less than a
  general purpose CSE and GCSE pass. And in the cases of high register
  pressure, how much would you want them to do anyway?  Its really these
  high register pressure areas that RABLET is attacking anyway.
 
 Here I think you are waving your hands a little too hard.  RTL level
 CSE is significant for handling common expressions exposed by address
 calculations and by DImode (and larger) computations.  On some
 processors giving up CSE on address calculations would be very
 painful.  There needs to be a plan to handle that.
 

Yes, there is some hand waving, mostly because I haven't gotten that far
in details yet. I expect to be able to do some of this type of commoning
at rtl generation time as things are generated. (much like RABLE's
spiller reuses spill loads nearby). That may turn out to be more
difficult than I anticipate however. Pain is in the implementation :-)

I am not proposing that CSE necessarily be eliminated *all* the time,
but in cases when register pressure is already excessively high, is
further commoning of DImode values going to make things better? Its
really this case I'm interested in evaluating since this is the case we
already have problems. if we don't spill, RABLET would effectively do
nothing.

Clearly there will be a lot of further investigation required once
implementation reaches this point. Ultimately CSE and all RTL
optimizations can be re-evaluated to see if things can be simplified.

 Also at present may vector calculations are not exposed at the tree
 level--they are hidden inside builtin functions until they are
 expanded--and vector heavy code can also have a lot of common
 subexpressions.
 

I have no plan at moment for vector operations :-). That could change,
but for now we'll have to keep whatever we do today for those.

 
  If I recall, scheduling is register pressure aware and normally doesn't
  increase register pressure dramatically. If it does increase pressure,
  well, this won't solve every problem after all.
 
 Unfortunately, scheduling is currently not register pressure aware at
 all.  The scheduler will gleefully increase register pressure.  That's
 why we don't even run the scheduler before register allocation on x86.
 

hum, too bad. for some reason I was under the impression that it at
least tried not to increase register pressure when it was above a
certain threshold value. Not running it at least means it wont increase
register pressure, so that works :-)
  
 
 Modulo the above comments, I don't see anything wrong with your basic
 idea.  But I also wonder whether you couldn't get a similar effect by
 forcing instruction selection to occur before register allocation.  If
 that is done well, reload will have much less work to do.
 

That was one of the premises of RABLE. Since out of ssa needs some TLC
and TER has been a wart for years, this seems like a good way of dealing
with those issues, and perhaps dealing with some significant RA issues
at the same time. (Anything to avoid actually rewriting RA eh!)


 One of the basic issues with the current code is not that we do
 register allocation well or poorly, but that reload takes the output
 of the register allocator and munges it unpredictably.  That's going
 to happen with your proposal as well.  It doesn't mean that your
 proposal won't improve things.  But no register allocator can do a
 good job when it can't make the final decisions.
 
Truer words have never been spoken. RABLET makes no attempt to do
anything about reload. It simply attempts to present the backend with
code that isn't full of excessive register pressure. If it turns out to
be something reload screws up today, it will continue to be screwed up.
I suspect a lot of the time we do have excessive spill, RABLET will show
benefit. 

Its clearly not as good as a new register allocator would be, but the
effort to benefit ratio ought to be a lot higher for RABLET than for a
register allocator rewrite.

Andrew



Re: Project RABLET

2006-06-23 Thread Andrew Pinski


On Jun 23, 2006, at 9:39 PM, Andrew MacLeod wrote:


On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:


You omitted the RTL loop optimizer passes, which still do quite a bit
of work despite the tree-ssa loop passes.  Also if-conversion and  
some

minor passes, though they are less relevant.


Which brings up a good discussion. I presume the rtl loop  
optimizers see
things exposed by addressing modes which aren't seen in the higher  
level

code. I wonder what the big gains are here... and if they are
detectable at expansion time...


The one rtl loop optimizer which has nothing to do with addressing  
modes and
loops is the doloop optimizer which is most likely possible to do  
expansion time
and is one of the few loop optimizer which lowers register pressure.   
The reason
why it lowers register pressure is because it makes use of a special  
register for

loops (at least on PowerPC).

Thanks,
Andrew Pinski 


Re: Project RABLET

2006-06-23 Thread Ian Lance Taylor
Andrew MacLeod [EMAIL PROTECTED] writes:

 On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:
 
  You omitted the RTL loop optimizer passes, which still do quite a bit
  of work despite the tree-ssa loop passes.  Also if-conversion and some
  minor passes, though they are less relevant.
 
 Which brings up a good discussion. I presume the rtl loop optimizers see
 things exposed by addressing modes which aren't seen in the higher level
 code. I wonder what the big gains are here... and if they are
 detectable at expansion time...   

One obvious gain is hoisting constants exposed by address expansion
out of loops.  Also once addressing modes are expanded, there are new
IVs.


 I am not proposing that CSE necessarily be eliminated *all* the time,
 but in cases when register pressure is already excessively high, is
 further commoning of DImode values going to make things better? Its
 really this case I'm interested in evaluating since this is the case we
 already have problems. if we don't spill, RABLET would effectively do
 nothing.

I think that even when pressure is high, it helps a lot to do CSE
after DImode values have been split up, as will be the case even today
for, e.g., DImode bitwise operations.  It tends to reduce register
pressure if anything.


As you say, none of these arguments that RABLET is a bad idea, they
are just arguments that we can't expect to remove the RTL passes
without a lot more work, whether or not they increase register
pressure.

One thing we could perhaps consider would be expanding addressing mode
calculations at the tree level.

Ian