On 05/21/2014 12:25 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Simpson does the Live range shrinking and reduction of register pressure by 
> using the computation that are not load and store but the arithmetic 
> computation. The computation
> where the operands and registers are live at the entry and exit of the basic 
> block but not touched inside the block then the computation is moved at the 
> end of the block the reducing the register pressure inside the block by one.  
> Extension of the Simpson work by extending the computation not being touched 
> inside the basic block to the spanning of the Loops. If the Live ranges spans 
> the Loops and live at the entry and exit of the Loop but the computation is 
> not being touched inside the Loops then the computation is moved after the 
> exit of the Loop. 
>
> REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE LOOPS
>
> for each Loop starting from inner to outer do the following
> begin
> RELIEFIN(i) = null if i is the entry of the cfg.
> Else
> For all predecessors j RELIEFOUT(j)
> RELIEFOUT(i) = RELIEFIN(i) exposed union relief
> INSERT(I,j) = RELIEFOUT(i) RELIEFIN(i) Intersection
> Live(i)
> end
>
> The Simpson approach does takes the nesting depth into consideration of 
> placing the computation and the relieve of the register pressure. Simpson 
> approach doesn't takes into
> consideration the computation which spans throughout the loop and the 
> operands and results are live at the entry of the Loop and exit of the Loop 
> but not touched inside the Loops can be useful in reduction of register 
> pressure inside the Loops. This  approach will be useful in Region Based 
> Register Allocator for Live Range Splitting at the Region Boundaries.
>
> Extension of the Simpson approach is to consider the data flow analysis with 
> respect to the given Loop rather than having it for entire control flow 
> graph. This data flow analysis starts from the inner loop and extends it to 
> the outer loop. If the reference is not through the nested depth or with some 
> depth then the computation can be placed accordingly. For register allocator 
> by Graph coloring the live ranges that are with respect to operands and 
> results of the computation are taken into consideration and for the above 
> approach put into the stack during simplification phase of Graph Coloring so 
> that there is a chance of getting such Live ranges colorable and thus reduces 
> the register pressure. This is extended to splitting
> approach based on containment of Live ranges
>
> OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE EXIT LOOPS
>
> The placement of the computation to reduce the register pressure for Single 
> Entry and Multiple exit by Simpson approach lead to unoptimal solution. The 
> unoptimal Solution
> is because of the exit node of the loop does not post dominates all the basic 
> block inside the Loops. Due to this the placement of the computation just 
> after the tail block of the Loop will lead to incorrect results. In order to 
> perform the Optimal Solution of the placement of the computation, the 
> computation needs to be placed the block just after all the exit points of 
> the Loop reconverge and which will post dominates all the blocks of the 
> Loops. This will take care of reducing the register pressure for the Loops 
> that are single Entry and Multiple Exit. For irreducible Loops the 
> optimization to convert to reducible is done before the register allocation 
> that reduces the register pressure and will be applicable to structured 
> control flow and thus reduces the register pressure.
>
> The Live range shrinkage reducing register pressure takes load and store into 
> consideration but not computation as proposed by Simpson. I am proposing to 
> extend in GCC for the computation to reduce register pressure and for the 
> Loop as given above for both Single Entry and Single Exit and Single Entry 
> and Multiple Exit Loops.
>
>
Thanks for sharing with this.  I have plans to work on a new register
pressure relief pass too this summer (more accurately this work is in my
company plans -- so I should work on this anyway).  It will use
Simpson's approach also.  I prefer to do it as a separate pass not as a
part of IRA because IRA is already complicated.  It also permits to
rematerialize not only on loop  borders (although it is the most
important points).

It is hard for me to say what approach will be better as there are too
many transformations even after IRA (e.g. in IRA you can think that
pseudos got hard registers and rematerilize from that but LRA may spill
this pseudos and it is more risk of this for x86/x86-64 than for other
architectures as x86/x86-64 uses irregular reg file and has smaller
number of regs).

So we could implement two approaches and choose the best if you want.

As for optimal placement of the computation for single entry/multiple
exit loops, I don't think it is really important.  IRA already put one
copy ld/st if the exits are to the same basic block (the most common
case of such loops).  So you can use the same approach.  But I guess
only benchmarking can tell the importance of this.

If you are really interesting in doing some RA project.  I can tell that
I am definitely not going to implement bitwidth-wise RA (putting and
using several scalar values into one wide-register, e.g. a vector reg). 
It might be benefitial for some cpus/architectures.  But I should say it
is a pretty big project (at least bigger what you are proposing).

Reply via email to