On  Friday, May 23, 2014 1:46 AM Vladimir Makarov wrote:
On 05/21/2014 12:25 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Simpson does the Live range shrinking and reduction of register 
> pressure by using the computation that are not load and store but the 
> arithmetic computation. The computation where the operands and registers are 
> live at the entry and exit of the basic block but not touched inside the 
> block then the computation is moved at the end of the block the reducing the 
> register pressure inside the block by one.  Extension of the Simpson work by 
> extending the computation not being touched inside the basic block to the 
> spanning of the Loops. If the Live ranges spans the Loops and live at the 
> entry and exit of the Loop but the computation is not being touched inside 
> the Loops then the computation is moved after the exit of the Loop.
>
> REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE 
> LOOPS
>
> for each Loop starting from inner to outer do the following begin
> RELIEFIN(i) = null if i is the entry of the cfg.
> Else
> For all predecessors j RELIEFOUT(j)
> RELIEFOUT(i) = RELIEFIN(i) exposed union relief
> INSERT(I,j) = RELIEFOUT(i) RELIEFIN(i) Intersection
> Live(i)
> end
>
> The Simpson approach does takes the nesting depth into consideration 
> of placing the computation and the relieve of the register pressure. Simpson 
> approach doesn't takes into consideration the computation which spans 
> throughout the loop and the operands and results are live at the entry of the 
> Loop and exit of the Loop but not touched inside the Loops can be useful in 
> reduction of register pressure inside the Loops. This  approach will be 
> useful in Region Based Register Allocator for Live Range Splitting at the 
> Region Boundaries.
>
> Extension of the Simpson approach is to consider the data flow 
> analysis with respect to the given Loop rather than having it for 
> entire control flow graph. This data flow analysis starts from the 
> inner loop and extends it to the outer loop. If the reference is not 
> through the nested depth or with some depth then the computation can 
> be placed accordingly. For register allocator by Graph coloring the 
> live ranges that are with respect to operands and results of the 
> computation are taken into consideration and for the above approach 
> put into the stack during simplification phase of Graph Coloring so 
> that there is a chance of getting such Live ranges colorable and thus 
> reduces the register pressure. This is extended to splitting approach 
> based on containment of Live ranges
>
> OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE 
> EXIT LOOPS
>
> The placement of the computation to reduce the register pressure for 
> Single Entry and Multiple exit by Simpson approach lead to unoptimal 
> solution. The unoptimal Solution is because of the exit node of the loop does 
> not post dominates all the basic block inside the Loops. Due to this the 
> placement of the computation just after the tail block of the Loop will lead 
> to incorrect results. In order to perform the Optimal Solution of the 
> placement of the computation, the computation needs to be placed the block 
> just after all the exit points of the Loop reconverge and which will post 
> dominates all the blocks of the Loops. This will take care of reducing the 
> register pressure for the Loops that are single Entry and Multiple Exit. For 
> irreducible Loops the optimization to convert to reducible is done before the 
> register allocation that reduces the register pressure and will be applicable 
> to structured control flow and thus reduces the register pressure.
>
> The Live range shrinkage reducing register pressure takes load and store into 
> consideration but not computation as proposed by Simpson. I am proposing to 
> extend in GCC for the computation to reduce register pressure and for the 
> Loop as given above for both Single Entry and Single Exit and Single Entry 
> and Multiple Exit Loops.
>
>
>>Thanks for sharing with this.  I have plans to work on a new register 
>>pressure relief pass too this summer (more accurately this work is in my 
>>company plans -- so I should work on this >>anyway).  It will use Simpson's 
>>approach also.  I prefer to do it as a separate pass not as a part of IRA 
>>because IRA is already complicated.  It also permits to rematerialize not 
>>only on loop  >>borders (although it is the most important points).

>>It is hard for me to say what approach will be better as there are too many 
>>transformations even after IRA (e.g. in IRA you can think that pseudos got 
>>hard registers and rematerilize >>from that but LRA may spill this pseudos 
>>and it is more risk of this for x86/x86-64 than for other architectures as 
>>x86/x86-64 uses irregular reg file and has smaller number of regs).

>>So we could implement two approaches and choose the best if you want.


>>As for optimal placement of the computation for single entry/multiple exit 
>>loops, I don't think it is really important.  IRA already put one copy ld/st 
>>if the exits are to the same basic block >>(the most common case of such 
>>loops).  So you can use the same approach.  But I guess only benchmarking can 
>>tell the importance of this.

>>If you are really interesting in doing some RA project.  I can tell that I am 
>>definitely not going to implement bitwidth-wise RA (putting and using several 
>>scalar values into one wide->>register, e.g. a vector reg). 
>>It might be benefitial for some cpus/architectures.  But I should say it is a 
>>pretty big project (at least bigger what you are proposing).

Thanks for considering the proposal and its good to know that you have plans to 
implement this. Simpson approach
 is much more powerful with the context of rematerialization and the reduction 
of register pressure.
 
Regarding the Bit width-wise register allocator, I have already gone through 
the following papers:

Bit width Aware Global Register Allocation Sriraman Tallam Rajiv Gupta.
Enhanced Bit width-Aware Register Allocation. Rajkishore Barik and Vivek Sarkar.
Improved bit width-Aware Variable Packing V. KRISHNA NANDIVADA, IIT Madras 
RAJKISHORE BARIK, Intel Labs.
Bitwidth Analysis with application to Silicon Compilation.  Mark Stephenson and 
Saman Amarasinghe.

All the papers concerning the Bitwidth -wise Register Allocator does the 
following

1. Bitwidth representation and Analysis. This involves the construction of Live 
ranges with respect to bitwidth
 of variable at different program points as the live bits changes at different 
program points. S. Tallam and 
Gupta uses the partitioning of the variables in the leading bit, live bits and 
trailing bits and does the forward
 and backward data Analysis where the zero bits are populated with respect to 
forward Data Flow and dead bits
 with backward Data flow.
2.Interference Graph. The interference graph are augmented with the weights for 
the subword bit width
 that changes at different program points.
3. Variable Packing and Coalescing. This is the important step where the 
packing is done on top of coalescing.
 As the coalescing the merges the two nodes in the Interference Graph as they 
don't interfere. And the
 packing packs the interference variables. Gupta approach uses minimum width 
estimate for packing interfering
 nodes in the wide register till the subword fits into the wide register.

Recent paper by Barik's proposes the aggressive coalescing done first and then 
packs into wide register based 
on power of two approximation. They have used Bitwidth SSA variables for 
bitwidth analysis and then does the
 optimal packing by doing aggressive coalescing done first with respect to 
w-ssa variables for non-interfering
 nodes and then packing is done to pack subword into the wide register  for 
interfering variables on Power 
of two approximation(POTR). The main advantage of w-ssa variables is with 
respect to def and use of the
 actual bits remain constants and don't change. None of the approach does 
guarantee this
.
With respect to packing scalar variables into vector register we can use the 
Barik's approach of doing
 aggressive coalescing done first with respect to non-interfering nodes and 
then does the packing of 
interfering nodes based on greedy approach  which looks feasible approach for 
packing the scalar values
 into wide vector register along with bitwidth Analysis. The bitwidth Analysis 
can be extended to the
 scalar variables and the array variables with respect to loops. This mainly 
emphasize on the loop index 
variable which have data values ranges and can be packed into the wide register 
which will reduce the
 register pressure.

Please let me know what do you think.

Thanks & Regards
Ajit 

Reply via email to