https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #4 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Feng Xue from comment #0)
> Current regional RA uses a top-down allocation order, which may not properly
> split a long live range that crosses sub-region with high register pressure. 
> 
> In the following graph, lv0 is live in whole outer region, and suppose inner
> region is under high register pressure due to lots of live ranges inside it.
> 
> According to RA algorithm, out region is processed firstly, lv0 is picked up
> as spill candidate. And then turn to inner region, also the part of lv0 in
> inner region is marked as being spilled. Finally result is that the whole
> lv0 should be spilled. But if in area excluding inner region, there is with
> low register pressure, we can get a better choice to place lv0 in register
> instead of memory, and only spill/reload lv0 at boundary of
> entry-into(A)/exist-from(B) inner region. In other word, inner region
> boundary are split points for lv0.
> 
> 
>                |  
>        outer   | lv0 
>        region  | __________  split point
>                |/
>     .----------A---------------.
>     |          |               |
>     |          |  |         |  |
>     |  inner   |  | lv1     |  | 
>     |  region  |  |         |  |
>     |          |  |     lv2 |  |
>     |          |  |         |  |
>     |          |               |
>     '----------B---------------'
>                |\__________
>                |            split point
>                |
> 
> Here is an example to show this. gcc produces bad spills as we point
> out(-m64 -O3, for x86-64), but llvm generates better spill/reload as we
> expect.
> 

  Thank you for your detailed report and proposed solutions.

  I would say that top-down algorithm behaves better than bottom-up one.  I
implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years ago
and it behaved worse than the current one.  I think adding an additional pass
over regions probably would solve the problem but it will complicate already
complicated RA (which is about 60K lines now). In any case I'll think about
this problem.

  The problem you are mentioning in this code is quite insignificant (one
memory access in the top most region).

  I also checked the attachments.  What I see is GCC generates a better big
loop body (a loop with high register pressure).  GCC generated loop body has 50
x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM.

>   int value[20];
> 
>   int user0;
>   int user1;
>   int user2[100];
>   int user3;
> 
>   int fncall(void);
> 
>   void foo(int cond)
>   {
>       int lv_out = value[0];
>       int i;
> 
>       user0 = lv_out;   /* Better to place lv_out in register. */
> 
>       if (cond) {
>           int sum = 0;
>           int lv_in_1 = value[1];
>           int lv_in_2 = value[2];
>           int lv_in_3 = value[3];
>           int lv_in_4 = value[4];
>           int lv_in_5 = value[5];
>           int lv_in_6 = value[6];
>           int lv_in_7 = value[7];
>           int lv_in_8 = value[8];
>           int lv_in_9 = value[9];
>           int lv_in_10 = value[10];
>           int lv_in_11 = value[11];
>           int lv_in_12 = value[12];
>           int lv_in_13 = value[13];
>           int lv_in_14 = value[14];
>           int lv_in_15 = value[15];
> 
>           /* Better to spill lv_out here */
> 
>           for (i = 0; i < 1000; i++) {
>               sum += lv_in_1;
>               sum += lv_in_2;
>               sum += lv_in_3;
>               sum += lv_in_4;
>               sum += lv_in_5;
>               sum += lv_in_6;
>               sum += lv_in_7;
>               sum += lv_in_8;
>               sum += lv_in_9;
>               sum += lv_in_10;
>               sum += lv_in_11;
>               sum += lv_in_12;
>               sum += lv_in_13;
>               sum += lv_in_14;
>               sum += lv_in_15;
> 
>               fncall();
> 
>               lv_in_1 ^= i;
>               lv_in_2 ^= i;
>               lv_in_3 ^= i;
>               lv_in_4 ^= i;
>               lv_in_5 ^= i;
>               lv_in_6 ^= i;
>               lv_in_7 ^= i;
>               lv_in_8 ^= i;
>               lv_in_9 ^= i;
>               lv_in_10 ^= i;
>               lv_in_11 ^= i;
>               lv_in_12 ^= i;
>               lv_in_13 ^= i;
>               lv_in_14 ^= i;
>               lv_in_15 ^= i;
>           }
> 
>           /* Better to reload lv_out here */
> 
>           user1 = sum;
>       }
> 
>       for (i = 0; i < 100; i++) {
>           user2[i ^ 100] = lv_out; /* Better to place lv_out in register */
>       }
>     
>       user3 = lv_out;  /* Better to place lv_out in register */
>   }
> 
> 
> For top-down allocation, we can only adjust inner region allocation result,
> but no way to refine decision that has been made on outside live-range, it
> is an intrinsic weakness of the top-down algorithm. To fix that, we may need
> to add a new pass to explicitly split live ranges based on region boundary,
> or adopt a reverse means, from inner to outer to perform allocation.

Reply via email to