[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2020-06-25 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #13 from Vladimir Makarov  ---
(In reply to Tamar Christina from comment #12)
> (In reply to Vladimir Makarov from comment #11)
> >   I just expressed my point of view to the bottom-up approach.  If somebody
> > implements any new RA approach which at least does not hurt credible
> > benchmarks (e.g. SPEC) and improve some benchmarks and does not complicate
> > existing RA too much, nobody will have legitimate arguments not to include
> > the new code into GCC.
> > 
> >   I think that may be for some cases bottom-up approach could work better. 
> > Probably this is code for number crunching (with a lot of loop iterations). 
> > For some cases top-down approach works better for loops with smaller number
> > iterations  (e.g. most loops in GCC itself).
> > 
> 
> I don't know much about the current RA implementation, but would it be
> possible you think to have this be heuristics driven? or do you think it
> would be too expensive to try multiple strategies and keep the one that
> works best?

Probably trying multiple strategies will be too expensive.  Also compile-time
evaluation of allocation cost is not that accurate, especially taking into
account that the local register allocator can change global RA decisions and
the changes are unpredictable.

  If the top-down algorithm works better for some cases, we could get some
experience when it works and try to find heuristics to choose the right
algorithm.  I think it would be more realistic approach.  It sounds like a case
for ML but I am sure ML will not work.  It is not a smooth function, besides ML
create a lot of problems with GCC building (benchmark set, costly building,
valdiation and using ML model etc).  So heuristics built on understanding is a
way to go.  But I speculated too much because there is no bottom-up RA
implementation with finding its advantages (if there are any) and its
disadvantages.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2020-06-25 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #12 from Tamar Christina  ---
(In reply to Vladimir Makarov from comment #11)
> (In reply to Tamar Christina from comment #10)
> > Hi Vlad,
> > 
> > Just curious if you had a chance to think about an approach to this that
> > would be acceptable.
> 
>   Sorry for not working on this issue more although I thought about the
> problem for some time w/o finding any possible small code changes could
> solve the problem.
> 

Ah, hmm that's too bad but thanks for looking into it!

>   I just expressed my point of view to the bottom-up approach.  If somebody
> implements any new RA approach which at least does not hurt credible
> benchmarks (e.g. SPEC) and improve some benchmarks and does not complicate
> existing RA too much, nobody will have legitimate arguments not to include
> the new code into GCC.
> 
>   I think that may be for some cases bottom-up approach could work better. 
> Probably this is code for number crunching (with a lot of loop iterations). 
> For some cases top-down approach works better for loops with smaller number
> iterations  (e.g. most loops in GCC itself).
> 

I don't know much about the current RA implementation, but would it be possible
you think to have this be heuristics driven? or do you think it would be too
expensive to try multiple strategies and keep the one that works best?

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2020-06-24 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #11 from Vladimir Makarov  ---
(In reply to Tamar Christina from comment #10)
> Hi Vlad,
> 
> Just curious if you had a chance to think about an approach to this that
> would be acceptable.

  Sorry for not working on this issue more although I thought about the problem
for some time w/o finding any possible small code changes could solve the
problem.

  I just expressed my point of view to the bottom-up approach.  If somebody
implements any new RA approach which at least does not hurt credible benchmarks
(e.g. SPEC) and improve some benchmarks and does not complicate existing RA too
much, nobody will have legitimate arguments not to include the new code into
GCC.

  I think that may be for some cases bottom-up approach could work better. 
Probably this is code for number crunching (with a lot of loop iterations). 
For some cases top-down approach works better for loops with smaller number
iterations  (e.g. most loops in GCC itself).

  Simply, I tried already bottom-up approach and don't want to work on this
again because if it does not work I will have to throw this work off again.

  I am mostly in maintenance mode for GCC RA, don't want to work on big RA
changes.  This work would be also very hard to sell to my management.

  If somebody implements bottom-up RA as an optional (or default if it is
better on SPEC) algorithm and this improves some benchmarks, I will not object.
 The work I think would take several months for a person familiar with RA.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2020-06-24 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #10 from Tamar Christina  ---
Hi Vlad,

Just curious if you had a chance to think about an approach to this that would
be acceptable.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #9 from Feng Xue  ---
(In reply to Feng Xue from comment #5)
> >   I would say that top-down algorithm behaves better than bottom-up one.  I
> > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years
> > ago and it behaved worse than the current one.  I think adding an additional
> Intuitively, it's better to give preference to regions with higher frequency
> to let them pick up registers earlier and with more freedom, which matches
> with bottom-up coloring strategy.
> 
> > pass over regions probably would solve the problem but it will complicate
> > already complicated RA (which is about 60K lines now). In any case I'll
> > think about this problem.
> > 
> >   The problem you are mentioning in this code is quite insignificant (one
> > memory access in the top most region).
> But it will be significant inside multi-level loop. Actually, the problem is
> exposed by a 8-level loop in a real Fortran application. 
> 
> >   
> >   I also checked the attachments.  What I see is GCC generates a better big
> > loop body (a loop with high register pressure).  GCC generated loop body has
> > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for 
> > LLVM.
> As far as how to spilling live range at loop boundary is concerned, gcc is
> not very efficient. To make loop body not be the focus, when we remove two
> live-range variables in original case, we can get same amount of memory
> accesses for gcc and llvm, both of which generates 9 register spills in loop
> body. I attached modified case and assembly files.

For the modified case, allocation of llvm and gcc remain nearly the same for
loop body, but as to 'lv0', llvm gets better spill/split result over gcc.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #8 from Feng Xue  ---
Created attachment 46294
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46294=edit
asm file 2 generated by llvm

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #7 from Feng Xue  ---
Created attachment 46293
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46293=edit
asm file 2 generated by gcc

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #6 from Feng Xue  ---
Created attachment 46292
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46292=edit
test case with less live ranges

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-05 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #5 from Feng Xue  ---
>   I would say that top-down algorithm behaves better than bottom-up one.  I
> implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years
> ago and it behaved worse than the current one.  I think adding an additional
Intuitively, it's better to give preference to regions with higher frequency to
let them pick up registers earlier and with more freedom, which matches with
bottom-up coloring strategy.

> pass over regions probably would solve the problem but it will complicate
> already complicated RA (which is about 60K lines now). In any case I'll
> think about this problem.
> 
>   The problem you are mentioning in this code is quite insignificant (one
> memory access in the top most region).
But it will be significant inside multi-level loop. Actually, the problem is
exposed by a 8-level loop in a real Fortran application. 

>   
>   I also checked the attachments.  What I see is GCC generates a better big
> loop body (a loop with high register pressure).  GCC generated loop body has
> 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM.
As far as how to spilling live range at loop boundary is concerned, gcc is not
very efficient. To make loop body not be the focus, when we remove two
live-range variables in original case, we can get same amount of memory
accesses for gcc and llvm, both of which generates 9 register spills in loop
body. I attached modified case and assembly files.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-05-03 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #4 from Vladimir Makarov  ---
(In reply to Feng Xue from comment #0)
> Current regional RA uses a top-down allocation order, which may not properly
> split a long live range that crosses sub-region with high register pressure. 
> 
> In the following graph, lv0 is live in whole outer region, and suppose inner
> region is under high register pressure due to lots of live ranges inside it.
> 
> According to RA algorithm, out region is processed firstly, lv0 is picked up
> as spill candidate. And then turn to inner region, also the part of lv0 in
> inner region is marked as being spilled. Finally result is that the whole
> lv0 should be spilled. But if in area excluding inner region, there is with
> low register pressure, we can get a better choice to place lv0 in register
> instead of memory, and only spill/reload lv0 at boundary of
> entry-into(A)/exist-from(B) inner region. In other word, inner region
> boundary are split points for lv0.
> 
> 
>|  
>outer   | lv0 
>region  | __  split point
>|/
> .--A---.
> |  |   |
> |  |  | |  |
> |  inner   |  | lv1 |  | 
> |  region  |  | |  |
> |  |  | lv2 |  |
> |  |  | |  |
> |  |   |
> '--B---'
>|\__
>|split point
>|
> 
> Here is an example to show this. gcc produces bad spills as we point
> out(-m64 -O3, for x86-64), but llvm generates better spill/reload as we
> expect.
> 

  Thank you for your detailed report and proposed solutions.

  I would say that top-down algorithm behaves better than bottom-up one.  I
implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years ago
and it behaved worse than the current one.  I think adding an additional pass
over regions probably would solve the problem but it will complicate already
complicated RA (which is about 60K lines now). In any case I'll think about
this problem.

  The problem you are mentioning in this code is quite insignificant (one
memory access in the top most region).

  I also checked the attachments.  What I see is GCC generates a better big
loop body (a loop with high register pressure).  GCC generated loop body has 50
x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM.

>   int value[20];
> 
>   int user0;
>   int user1;
>   int user2[100];
>   int user3;
> 
>   int fncall(void);
> 
>   void foo(int cond)
>   {
>   int lv_out = value[0];
>   int i;
> 
>   user0 = lv_out;   /* Better to place lv_out in register. */
> 
>   if (cond) {
>   int sum = 0;
>   int lv_in_1 = value[1];
>   int lv_in_2 = value[2];
>   int lv_in_3 = value[3];
>   int lv_in_4 = value[4];
>   int lv_in_5 = value[5];
>   int lv_in_6 = value[6];
>   int lv_in_7 = value[7];
>   int lv_in_8 = value[8];
>   int lv_in_9 = value[9];
>   int lv_in_10 = value[10];
>   int lv_in_11 = value[11];
>   int lv_in_12 = value[12];
>   int lv_in_13 = value[13];
>   int lv_in_14 = value[14];
>   int lv_in_15 = value[15];
> 
>   /* Better to spill lv_out here */
> 
>   for (i = 0; i < 1000; i++) {
>   sum += lv_in_1;
>   sum += lv_in_2;
>   sum += lv_in_3;
>   sum += lv_in_4;
>   sum += lv_in_5;
>   sum += lv_in_6;
>   sum += lv_in_7;
>   sum += lv_in_8;
>   sum += lv_in_9;
>   sum += lv_in_10;
>   sum += lv_in_11;
>   sum += lv_in_12;
>   sum += lv_in_13;
>   sum += lv_in_14;
>   sum += lv_in_15;
> 
>   fncall();
> 
>   lv_in_1 ^= i;
>   lv_in_2 ^= i;
>   lv_in_3 ^= i;
>   lv_in_4 ^= i;
>   lv_in_5 ^= i;
>   lv_in_6 ^= i;
>   lv_in_7 ^= i;
>   lv_in_8 ^= i;
>   lv_in_9 ^= i;
>   lv_in_10 ^= i;
>   lv_in_11 ^= i;
>   lv_in_12 ^= i;
>   lv_in_13 ^= i;
>   lv_in_14 ^= i;
>   lv_in_15 ^= i;
>   }
> 
>   /* Better to reload lv_out here */
> 
>   user1 = sum;
>   }
> 
>   for (i = 0; i < 100; i++) {
>   user2[i ^ 100] = lv_out; /* Better to place lv_out in register */
>   }
> 
>   user3 = lv_out;  /* Better to place lv_out in register */
>   }
> 
> 
> For top-down allocation, we can only adjust inner region allocation result,
> but no way to refine decision that has been made on outside live-range, it
> is an intrinsic weakness of the top-down algorithm. To fix that, we may need
> to add a new 

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-24 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #3 from Feng Xue  ---
Created attachment 46237
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46237=edit
test case for aarch64

Add another case composed for aarch64.

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #2 from Feng Xue  ---
Created attachment 46219
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46219=edit
asm file generated by llvm

[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order

2019-04-21 Thread fxue at os dot amperecomputing.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174

--- Comment #1 from Feng Xue  ---
Created attachment 46218
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46218=edit
asm file generated by gcc