subject:"\[lra\] patch to solve most scalability problems for LRA"

Re: [lra] patch to solve most scalability problems for LRA

2012-10-10 Thread Steven Bosscher

On Wed, Oct 10, 2012 at 10:14 PM, Vladimir Makarov  wrote:
>   It is also interesting that your IRA range patch results in
> different code generation (i can not explain it too now). I saw the same
> on a small test (black jack playing and betting strategy).

I haven't looked into this, but I'm guessing this is because functions
whose semantics depend on the number of program points, like
build_conflict_bit_table, will behave differently.

For example, with my patch there are significantly fewer program
points and as a result OBJECT_MIN/OBJECT_MAX take on smaller values.
For build_conflict_bit_table this leads to a different cost metric to
decide whether the conflict table is small enough to build. Probably
other functions respond similarly to my patch, but I don't know IRA
well enough to say :-)

In any case, bootstrap&test passed (with and without checking enabled)
on x86_64-unknown-linux-gnu and powerpc64-unknown-linux-gnu so I
intend to commit the patch (with a few comment typos fixed) today or
tomorrow.

Ciao!
Steven

Re: [lra] patch to solve most scalability problems for LRA

2012-10-10 Thread Vladimir Makarov


On 12-10-10 10:53 AM, Steven Bosscher wrote:

On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov wrote:

   The following patch solves most of LRA scalability problems.

   It switches on simpler algorithms in LRA.  The first it switches off
trying to reassign hard registers to spilled pseudos (they usually for such
huge functions have long live ranges -- so the possibility to assign
them something very small but trying to reassign them a hard registers
is to expensive), inheritance, live range splitting, and memory
coalescing optimizations.  It seems that rematerialization is too
important for performance -- so I don't switch it off.  As splitting is
also necessary for generation of caller saves code, I switch off
caller-saves in IRA and force IRA to do non-regional RA.

Hi Vlad,

I've revisited this patch now that parts of the scalability issues
have been resolved. Something funny happened for our
soon-to-be-legendary PR54146 test case...

lra-branch yesterday (i.e. without the elimination and constraints
speedup patches):
  integrated RA   : 145.26 (18%)
  LRA non-specific:  46.94 ( 6%)
  LRA virtuals elimination:  51.56 ( 6%)
  LRA reload inheritance  :   0.03 ( 0%)
  LRA create live ranges  :  46.67 ( 6%)
  LRA hard reg assignment :   0.55 ( 0%)

lra-branch today + ira-speedup-1.diff:
  integrated RA   : 111.19 (15%) usr
  LRA non-specific:  21.16 ( 3%) usr
  LRA virtuals elimination:   0.65 ( 0%) usr
  LRA reload inheritance  :   0.01 ( 0%) usr
  LRA create live ranges  :  56.33 ( 8%) usr
  LRA hard reg assignment :   0.58 ( 0%) usr

lra-branch today + ira-speedup-1.diff + rm-lra_simple_p.diff:
  integrated RA   :  89.43 (11%) usr
  LRA non-specific:  21.43 ( 3%) usr
  LRA virtuals elimination:   0.61 ( 0%) usr
  LRA reload inheritance  :   6.10 ( 1%) usr
  LRA create live ranges  :  88.64 (11%) usr
  LRA hard reg assignment :  45.17 ( 6%) usr
  LRA coalesce pseudo regs:   2.24 ( 0%) usr

Note how IRA is *faster* without the lra_simple_p patch. The cost
comes back in "LRA hard reg assignment" and "LRA create live ranges"
where I assume the latter is a consequence of running
lra_create_live_ranges a few more times to work for the hard-reg
assignment phase.

Do you have an idea why IRA might be faster without the lra_simple_p
thing? Maybe there's a way to get the best of both...


  I have no idea.

  I can not confirm it on an Intel Corei7 machine.  Here is my timing.
Removing lra_simple_p makes the worst compilation time, but the best
code size.

  It is also interesting that your IRA range patch results in
different code generation (i can not explain it too now). I saw the same
on a small test (black jack playing and betting strategy).

  Another interesting thing is that IRA times are the same (with and
without simplified allocation for LRA).

--- branch this morning
 integrated RA   :  48.41 (13%) usr   0.25 ( 3%) sys  48.72 
(13%) wall  223608 kB (19%) ggc
 LRA non-specific:  14.47 ( 4%) usr   0.15 ( 2%) sys  14.57 ( 
4%) wall   41443 kB ( 4%) ggc
 LRA virtuals elimination:   0.40 ( 0%) usr   0.00 ( 0%) sys   0.41 ( 
0%) wall   36037 kB ( 3%) ggc
 LRA reload inheritance  :   0.14 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 
0%) wall1209 kB ( 0%) ggc
 LRA create live ranges  :  17.37 ( 5%) usr   0.21 ( 3%) sys  17.56 ( 
5%) wall5182 kB ( 0%) ggc
 LRA hard reg assignment :   1.77 ( 0%) usr   0.02 ( 0%) sys   1.76 ( 
0%) wall   0 kB ( 0%) ggc
 LRA coalesce pseudo regs:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 
0%) wall   0 kB ( 0%) ggc
real=377.25 user=367.58 system=8.36 share=99%% maxrss=33540720 ins=280 
outs=92544 mfaults=4448012 waits=17

   textdata bss dec hex filename
6395340  16 607 6395963  61983b s.o
--- branch this morning + ira range patch
 integrated RA   :  36.03 (10%) usr   0.03 ( 0%) sys  36.20 
(10%) wall  223608 kB (19%) ggc
 LRA non-specific:  14.57 ( 4%) usr   0.14 ( 2%) sys  14.89 ( 
4%) wall   41453 kB ( 4%) ggc
 LRA virtuals elimination:   0.36 ( 0%) usr   0.01 ( 0%) sys   0.41 ( 
0%) wall   36040 kB ( 3%) ggc
 LRA reload inheritance  :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 
0%) wall1210 kB ( 0%) ggc
 LRA create live ranges  :  17.36 ( 5%) usr   0.21 ( 3%) sys  17.53 ( 
5%) wall5184 kB ( 0%) ggc
 LRA hard reg assignment :   1.78 ( 1%) usr   0.02 ( 0%) sys   1.79 ( 
0%) wall   0 kB ( 0%) ggc
 LRA coalesce pseudo regs:   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 
0%) wall   0 kB ( 0%) ggc
 TOTAL : 351.82 7.50 360.52
1149460 kB
real=362.68 user=353.65 system=7.84 share=99%% maxrss=33540432 ins=224 
outs=92544 mfaults=4073281 waits=17

   textdata bss dec hex filename
6395424  16 607 6396047  61988f s.o
--- branch this morning + ira range patch + removing lra_simple_p
 integrated RA   :  37.87 ( 9%) usr   0.14 ( 2%) sys  38.30 ( 
9%) wall  744114 kB (45%) ggc
 LRA non-specific

Re: [lra] patch to solve most scalability problems for LRA

2012-10-10 Thread Steven Bosscher

On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov wrote:
>   The following patch solves most of LRA scalability problems.
>
>   It switches on simpler algorithms in LRA.  The first it switches off
> trying to reassign hard registers to spilled pseudos (they usually for such
> huge functions have long live ranges -- so the possibility to assign
> them something very small but trying to reassign them a hard registers
> is to expensive), inheritance, live range splitting, and memory
> coalescing optimizations.  It seems that rematerialization is too
> important for performance -- so I don't switch it off.  As splitting is
> also necessary for generation of caller saves code, I switch off
> caller-saves in IRA and force IRA to do non-regional RA.

Hi Vlad,

I've revisited this patch now that parts of the scalability issues
have been resolved. Something funny happened for our
soon-to-be-legendary PR54146 test case...

lra-branch yesterday (i.e. without the elimination and constraints
speedup patches):
 integrated RA   : 145.26 (18%)
 LRA non-specific:  46.94 ( 6%)
 LRA virtuals elimination:  51.56 ( 6%)
 LRA reload inheritance  :   0.03 ( 0%)
 LRA create live ranges  :  46.67 ( 6%)
 LRA hard reg assignment :   0.55 ( 0%)

lra-branch today + ira-speedup-1.diff:
 integrated RA   : 111.19 (15%) usr
 LRA non-specific:  21.16 ( 3%) usr
 LRA virtuals elimination:   0.65 ( 0%) usr
 LRA reload inheritance  :   0.01 ( 0%) usr
 LRA create live ranges  :  56.33 ( 8%) usr
 LRA hard reg assignment :   0.58 ( 0%) usr

lra-branch today + ira-speedup-1.diff + rm-lra_simple_p.diff:
 integrated RA   :  89.43 (11%) usr
 LRA non-specific:  21.43 ( 3%) usr
 LRA virtuals elimination:   0.61 ( 0%) usr
 LRA reload inheritance  :   6.10 ( 1%) usr
 LRA create live ranges  :  88.64 (11%) usr
 LRA hard reg assignment :  45.17 ( 6%) usr
 LRA coalesce pseudo regs:   2.24 ( 0%) usr

Note how IRA is *faster* without the lra_simple_p patch. The cost
comes back in "LRA hard reg assignment" and "LRA create live ranges"
where I assume the latter is a consequence of running
lra_create_live_ranges a few more times to work for the hard-reg
assignment phase.

Do you have an idea why IRA might be faster without the lra_simple_p
thing? Maybe there's a way to get the best of both...

Ciao!
Steven

Re: [lra] patch to solve most scalability problems for LRA

2012-10-04 Thread Richard Guenther

On Thu, Oct 4, 2012 at 7:07 PM, Steven Bosscher  wrote:
> On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov  wrote:
>>   The only issue now is PR54146 compilation time for IRA+LRA although it
>> was improved significantly.  I will continue work on PR54146.  But now I
>> am going to focus on proposals from reviews.
>
> Right, there still are opportunities to improve things.
>
> (The real solution may be to stop SRA from creating so many
> simultaneously live pseudos in the first place...)
>
>> +  lra_simple_p
>> += (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);
>
> I think you should use n_basic_blocks here instead of
> last_basic_block, in case this runs without compacting the cfg first
> (n_basic_blocks is the real number of basic blocks in the cfg,
> last_basic_block is the highest index, so last_basic_block >=
> n_basic_blocks).

I also noticed that switching to IRA_REGION_ONE improves things
when we have a large number of loops (profile points to some
loop code in IRA).  Note that the magic number above should
be a new --param, and once we have a diagnostic flag that
shows whenever we back off like this it should notify the
user of that fact (and the params we have overflown) - this
just reminded me of that idea from somebody else ;)

> Thanks for working on this!

Indeed ;)

It, btw, also applies to IRA + reload ...

Richard.

> Ciao!
> Steven

Re: [lra] patch to solve most scalability problems for LRA

2012-10-04 Thread Steven Bosscher

On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov  wrote:
>   The only issue now is PR54146 compilation time for IRA+LRA although it
> was improved significantly.  I will continue work on PR54146.  But now I
> am going to focus on proposals from reviews.

Right, there still are opportunities to improve things.

(The real solution may be to stop SRA from creating so many
simultaneously live pseudos in the first place...)

> +  lra_simple_p
> += (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);

I think you should use n_basic_blocks here instead of
last_basic_block, in case this runs without compacting the cfg first
(n_basic_blocks is the real number of basic blocks in the cfg,
last_basic_block is the highest index, so last_basic_block >=
n_basic_blocks).

Thanks for working on this!

Ciao!
Steven

[lra] patch to solve most scalability problems for LRA

2012-10-04 Thread Vladimir Makarov


  The following patch solves most of LRA scalability problems.

  Itswitches on simpler algorithms in LRA.  The first it switches off
trying to reassign hard registers to spilled pseudos (they usually for such
huge functions have long live ranges -- so the possibility to assign
them something very small but trying to reassign them a hard registers
is to expensive), inheritance, live range splitting, and memory
coalescing optimizations.  It seems that rematerialization is too
important for performance -- so I don't switch it off.  As splitting is
also necessary for generation of caller saves code, I switch off
caller-saves in IRA and force IRA to do non-regional RA.

  Here are the results on the huge tests in question.  The testing was
done on Corei7-2600 with 16GB memory to exclude swapping and make cpu
time close to real time (on 8GB LRA in real time worked already faster
reload when swapping occurs because better code/data locality).

  In the following table, time means cpu time of all GCC, size is text
segment size of generated code (data and bss segments are all the same
for given test), RA% means % of all cpu compiler time spent in RA
(IRA+reload or IRA+LRA) taken from -ftime-report (it is approximate
because it is sum of approximate numbers in which some of them are 0%
although they are not exactly 0%).

 The first row of the table is for the current IRA and reload, the 2nd
row is for current IRA+LRA on the branch, the 3rd row is the current
IRA+LRA with the patch.

  Because of small space I put data only for -m32 for the first 2
tests (-m64 has similar results), the 3rd test is for 64 bits because
it can not be correctly compiled for 32 bits.


time/size/RA% PR26854 PR37448PR54146
reload   565.15s  2102964  15%   293.47s  3122140  3% 349.93s  
6556630  19%
lra  624.18s  1707580  22%   311.76s  3221620  9% 469.30s  
6277934  40%
patched lra  524.51s  1720620  8%287.83s  3002372  1% 399.32s  
6395351  30%


  IRA+patched LRA behaves better than IRA+reload with compilation time
and code size point of view.

  Interesting enough, that for PR37448 regular algorithms results in bigger
code size that simpler ones.  I guess, it is because of live-range
splitting.  It can be profitable but has a tendency to generate a
bigger code.

  The only issue now is PR54146 compilation time for IRA+LRA although it
was improved significantly.  I will continue work on PR54146.  But now I
am going to focus on proposals from reviews.

  The patch was successfully bootstrapped on x86/x86-64.  I did it
twice, when simpler algorithms are always switched on (by setting
threshold #pseudos * #basic blocks very small) and when simpler
algorithms are used for huge functions (I believe there are no such
functions in GCC).

Committed as rev. 192086.

2012-10-04  Vladimir Makarov  

* lra.h (lra_simple_p): New external.
* lra.c (lra_simple_p): New global var.
(lra): Switch off inheritance and coalescing if lra_simple_p.
* lra-assigns.c (assign_by_spills): Don't try to reassign spilled
pseduos if lra_simple_p.
* ira.c (ira): Set up lra_simple_p and ira_conflicts_p.  Set up
and restore flag_caller_saves and flag_ira_region.

Index: ira.c
===
--- ira.c	(revision 192048)
+++ ira.c	(working copy)
@@ -4327,8 +4327,26 @@ ira (FILE *f)
   bool loops_p;
   int max_regno_before_ira, ira_max_point_before_emit;
   int rebuild_p;
+  bool saved_flag_caller_saves = flag_caller_saves;
+  enum ira_region saved_flag_ira_region = flag_ira_region;
+
+  ira_conflicts_p = optimize > 0;
 
   ira_use_lra_p = targetm.lra_p ();
+  /* If there are too many pseudos and/or basic blocks (e.g. 10K
+ pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
+ use simplified and faster algorithms in LRA.  */
+  lra_simple_p
+= (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);
+  if (lra_simple_p)
+{
+  /* It permits to skip live range splitting in LRA.  */
+  flag_caller_saves = false;
+  /* There is no sense to do regional allocation when we use
+	 simplified LRA.  */
+  flag_ira_region = IRA_REGION_ONE;
+  ira_conflicts_p = false;
+}
 
 #ifndef IRA_NO_OBSTACK
   gcc_obstack_init (&ira_obstack);
@@ -4349,7 +4367,6 @@ ira (FILE *f)
   ira_dump_file = stderr;
 }
 
-  ira_conflicts_p = optimize > 0;
   setup_prohibited_mode_move_regs ();
 
   df_note_add_problem ();
@@ -4530,6 +4547,13 @@ ira (FILE *f)
   /* See comment for find_moveable_pseudos call.  */
   if (ira_conflicts_p)
 move_unallocated_pseudos ();
+
+  /* Restore original values.  */
+  if (lra_simple_p)
+{
+  flag_caller_saves = saved_flag_caller_saves;
+  flag_ira_region = saved_flag_ira_region;
+}
 }
 
 static void
Index: lra-assigns.c
===
--- lra-assigns.c	(revision 192050)
+++ lra-assigns.c	(wo

Re: [lra] patch to solve most scalability problems for LRA

Re: [lra] patch to solve most scalability problems for LRA

Re: [lra] patch to solve most scalability problems for LRA

Re: [lra] patch to solve most scalability problems for LRA

Re: [lra] patch to solve most scalability problems for LRA

[lra] patch to solve most scalability problems for LRA

6 matches

Site Navigation

Mail list logo

Footer information