[PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Richard Biener

This patch avoids calling ggc_collect after we have possibly forked
during the WPA phase, as that necessarily causes a lot of page
unsharing.  I have verified that during an LTO bootstrap we
do not allocate GC memory during (or after) lto_wpa_write_files,
thus the effect on memory use should be positive (the patch
below contains checking code making sure that we don't allocate).
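
As a minimal sketch of the page unsharing mentioned above (this is not GCC code, and the function name child_writes_are_private is hypothetical): after fork () parent and child share physical pages until one of them writes, at which point the writer gets a private copy of each written page.  Presumably a collection in a forked WPA process writes to many GC pages and so triggers that copy for all of them.  The sketch only shows that a child's writes land in private copies; it does not measure memory.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if the parent's buffer is unchanged after a forked child wrote
   to every page of its own copy, 0 if it changed, -1 on error.  */
int
child_writes_are_private (size_t npages)
{
  long page = sysconf (_SC_PAGESIZE);
  if (page <= 0 || npages == 0)
    return -1;

  size_t len = npages * (size_t) page;
  unsigned char *buf = malloc (len);
  if (!buf)
    return -1;
  memset (buf, 0xAA, len);	/* Touch every page so it is really backed.  */

  pid_t pid = fork ();
  if (pid < 0)
    {
      free (buf);
      return -1;
    }
  if (pid == 0)
    {
      /* Child: one written byte per page is enough to force a private
	 copy of that page, i.e. to unshare it from the parent.  */
      for (size_t off = 0; off < len; off += (size_t) page)
	buf[off] = 0x55;
      _exit (0);
    }

  int status;
  if (waitpid (pid, &status, 0) != pid || !WIFEXITED (status))
    {
      free (buf);
      return -1;
    }

  /* Parent: its pages must still hold the original pattern.  */
  int unchanged = 1;
  for (size_t off = 0; off < len; off += (size_t) page)
    if (buf[off] != 0xAA)
      unchanged = 0;

  free (buf);
  return unchanged;
}
```

Calling child_writes_are_private (16) from a small main () should return 1 on a POSIX system; with N forked children each writing to all pages, the copied memory is needed N times.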

LTO bootstrapped on x86_64-unknown-linux-gnu, will apply shortly
(without the checking code of course).

That should fix the WPA memory explosion Martin sees when building
Chromium.

Richard.

2014-03-19  Richard Biener  rguent...@suse.de

* lto.c (lto_wpa_write_files): Move call to
lto_promote_cross_file_statics ...
(do_whole_program_analysis): ... here, into the partitioning
block.  Do not ggc_collect after lto_wpa_write_files but
for a last time before it.

Index: gcc/ggc-page.c
===
--- gcc/ggc-page.c  (revision 208642)
+++ gcc/ggc-page.c  (working copy)
@@ -1199,6 +1199,8 @@ ggc_round_alloc_size (size_t requested_s
   return size;
 }
 
+int may_alloc = 1;
+
 /* Allocate a chunk of memory of SIZE bytes.  Its contents are undefined.  */
 
 void *
@@ -1208,6 +1210,9 @@ ggc_internal_alloc_stat (size_t size MEM
   struct page_entry *entry;
   void *result;
 
+  if (!may_alloc)
+    fatal_error ("allocating GC memory");
+
   ggc_round_alloc_size_1 (size, order, object_size);
 
   /* If there are non-full pages for this size allocation, they are at
Index: gcc/lto/lto.c
===
--- gcc/lto/lto.c   (revision 208642)
+++ gcc/lto/lto.c   (working copy)
@@ -2565,11 +2566,6 @@ lto_wpa_write_files (void)
   FOR_EACH_VEC_ELT (ltrans_partitions, i, part)
     lto_stats.num_output_symtab_nodes += lto_symtab_encoder_size (part->encoder);
 
-  /* Find out statics that need to be promoted
- to globals with hidden visibility because they are accessed from multiple
- partitions.  */
-  lto_promote_cross_file_statics ();
-
   timevar_pop (TV_WHOPR_WPA);
 
   timevar_push (TV_WHOPR_WPA_IO);
@@ -3281,11 +3277,25 @@ do_whole_program_analysis (void)
     node->aux = NULL;
 
   lto_stats.num_cgraph_partitions += ltrans_partitions.length ();
+
+  /* Find out statics that need to be promoted
+ to globals with hidden visibility because they are accessed from multiple
+ partitions.  */
+  lto_promote_cross_file_statics ();
   timevar_pop (TV_WHOPR_PARTITIONING);
 
   timevar_stop (TV_PHASE_OPT_GEN);
-  timevar_start (TV_PHASE_STREAM_OUT);
 
+  /* Collect a last time - in lto_wpa_write_files we may end up forking
+ with the idea that this doesn't increase memory usage.  So we
+ absolutely do not want to collect after that.  */
+  ggc_collect ();
+{
+  extern int may_alloc;
+  may_alloc = 0;
+}
+
+  timevar_start (TV_PHASE_STREAM_OUT);
   if (!quiet_flag)
 {
       fprintf (stderr, "\nStreaming out");
@@ -3294,10 +3304,8 @@ do_whole_program_analysis (void)
   lto_wpa_write_files ();
   if (!quiet_flag)
     fprintf (stderr, "\n");
-
   timevar_stop (TV_PHASE_STREAM_OUT);
 
-  ggc_collect ();
   if (post_ipa_mem_report)
 {
       fprintf (stderr, "Memory consumption after IPA\n");


Re: [PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Steven Bosscher
On Wed, Mar 19, 2014 at 12:10 PM, Richard Biener wrote:
> Index: gcc/ggc-page.c
> ===
> --- gcc/ggc-page.c  (revision 208642)
> +++ gcc/ggc-page.c  (working copy)
> @@ -1199,6 +1199,8 @@ ggc_round_alloc_size (size_t requested_s
>    return size;
>  }
>
> +int may_alloc = 1;

bool may_alloc?

Ciao!
Steven


Re: [PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Richard Biener
On Wed, 19 Mar 2014, Steven Bosscher wrote:

> On Wed, Mar 19, 2014 at 12:10 PM, Richard Biener wrote:
>> Index: gcc/ggc-page.c
>> ===
>> --- gcc/ggc-page.c  (revision 208642)
>> +++ gcc/ggc-page.c  (working copy)
>> @@ -1199,6 +1199,8 @@ ggc_round_alloc_size (size_t requested_s
>>    return size;
>>  }
>>
>> +int may_alloc = 1;
>
> bool may_alloc?

It's only checking code I didn't commit.  We may of course alloc
but I wanted to prove we don't.
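
The checking idea can be sketched in isolation as follows.  This is only an illustration, not the GCC code: the names alloc_allowed, freeze_allocations, checked_alloc and forbidden_alloc_count are hypothetical, and the sketch counts violations and returns NULL where the posted patch calls fatal_error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Cleared once the program reaches the point after which no further
   allocation is expected (in the patch: right before stream-out).  */
static bool alloc_allowed = true;

/* Number of allocation attempts seen after the freeze.  */
static unsigned long forbidden_allocs = 0;

/* Mark the point after which allocating counts as a bug.  */
void
freeze_allocations (void)
{
  alloc_allowed = false;
}

/* Allocation wrapper: refuses to allocate once frozen and records
   the attempt so it can be reported.  */
void *
checked_alloc (size_t size)
{
  if (!alloc_allowed)
    {
      forbidden_allocs++;
      return NULL;
    }
  return malloc (size);
}

/* How many allocations were attempted after the freeze.  */
unsigned long
forbidden_alloc_count (void)
{
  return forbidden_allocs;
}
```

If forbidden_alloc_count () is still 0 at exit, the run has shown that nothing allocated after the freeze, which is the property the checking code was meant to prove for an LTO bootstrap.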

Richard.


Re: [PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Richard Biener
On Wed, 19 Mar 2014, Martin Liška wrote:

> There are stats for Firefox with LTO and -O2. According to graphs it
> looks that memory consumption for parallel WPA phase is similar.
> When I disable parallel WPA, wpa footprint is ~4GB, but ltrans memory
> footprint is similar to parallel WPA that reduces libxul.so linking by ~10%.

Ok, so I suppose this tracks RSS, not virtual memory use (what is
used and what is active)?

And it is WPA plus LTRANS stages, WPA ends where memory use first goes
down to zero?

I wonder if you can identify the point where parallel streaming
starts and where it ends ... ;)

Btw, I have another patch in my local tree, limiting the
exponential growth of blocks we allocate when outputting sections.
But it shouldn't be _that_ bad ... maybe you can try if it has
any effect?

Thanks,
Richard.

Index: gcc/lto-section-out.c
===
--- gcc/lto-section-out.c   (revision 208642)
+++ gcc/lto-section-out.c   (working copy)
@@ -99,13 +99,19 @@ lto_end_section (void)
 }
 
 
+/* We exponentially grow the size of the blocks as we need to make
+   room for more data to be written.  Start with a single page and go up
+   to 2MB pages for this.  */
+#define FIRST_BLOCK_SIZE 4096
+#define MAX_BLOCK_SIZE (2 * 1024 * 1024)
+
 /* Write all of the chars in OBS to the assembler.  Recycle the blocks
in obs as this is being done.  */
 
 void
 lto_write_stream (struct lto_output_stream *obs)
 {
-  unsigned int block_size = 1024;
+  unsigned int block_size = FIRST_BLOCK_SIZE;
   struct lto_char_ptr_base *block;
   struct lto_char_ptr_base *next_block;
   if (!obs->first_block)
@@ -135,6 +141,7 @@ lto_write_stream (struct lto_output_stre
   else
lang_hooks.lto.append_data (base, num_chars, block);
   block_size *= 2;
+  block_size = MIN (MAX_BLOCK_SIZE, block_size);
 }
 }
 
@@ -152,7 +159,7 @@ lto_append_block (struct lto_output_stre
 {
   /* This is the first time the stream has been written
 into.  */
-  obs->block_size = 1024;
+  obs->block_size = FIRST_BLOCK_SIZE;
   new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
   obs->first_block = new_block;
 }
@@ -162,6 +169,7 @@ lto_append_block (struct lto_output_stre
   /* Get a new block that is twice as big as the last block
 and link it into the list.  */
   obs->block_size *= 2;
+  obs->block_size = MIN (MAX_BLOCK_SIZE, obs->block_size);
   new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
   /* The first bytes of the block are reserved as a pointer to
 the next block.  Set the chain of the full block to the
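
For illustration only, the capped growth in the patch above can be written as a standalone helper.  FIRST_BLOCK_SIZE and MAX_BLOCK_SIZE are taken from the patch; next_block_size and chain_bytes are hypothetical names that do not exist in GCC.

```c
#include <stddef.h>

#define FIRST_BLOCK_SIZE 4096
#define MAX_BLOCK_SIZE (2 * 1024 * 1024)

/* Size of the block that follows a block of CURRENT bytes: double it,
   but never go beyond MAX_BLOCK_SIZE.  */
size_t
next_block_size (size_t current)
{
  size_t doubled = current * 2;
  return doubled < MAX_BLOCK_SIZE ? doubled : MAX_BLOCK_SIZE;
}

/* Total bytes allocated for a chain of NBLOCKS blocks that starts at
   FIRST_BLOCK_SIZE and grows with next_block_size.  */
size_t
chain_bytes (unsigned int nblocks)
{
  size_t total = 0;
  size_t size = FIRST_BLOCK_SIZE;
  for (unsigned int i = 0; i < nblocks; i++)
    {
      total += size;
      size = next_block_size (size);
    }
  return total;
}
```

With these values the tenth block already reaches the 2MB cap, so every block after it adds a fixed 2MB instead of doubling again.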

Re: [PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Martin Liška


On 03/19/2014 03:55 PM, Richard Biener wrote:
> On Wed, 19 Mar 2014, Martin Liška wrote:
>
>> There are stats for Firefox with LTO and -O2. According to graphs it
>> looks that memory consumption for parallel WPA phase is similar.
>> When I disable parallel WPA, wpa footprint is ~4GB, but ltrans memory
>> footprint is similar to parallel WPA that reduces libxul.so linking by ~10%.
>
> Ok, so I suppose this tracks RSS, not virtual memory use (what is
> used and what is active)?

Data are given by vmstat, according to:
http://stackoverflow.com/questions/18529723/what-is-active-memory-and-inactive-memory

  *Active memory* is memory that is being used by a particular process.
  *Inactive memory* is memory that was allocated to a process that is no
  longer running.

So please follow just the 'blue' line, which displays the memory really used.
According to the man page, vmstat tracks virtual memory statistics.

> And it is WPA plus LTRANS stages, WPA ends where memory use first goes
> down to zero?
> I wonder if you can identify the point where parallel streaming
> starts and where it ends ... ;)

Exactly, WPA ends when it goes to zero.

> Btw, I have another patch in my local tree, limiting the
> exponential growth of blocks we allocate when outputting sections.
> But it shouldn't be _that_ bad ... maybe you can try if it has
> any effect?

I can apply it.

Martin






Re: [PATCH] Avoid ggc_collect () after WPA forking

2014-03-19 Thread Richard Biener
On Wed, 19 Mar 2014, Martin Liška wrote:

 
> On 03/19/2014 03:55 PM, Richard Biener wrote:
>> On Wed, 19 Mar 2014, Martin Liška wrote:
>>
>>> There are stats for Firefox with LTO and -O2. According to graphs it
>>> looks that memory consumption for parallel WPA phase is similar.
>>> When I disable parallel WPA, wpa footprint is ~4GB, but ltrans memory
>>> footprint is similar to parallel WPA that reduces libxul.so linking by
>>> ~10%.
>> Ok, so I suppose this tracks RSS, not virtual memory use (what is
>> used and what is active)?
>
> Data are given by vmstat, according to:
> http://stackoverflow.com/questions/18529723/what-is-active-memory-and-inactive-memory
>
> *Active memory* is memory that is being used by a particular process.
> *Inactive memory* is memory that was allocated to a process that is no longer
> running.
>
> So please follow just 'blue' line that displays really used memory. According
> to man, vmstat tracks virtual memory statistics.

But 'blue' is neither active nor inactive ... what is 'used'?  Does
it correspond to 'swpd'?

If it is virtual memory in use then this is expected to grow when 
fork()ing as the virtual memory space is obviously copied (just the pages 
are still shared).

For me allocating a GB of memory and clearing it increases active by
1GB and then forking doesn't increase any of the metrics vmstat -a
outputs in any significant way.
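
A rough sketch of such an experiment, under the assumption that it is run while `vmstat -a 1` is watched in another terminal (the function name hold_memory_across_fork and its parameters are hypothetical, not what was actually run):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate and clear BYTES of memory, wait SECONDS, then fork a child
   that only reads the buffer and idles for SECONDS.  Returns 0 on
   success, -1 on error.  */
int
hold_memory_across_fork (size_t bytes, unsigned int seconds)
{
  char *buf = malloc (bytes);
  if (!buf)
    return -1;
  memset (buf, 0, bytes);	/* Clearing touches every page.  */
  sleep (seconds);		/* Time to see 'active' grow.  */

  pid_t pid = fork ();
  if (pid < 0)
    {
      free (buf);
      return -1;
    }
  if (pid == 0)
    {
      /* Child: reading does not unshare pages, so the counters
	 should stay roughly where they were before the fork.  */
      volatile char sink = buf[0];
      (void) sink;
      sleep (seconds);
      _exit (0);
    }

  int status;
  int ok = (waitpid (pid, &status, 0) == pid
	    && WIFEXITED (status)
	    && WEXITSTATUS (status) == 0);
  free (buf);
  return ok ? 0 : -1;
}
```

For example, hold_memory_across_fork ((size_t) 1 << 30, 10) holds 1GB for ten seconds before and after the fork; a child that wrote to the buffer instead of reading it would be expected to raise the counters.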

>> And it is WPA plus LTRANS stages, WPA ends where memory use first goes
>> down to zero?
>> I wonder if you can identify the point where parallel streaming
>> starts and where it ends ... ;)
>
> Exactly, WPA ends when it goes to zero.

So the difference isn't that big (8GB vs. 7.2GB), and is likely attributable
to heap memory we allocate during the stream-out.  For example
we need some for the tree-ref-encoders (I remember that can be a
significant amount of memory, but I improved that already as far as
possible...).  So yes, we _do_ allocate memory during stream-out
and that is now required N times.

>> Btw, I have another patch in my local tree, limiting the
>> exponential growth of blocks we allocate when outputting sections.
>> But it shouldn't be _that_ bad ... maybe you can try if it has
>> any effect?
>
> I can apply it.

Thanks,
Richard.