> On Thu, Apr 3, 2014 at 2:07 PM, Martin Liška <mli...@suse.cz> wrote:
> >
> > On 04/03/2014 11:41 AM, Richard Biener wrote:
> >>
> >> On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška <mli...@suse.cz> wrote:
> >>>
> >>> On 04/02/2014 04:13 PM, Martin Liška wrote:
> >>>>
> >>>> On 03/27/2014 10:48 AM, Martin Liška wrote:
> >>>>>
> >>>>> The previous patch is wrong, I made a mistake in the name ;)
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>> On 03/27/2014 09:52 AM, Martin Liška wrote:
> >>>>>>
> >>>>>> On 03/25/2014 09:50 PM, Jan Hubicka wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>> I've been compiling Chromium with LTO and I noticed that WPA
> >>>>>>>> stream_out forks and streams in parallel:
> >>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html
> >>>>>>>>
> >>>>>>>> I am unable to fit in 16GB of memory: ld uses about 8GB and lto1
> >>>>>>>> about 6GB. When WPA starts to fork, memory consumption increases
> >>>>>>>> so much that lto1 is killed. I would appreciate a --param option
> >>>>>>>> to disable this WPA fork. The number of forks is taken from the
> >>>>>>>> build system (-flto=9), which is fine for the LTRANS phase,
> >>>>>>>> because ld has released the aforementioned 8GB by then.
> >>>>>>>>
> >>>>>>>> What do you think about that?
> >>>>>>>
> >>>>>>> I can take a look - our measurements suggested that the WPA memory
> >>>>>>> will later be dominated by LTRANS. Perhaps Chromium does something
> >>>>>>> that makes WPA explode; that would be interesting to analyze. I
> >>>>>>> did not manage to get through the Chromium LTO build process
> >>>>>>> recently (ninja builds are not my friends), can you send me the
> >>>>>>> instructions?
> >>>>>>>
> >>>>>>> Honza
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Martin
> >>>>>>
> >>>>>> Here are instructions for building Chromium with LTO:
> >>>>>> 1) install depot-tools and export the PATH variable according to
> >>>>>> the guide:
> >>>>>> http://www.chromium.org/developers/how-tos/install-depot-tools
> >>>>>> 2) check out the source code: gclient sync; cd src
> >>>>>> 3) apply the patch (enables the system gold linker and disables
> >>>>>> LTO for a sandbox that uses top-level asm)
> >>>>>> 4) ensure that which ld points to ld.gold
> >>>>>> 5) ensure that ld.bfd points to ld.bfd
> >>>>>> 6) run: build/gyp_chromium -Dwerror=
> >>>>>> 7) ninja -C out/Release chrome -jX
> >>>>>>
> >>>>>> If there are any problems, follow:
> >>>>>> https://code.google.com/p/chromium/wiki/LinuxBuildInstructions
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>> Hello,
> >>>> taking the latest trunk GCC, I built Firefox and Chromium. Both
> >>>> projects were compiled without debugging symbols and with -O2 on an
> >>>> 8-core machine.
> >>>>
> >>>> Firefox:
> >>>> -flto=9, peak memory usage (in LTRANS): 11GB
> >>>>
> >>>> Chromium:
> >>>> -flto=6, peak memory usage (in the parallel WPA phase): 16.5GB
> >>>>
> >>>> For details please see the attachment with graphs. The attachment
> >>>> also contains -fmem-report and -fmem-report-wpa output.
> >>>> I think the claimed reduction of the memory footprint to ~3.5GB is
> >>>> a bit optimistic: http://gcc.gnu.org/gcc-4.9/changes.html
> >>>>
> >>>> Is there any way we can reduce the memory footprint?
> >>>>
> >>>> Attachment (due to size restriction):
> >>>> https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing
> >>>>
> >>>> Thank you,
> >>>> Martin
> >>>
> >>> The previous email presents somewhat misleading graphs (influenced
> >>> by --enable-gather-detailed-mem-stats).
> >>>
> >>> Firefox:
> >>> -flto=9, WPA peak: 8GB, LTRANS peak: 8GB
> >>> -flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
> >>> -flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB
> >>>
> >>> These data show that parallel WPA streaming increases the short-term
> >>> memory footprint by 4.5GB for -flto=9 (respectively by 1.5GB in the
> >>> case of -flto=4).
> >>>
> >>> For more details, please see the attachment.
> >>
> >> The main overhead comes from maintaining the state during output of
> >> the global types/decls. We maintain somewhat "duplicate" info here
> >> by having both the tree_ref_encoder and the streamer cache.
> >> Eventually we can free the tree_ref_encoder pointer-map early, like
> >> with
> >>
> >> Index: lto-streamer-out.c
> >> ===================================================================
> >> --- lto-streamer-out.c  (revision 209018)
> >> +++ lto-streamer-out.c  (working copy)
> >> @@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)
> >>
> >>    gcc_assert (!alias_pairs);
> >>
> >> -  /* Write the global symbols.  */
> >> +  /* Get rid of the global decl state hash tables to save some memory.  */
> >>    out_state = lto_get_out_decl_state ();
> >> -  num_fns = lto_function_decl_states.length ();
> >> +  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
> >> +    if (out_state->streams[i].tree_hash_table)
> >> +      {
> >> +        delete out_state->streams[i].tree_hash_table;
> >> +        out_state->streams[i].tree_hash_table = NULL;
> >> +      }
> >> +
> >> +  /* Write the global symbols.  */
> >>    lto_output_decl_state_streams (ob, out_state);
> >> +  num_fns = lto_function_decl_states.length ();
> >>    for (idx = 0; idx < num_fns; idx++)
> >>      {
> >>        fn_out_state =
> >>
> >> as we do already for the fn state streams (untested).
> >>
> >> We can also avoid re-allocating the output hashtable/vector by
> >> allocating a bigger initial size for the streamer_tree_cache after
> >> (or in) create_output_block. Note that the pointer-set already
> >> expands when the fill level is > 25%, and it really grows
> >> exponentially (similar to hash_table, btw, but that grows only at a
> >> 75% fill level).
> >>
> >> OTOH simply summing the lengths of all decl streams results in a
> >> lower value than the actual number of output trees in the output
> >> block. Humm.
> >>
> >> But this is clearly the data structure that could be worth
> >> optimizing in some way. For example, during writing we don't need
> >> the streamer cache nodes array (we just need a counter to assign
> >> indexes).
> >>
> >> Attached is a patch that tries to do that plus the above (in testing
> >> right now). Maybe you can check whether it makes a noticeable
> >> difference.
> >>
> >> Richard.
> >
> > I ran the test with your patch twice; according to the graphs, the
> > memory footprint looks similar. After applying the patch the WPA
> > phase looks a bit faster, but that can be influenced by the HDD being
> > heavily utilized at the end of WPA. The graphs I sent were taken
> > after running: echo 3 > /proc/sys/vm/drop_caches
> >
> > Another idea is to use threads instead of forking processes, but I am
> > not familiar with the data-sharing problems between threads.
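[To make Richard's "counter instead of nodes array" remark above concrete,
here is a minimal standalone sketch. All names in it (write_only_tree_cache,
lookup_or_add) are hypothetical and do not match the real
streamer_tree_cache_d layout; it only illustrates the idea that the writer
needs the tree -> index direction plus a counter, while the index -> tree
nodes array matters only on the reading side.]

    /* Hypothetical sketch, not the actual GCC data structure.  */
    #include <unordered_map>

    struct write_only_tree_cache
    {
      std::unordered_map<const void *, unsigned> indexes; /* tree -> slot.  */
      unsigned next_index = 0;  /* Replaces the nodes array when writing.  */

      /* Set *IX to the index of NODE, assigning the next free index on
         the first encounter.  Returns true on a cache hit (NODE already
         streamed), false when the caller still has to stream it.  */
      bool lookup_or_add (const void *node, unsigned *ix)
      {
        auto res = indexes.insert ({node, next_index});
        *ix = res.first->second;
        if (res.second)
          {
            next_index++;
            return false;  /* New entry.  */
          }
        return true;       /* Seen before, just reference it.  */
      }
    };

[If that holds, the per-entry memory of the nodes vector is dead weight on
the output side and can be saved during the parallel streaming phase.]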
We would need to make LTO streaming thread-safe. That is probably not
terribly difficult to do (just a lot of legwork), but with copy-on-write,
threads should not be significantly cheaper than forking as long as we
avoid writing into memory we don't need to touch.

Honza
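[The copy-on-write behaviour Honza describes can be observed with a small
standalone demo; this is not GCC code, and the buffer size and worker count
are arbitrary stand-ins. Forked children that only read the parent's heap
keep sharing its physical pages, so resident memory does not multiply by
the number of workers; every page a child writes is where the footprint
actually grows.]

    /* cow-demo.cc: fork workers that only read a large parent buffer.
       With copy-on-write the kernel keeps a single physical copy; watch
       RSS in top or /proc/<pid>/smaps while this runs.  */
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    int
    main ()
    {
      const size_t size = 512 * 1024 * 1024; /* Stand-in for WPA state.  */
      char *state = static_cast<char *> (malloc (size));
      if (!state)
        return 1;
      memset (state, 1, size);  /* Parent touches every page once.  */

      for (int i = 0; i < 9; i++)  /* Mirrors -flto=9.  */
        if (fork () == 0)
          {
            /* Read-only scan: all pages stay shared with the parent.
               Writing to STATE here would fault in private copies and
               multiply the footprint, which is what parallel WPA
               streaming has to avoid.  */
            size_t sum = 0;
            for (size_t j = 0; j < size; j += 4096)
              sum += state[j];
            _exit (sum & 0xff);
          }

      while (wait (NULL) > 0)
        ;
      free (state);
      return 0;
    }

[The same reasoning applies to threads: they share everything up front, but
if the streaming code writes into the shared data anyway, fork plus
copy-on-write already gives most of the benefit, which is the point above.]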