> On Thu, Apr 3, 2014 at 2:07 PM, Martin Liška <mli...@suse.cz> wrote:
> >
> > On 04/03/2014 11:41 AM, Richard Biener wrote:
> >>
> >> On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška <mli...@suse.cz> wrote:
> >>>
> >>> On 04/02/2014 04:13 PM, Martin Liška wrote:
> >>>>
> >>>> On 03/27/2014 10:48 AM, Martin Liška wrote:
> >>>>>
> >>>>> The previous patch is wrong, I made a mistake in the name ;)
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>> On 03/27/2014 09:52 AM, Martin Liška wrote:
> >>>>>>
> >>>>>> On 03/25/2014 09:50 PM, Jan Hubicka wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>> I've been compiling Chromium with LTO and I noticed that WPA
> >>>>>>>> stream_out forks and streams in parallel:
> >>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html
> >>>>>>>>
> >>>>>>>> I am unable to fit in 16GB of memory: ld uses about 8GB and lto1
> >>>>>>>> about 6GB. When WPA starts to fork, memory consumption increases
> >>>>>>>> so much that lto1 is killed. I would appreciate a --param option
> >>>>>>>> to disable this WPA fork. The number of forks is taken from the
> >>>>>>>> build system (-flto=9), which is fine for the LTRANS phase,
> >>>>>>>> because ld has released the aforementioned 8GB by then.
> >>>>>>>>
> >>>>>>>> What do you think about that?
> >>>>>>>
> >>>>>>> I can take a look - our measurements suggested that the WPA memory
> >>>>>>> will later be dominated by LTRANS. Perhaps Chromium does something
> >>>>>>> that makes WPA explode; that would be interesting to analyze. I
> >>>>>>> did not manage to get through the Chromium LTO build process
> >>>>>>> recently (ninja builds are not my friends), can you send me the
> >>>>>>> instructions?
> >>>>>>>
> >>>>>>> Honza
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Martin
> >>>>>>
> >>>>>> Here are instructions for building Chromium with LTO:
> >>>>>> 1) install depot-tools and export the PATH variable according to
> >>>>>> the guide:
> >>>>>> http://www.chromium.org/developers/how-tos/install-depot-tools
> >>>>>> 2) check out the source code: gclient sync; cd src
> >>>>>> 3) apply the patch (enables the system gold linker and disables
> >>>>>> LTO for a sandbox that uses top-level asm)
> >>>>>> 4) ensure that which ld points to ld.gold
> >>>>>> 5) ensure that ld.bfd points to ld.bfd
> >>>>>> 6) run: build/gyp_chromium -Dwerror=
> >>>>>> 7) ninja -C out/Release chrome -jX
> >>>>>>
> >>>>>> If there are any problems, follow:
> >>>>>> https://code.google.com/p/chromium/wiki/LinuxBuildInstructions
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>> Hello,
> >>>> taking the latest trunk GCC, I built Firefox and Chromium. Both
> >>>> projects were compiled without debugging symbols and with -O2 on an
> >>>> 8-core machine.
> >>>>
> >>>> Firefox:
> >>>> -flto=9, peak memory usage (in LTRANS): 11GB
> >>>>
> >>>> Chromium:
> >>>> -flto=6, peak memory usage (in the parallel WPA phase): 16.5GB
> >>>>
> >>>> For details please see the attachment with graphs. The attachment
> >>>> also contains -fmem-report and -fmem-report-wpa output.
> >>>> I think the claimed reduction of the memory footprint to ~3.5GB is
> >>>> a bit optimistic: http://gcc.gnu.org/gcc-4.9/changes.html
> >>>>
> >>>> Is there any way we can reduce the memory footprint?
> >>>>
> >>>> Attachment (due to size restriction):
> >>>> https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing
> >>>>
> >>>> Thank you,
> >>>> Martin
> >>>
> >>> The previous email presents somewhat misleading graphs (influenced
> >>> by --enable-gather-detailed-mem-stats).
> >>>
> >>> Firefox:
> >>> -flto=9, WPA peak: 8GB, LTRANS peak: 8GB
> >>> -flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
> >>> -flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB
> >>>
> >>> These data show that parallel WPA streaming increases the short-term
> >>> memory footprint by 4.5GB for -flto=9 (respectively by 1.5GB in the
> >>> case of -flto=4).
> >>>
> >>> For more details, please see the attachment.
> >>
> >> The main overhead comes from maintaining the state during output of
> >> the global types/decls. We maintain somewhat "duplicate" info here
> >> by having both the tree_ref_encoder and the streamer cache.
> >> Eventually we can free the tree_ref_encoder pointer-map early, like
> >> with
> >>
> >> Index: lto-streamer-out.c
> >> ===================================================================
> >> --- lto-streamer-out.c  (revision 209018)
> >> +++ lto-streamer-out.c  (working copy)
> >> @@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)
> >>
> >>    gcc_assert (!alias_pairs);
> >>
> >> -  /* Write the global symbols.  */
> >> +  /* Get rid of the global decl state hash tables to save some memory.  */
> >>    out_state = lto_get_out_decl_state ();
> >> -  num_fns = lto_function_decl_states.length ();
> >> +  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
> >> +    if (out_state->streams[i].tree_hash_table)
> >> +      {
> >> +        delete out_state->streams[i].tree_hash_table;
> >> +        out_state->streams[i].tree_hash_table = NULL;
> >> +      }
> >> +
> >> +  /* Write the global symbols.  */
> >>    lto_output_decl_state_streams (ob, out_state);
> >> +  num_fns = lto_function_decl_states.length ();
> >>    for (idx = 0; idx < num_fns; idx++)
> >>      {
> >>        fn_out_state =
> >>
> >> as we do already for the fn state streams (untested).
> >>
> >> We can also avoid re-allocating the output hashtable/vector by
> >> allocating a bigger initial size for the streamer_tree_cache after
> >> (or in) create_output_block. Note that the pointer-set already
> >> expands when the fill level is > 25%, and it really grows
> >> exponentially (similar to hash_table, btw, but that grows only at a
> >> 75% fill level).
> >>
> >> OTOH simply summing the lengths of all decl streams results in a
> >> lower value than the actual number of output trees in the output
> >> block. Humm.
> >>
> >> But this is clearly the data structure that could be worth
> >> optimizing in some way. For example, during writing we don't need
> >> the streamer cache nodes array (we just need a counter to assign
> >> indexes).
> >>
> >> Attached is a patch that tries to do that plus the above (in testing
> >> right now). Maybe you can check whether it makes a noticeable
> >> difference.
> >>
> >> Richard.
> >
> > I ran the test with your patch twice; according to the graphs, the
> > memory footprint looks similar. After applying the patch the WPA
> > phase looks a bit faster, but that can be influenced by the HDD being
> > heavily utilized at the end of WPA. The graphs I sent were taken
> > after running: echo 3 > /proc/sys/vm/drop_caches
> >
> > Another idea is to use threads instead of forking processes, but I am
> > not familiar with the data-sharing problems between threads.
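[To make Richard's "counter instead of nodes array" remark above concrete,
here is a minimal standalone sketch. All names in it (write_only_tree_cache,
lookup_or_add) are hypothetical and do not match the real
streamer_tree_cache_d layout; it only illustrates the idea that the writer
needs the tree -> index direction plus a counter, while the index -> tree
nodes array matters only on the reading side.]

    /* Hypothetical sketch, not the actual GCC data structure.  */
    #include <unordered_map>

    struct write_only_tree_cache
    {
      std::unordered_map<const void *, unsigned> indexes; /* tree -> slot.  */
      unsigned next_index = 0;  /* Replaces the nodes array when writing.  */

      /* Set *IX to the index of NODE, assigning the next free index on
         the first encounter.  Returns true on a cache hit (NODE already
         streamed), false when the caller still has to stream it.  */
      bool lookup_or_add (const void *node, unsigned *ix)
      {
        auto res = indexes.insert ({node, next_index});
        *ix = res.first->second;
        if (res.second)
          {
            next_index++;
            return false;  /* New entry.  */
          }
        return true;       /* Seen before, just reference it.  */
      }
    };

[If that holds, the per-entry memory of the nodes vector is dead weight on
the output side and can be saved during the parallel streaming phase.]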
We would need to make LTO streaming thread-safe. That is probably not
terribly difficult to do (just a lot of legwork), but with copy-on-write,
threads should not be significantly cheaper than forking as long as we
avoid writing into memory we don't need to touch.

Honza
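[The copy-on-write behaviour Honza describes can be observed with a small
standalone demo; this is not GCC code, and the buffer size and worker count
are arbitrary stand-ins. Forked children that only read the parent's heap
keep sharing its physical pages, so resident memory does not multiply by
the number of workers; every page a child writes is where the footprint
actually grows.]

    /* cow-demo.cc: fork workers that only read a large parent buffer.
       With copy-on-write the kernel keeps a single physical copy; watch
       RSS in top or /proc/<pid>/smaps while this runs.  */
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    int
    main ()
    {
      const size_t size = 512 * 1024 * 1024; /* Stand-in for WPA state.  */
      char *state = static_cast<char *> (malloc (size));
      if (!state)
        return 1;
      memset (state, 1, size);  /* Parent touches every page once.  */

      for (int i = 0; i < 9; i++)  /* Mirrors -flto=9.  */
        if (fork () == 0)
          {
            /* Read-only scan: all pages stay shared with the parent.
               Writing to STATE here would fault in private copies and
               multiply the footprint, which is what parallel WPA
               streaming has to avoid.  */
            size_t sum = 0;
            for (size_t j = 0; j < size; j += 4096)
              sum += state[j];
            _exit (sum & 0xff);
          }

      while (wait (NULL) > 0)
        ;
      free (state);
      return 0;
    }

[The same reasoning applies to threads: they share everything up front, but
if the streaming code writes into the shared data anyway, fork plus
copy-on-write already gives most of the benefit, which is the point above.]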