On Fri, 11 Dec 2015, Jan Hubicka wrote:

> Hi,
> this patch makes WPA to copy sections w/o decompressing them.  This leads
> to a nice /tmp usage for GCC bootstrap (about 70%) and little for Firefox.
> In GCC about 5% of the ltrans object file is the global decl section, while
> for Firefox it is 85%.  I will try to figure out if there is something
> terribly stupid pickled there.
>
> The patch simply adds raw section i/o to lto-section-in.c and
> lto-section-out.c which is used by copy_function_or_variable.  The catch
> is that WPA->ltrans streaming is not compressed and this fact is not
> represented in the object file at all.  We simply test flag_wpa and
> flag_ltrans.  Now function sections born at WPA time are uncompressed,
> while function sections just copied are compressed and we do not know how
> to read them.
>
> I tried to simply turn off the non-compressed path and set compression level
> to minimal and then to none (which works despite the apparently outdated FIXME
> comments I removed).  Sadly zlib manages to burn about 16% of WPA time
> at minimal level and about 7% at none because it computes the checksum.
> Clearly next stage1 it is time to switch to better compression backend.
>
> For now I added the information if section is compressed into
> decl_state.  I am not thrilled by this but it is only way I found w/o
> wasting 4 bytes per every lto section (because the lto header is not
> really extensible and the stream is assumed to be aligned).
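For readers following along, here is a minimal sketch of the flag storage described above.  It assumes the compressed bit is folded into the low bit of the 32-bit reference that is already streamed per decl state, which is what the quoted patch further down appears to do (ref * 2 + compressed on output, ix & 1 and ix / 2 on input).  The names decl_ref, pack_decl_ref and unpack_decl_ref are hypothetical helpers and not part of GCC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only, none of these names exist in GCC.
// A decoded decl-state reference: the tree-cache index plus the flag.
struct decl_ref
{
  uint32_t index;
  bool compressed;
};

// Fold the flag into the low bit of the reference, so no extra bytes
// are written per section.  The index must leave the top bit free.
inline uint32_t
pack_decl_ref (uint32_t index, bool compressed)
{
  assert (index < (UINT32_C (1) << 31));
  return index * 2 + (compressed ? 1 : 0);
}

// Reverse of pack_decl_ref: low bit is the flag, the rest is the index.
inline decl_ref
unpack_decl_ref (uint32_t packed)
{
  return { packed / 2, (packed & 1) != 0 };
}
```

The price of this encoding is one bit of index range, and the flag is only available for sections that have a decl state.
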
So this trick now only applies to decl sections?  I think you could
have stolen a bit from lto_simple_header::main_size (oddly
lto_simple_header_with_strings adds its own main_size, hiding
the simple-header one - huh).

Changing lto_header itself into

  int16_t major_version
  int8_t minor_version
  int8_t flags

would be another possibility (and bump the major version).  I think
we have no sections produced with just lto_header but always
lto_simple_header (from grepping).  Some sections have no header
(lto.opts).

So would the patch be a lot more difficult if you go down either of
the routes above?  (I think I prefer changing lto_header rather than
making main_size a bitfield)

Richard.

> The whole lowlevel lto streaming code is grand mess, I hope we will clean this
> up and get more sane headers in foreseeable future.  Until that time this
> solution does not waste extra space as it is easy to pickle the flag as part
> of reference.
>
> The patch saves about 7% of WPA time for firefox:
>
> phase opt and generate  :  75.66 (39%) usr   1.78 (14%) sys  77.44 (37%) wall  855644 kB (21%) ggc
> phase stream in         :  34.62 (18%) usr   1.95 (16%) sys  36.57 (18%) wall 3245604 kB (79%) ggc
> phase stream out        :  81.89 (42%) usr   8.49 (69%) sys  90.37 (44%) wall      50 kB ( 0%) ggc
> ipa dead code removal   :   4.33 ( 2%) usr   0.06 ( 0%) sys   4.24 ( 2%) wall       0 kB ( 0%) ggc
> ipa virtual call target :  25.15 (13%) usr   0.14 ( 1%) sys  25.42 (12%) wall       0 kB ( 0%) ggc
> ipa cp                  :   3.92 ( 2%) usr   0.21 ( 2%) sys   4.18 ( 2%) wall  340698 kB ( 8%) ggc
> ipa inlining heuristics :  24.12 (12%) usr   0.38 ( 3%) sys  24.37 (12%) wall  500427 kB (12%) ggc
> lto stream inflate      :   7.07 ( 4%) usr   0.38 ( 3%) sys   7.33 ( 4%) wall       0 kB ( 0%) ggc
> ipa lto gimple in       :   1.95 ( 1%) usr   0.61 ( 5%) sys   2.42 ( 1%) wall  324875 kB ( 8%) ggc
> ipa lto gimple out      :   9.16 ( 5%) usr   1.64 (13%) sys  10.49 ( 5%) wall      50 kB ( 0%) ggc
> ipa lto decl in         :  21.25 (11%) usr   1.01 ( 8%) sys  22.37 (11%) wall 2348869 kB (57%) ggc
> ipa lto decl out        :  67.33 (34%) usr   1.66 (13%) sys  68.96 (33%) wall       0 kB ( 0%) ggc
> ipa lto constructors out:   1.39 ( 1%) usr   0.38 ( 3%) sys   2.18 ( 1%) wall       0 kB ( 0%) ggc
> ipa lto decl merge      :   2.12 ( 2%) usr   0.00 ( 0%) sys   2.12 ( 2%) wall   13737 kB ( 0%) ggc
> ipa reference           :   2.14 ( 2%) usr   0.00 ( 0%) sys   2.13 ( 2%) wall       0 kB ( 0%) ggc
> ipa pure const          :   2.29 ( 2%) usr   0.01 ( 0%) sys   2.35 ( 2%) wall       0 kB ( 0%) ggc
> ipa icf                 :   9.02 ( 7%) usr   0.18 ( 2%) sys   9.72 ( 7%) wall   19203 kB ( 0%) ggc
> TOTAL                   : 195.27            12.37           207.64            4103297 kB
>
>
> phase opt and generate  :  79.00 (38%) usr   1.61 (13%) sys  80.61 (36%) wall 1000597 kB (24%) ggc
> phase stream in         :  33.93 (16%) usr   1.91 (15%) sys  35.83 (16%) wall 3242293 kB (76%) ggc
> phase stream out        :  96.90 (46%) usr   9.19 (72%) sys 106.09 (48%) wall      52 kB ( 0%) ggc
> garbage collection      :   2.94 ( 1%) usr   0.00 ( 0%) sys   2.93 ( 1%) wall       0 kB ( 0%) ggc
> ipa dead code removal   :   4.60 ( 2%) usr   0.04 ( 0%) sys   4.53 ( 2%) wall       0 kB ( 0%) ggc
> ipa virtual call target :  24.48 (12%) usr   0.14 ( 1%) sys  24.76 (11%) wall       0 kB ( 0%) ggc
> ipa cp                  :   4.92 ( 2%) usr   0.41 ( 3%) sys   5.31 ( 2%) wall  502843 kB (12%) ggc
> ipa inlining heuristics :  23.72 (11%) usr   0.23 ( 2%) sys  23.92 (11%) wall  490927 kB (12%) ggc
> lto stream inflate      :  14.35 ( 7%) usr   0.35 ( 3%) sys  15.22 ( 7%) wall       0 kB ( 0%) ggc
> ipa lto gimple in       :   1.79 ( 1%) usr   0.57 ( 4%) sys   2.46 ( 1%) wall  324857 kB ( 8%) ggc
> ipa lto gimple out      :   9.98 ( 5%) usr   1.45 (11%) sys  11.05 ( 5%) wall      52 kB ( 0%) ggc
> ipa lto decl in         :  21.01 (10%) usr   0.91 ( 7%) sys  21.90 (10%) wall 2345561 kB (55%) ggc
> ipa lto decl out        :  73.55 (35%) usr   2.09 (16%) sys  75.67 (34%) wall       0 kB ( 0%) ggc
> ipa lto constructors out:   1.87 ( 1%) usr   0.32 ( 3%) sys   2.18 ( 1%) wall       0 kB ( 0%) ggc
> ipa lto decl merge      :   2.06 ( 1%) usr   0.00 ( 0%) sys   2.05 ( 1%) wall   13737 kB ( 0%) ggc
> whopr wpa I/O           :   2.84 ( 1%) usr   5.14 (40%) sys   7.96 ( 4%) wall       0 kB ( 0%) ggc
> whopr partitioning      :   3.83 ( 2%) usr   0.01 ( 0%) sys   3.84 ( 2%) wall    5958 kB ( 0%) ggc
> ipa reference           :   2.63 ( 1%) usr   0.00 ( 0%) sys   2.64 ( 1%) wall       0 kB ( 0%) ggc
> ipa icf                 :   8.23 ( 4%) usr   0.12 ( 1%) sys   8.32 ( 4%) wall   19203 kB ( 0%) ggc
> TOTAL                   : 209.83            12.71           222.54            4244939 kB
>
> This now compares well to 5.3:
>
> Execution times (seconds)
>
> phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1989 kB ( 0%) ggc
> phase opt and generate  :  68.61 (31%) usr   2.41 (14%) sys  77.67 (29%) wall 1189579 kB (27%) ggc
> phase stream in         :  36.38 (16%) usr   2.32 (14%) sys  56.20 (21%) wall 3168787 kB (73%) ggc
> phase stream out        : 113.37 (51%) usr  11.90 (71%) sys 130.49 (49%) wall     112 kB ( 0%) ggc
> phase finalize          :   3.40 ( 2%) usr   0.13 ( 1%) sys   3.55 ( 1%) wall       0 kB ( 0%) ggc
> garbage collection      :   6.13 ( 3%) usr   0.01 ( 0%) sys   6.18 ( 2%) wall       0 kB ( 0%) ggc
> ipa dead code removal   :   4.74 ( 2%) usr   0.05 ( 0%) sys   5.09 ( 2%) wall       0 kB ( 0%) ggc
> ipa virtual call target :  11.29 ( 5%) usr   0.15 ( 1%) sys  11.20 ( 4%) wall       1 kB ( 0%) ggc
> ipa cp                  :   5.22 ( 2%) usr   0.21 ( 1%) sys   5.51 ( 2%) wall  507623 kB (12%) ggc
> ipa inlining heuristics :  24.11 (11%) usr   0.33 ( 2%) sys  24.67 ( 9%) wall  497487 kB (11%) ggc
> ipa lto gimple in       :   4.20 ( 2%) usr   1.08 ( 6%) sys  10.73 ( 4%) wall  467276 kB (11%) ggc
> ipa lto gimple out      :  17.57 ( 8%) usr   1.92 (11%) sys  23.61 ( 9%) wall     112 kB ( 0%) ggc
> ipa lto decl in         :  26.19 (12%) usr   1.20 ( 7%) sys  31.62 (12%) wall 2242394 kB (51%) ggc
> ipa lto decl out        :  89.09 (40%) usr   3.64 (22%) sys  92.79 (35%) wall       0 kB ( 0%) ggc
> ipa lto constructors in :   0.79 ( 0%) usr   0.28 ( 2%) sys  14.33 ( 5%) wall   17992 kB ( 0%) ggc
> ipa lto constructors out:   2.57 ( 1%) usr   0.41 ( 2%) sys   4.02 ( 2%) wall       0 kB ( 0%) ggc
> ipa lto cgraph I/O      :   1.11 ( 1%) usr   0.33 ( 2%) sys   1.81 ( 1%) wall  432544 kB (10%) ggc
> ipa lto decl merge      :   2.47 ( 1%) usr   0.00 ( 0%) sys   2.47 ( 1%) wall    8191 kB ( 0%) ggc
> ipa lto cgraph merge    :   1.91 ( 1%) usr   0.01 ( 0%) sys   1.97 ( 1%) wall   14717 kB ( 0%) ggc
> whopr wpa I/O           :   2.92 ( 1%) usr   5.93 (35%) sys   8.84 ( 3%) wall       0 kB ( 0%) ggc
> whopr partitioning      :   3.91 ( 2%) usr   0.02 ( 0%) sys   3.93 ( 1%) wall    6001 kB ( 0%) ggc
> ipa icf                 :   7.77 ( 4%) usr   0.19 ( 1%) sys   8.05 ( 3%) wall   22534 kB ( 1%) ggc
> TOTAL                   : 221.76            16.76           267.92            4360470 kB
>
> Except that I really need to do something with virtual call targets.  As the
> quality of information improved by improved TBAA we now do more walks.
>
> The savings for cc1 build are bigger and incremental linking improvements
> even bigger (about 50%), but I accidentally removed the logs...
>
> lto-bootstrapped/regtested x86_64-linux, OK?
>
>         * cgraph.c (cgraph_node::get_untransformed_body): Pass compressed
>         flag to lto_get_section_data.
>         * varpool.c (varpool_node::get_constructor): Likewise.
>         * lto-section-in.c (lto_get_section_data): Add new flag decompress.
>         (lto_free_section_data): Likewise.
>         (lto_get_raw_section_data): New function.
>         (lto_free_raw_section_data): New function.
>         (copy_function_or_variable): Copy sections w/o decompressing.
>         (lto_output_decl_state_refs): Pickle compressed bit.
>         * lto-streamer.h (lto_in_decl_state): New flag compressed.
>         (lto_out_decl_state): Likewise.
>         (lto_get_section_data, lto_free_section_data): Update prototypes
>         (lto_get_raw_section_data, lto_free_raw_section_data): Declare.
>         (lto_write_raw_data): Declare.
>         (lto_begin_section): Remove FIXME.
>         (lto_write_raw_data): New function.
>         (lto_write_stream): Remove FIXME.
>         (lto_new_out_decl_state): Set compressed flag.
>
>         * lto.c (lto_read_in_decl_state): Unpickle compressed bit.
> Index: cgraph.c
> ===================================================================
> --- cgraph.c (revision 231546)
> +++ cgraph.c (working copy)
> @@ -3251,9 +3251,11 @@ cgraph_node::get_untransformed_body (voi
>
>    /* We may have renamed the declaration, e.g., a static function.  */
>    name = lto_get_decl_name_mapping (file_data, name);
> +  struct lto_in_decl_state *decl_state
> +         = lto_get_function_in_decl_state (file_data, decl);
>
>    data = lto_get_section_data (file_data, LTO_section_function_body,
> -                               name, &len);
> +                               name, &len, decl_state->compressed);
>    if (!data)
>      fatal_error (input_location, "%s: section %s is missing",
>                   file_data->file_name,
> @@ -3264,7 +3266,7 @@ cgraph_node::get_untransformed_body (voi
>    lto_input_function_body (file_data, this, data);
>    lto_stats.num_function_bodies++;
>    lto_free_section_data (file_data, LTO_section_function_body, name,
> -                         data, len);
> +                         data, len, decl_state->compressed);
>    lto_free_function_in_decl_state_for_node (this);
>    /* Keep lto file data so ipa-inline-analysis knows about cross module
>       inlining.  */
> Index: lto-section-in.c
> ===================================================================
> --- lto-section-in.c (revision 231546)
> +++ lto-section-in.c (working copy)
> @@ -130,7 +130,7 @@ const char *
>  lto_get_section_data (struct lto_file_decl_data *file_data,
>                        enum lto_section_type section_type,
>                        const char *name,
> -                      size_t *len)
> +                      size_t *len, bool decompress)
>  {
>    const char *data = (get_section_f) (file_data, section_type, name, len);
>    const size_t header_length = sizeof (struct lto_data_header);
> @@ -142,9 +142,10 @@ lto_get_section_data (struct lto_file_de
>    if (data == NULL)
>      return NULL;
>
> -  /* FIXME lto: WPA mode does not write compressed sections, so for now
> -     suppress uncompression if flag_ltrans.  */
> -  if (!flag_ltrans)
> +  /* WPA->ltrans streams are not compressed with exception of function bodies
> +     and variable initializers that has been verbatim copied from earlier
> +     compilations.  */
> +  if (!flag_ltrans || decompress)
>      {
>        /* Create a mapping header containing the underlying data and length,
>           and prepend this to the uncompression buffer.  The uncompressed data
> @@ -170,6 +171,16 @@ lto_get_section_data (struct lto_file_de
>    return data;
>  }
>
> +/* Get the section data without any header parsing or uncompression.  */
> +
> +const char *
> +lto_get_raw_section_data (struct lto_file_decl_data *file_data,
> +                          enum lto_section_type section_type,
> +                          const char *name,
> +                          size_t *len)
> +{
> +  return (get_section_f) (file_data, section_type, name, len);
> +}
>
>  /* Free the data found from the above call.  The first three
>     parameters are the same as above.  DATA is the data to be freed and
> @@ -180,7 +191,7 @@ lto_free_section_data (struct lto_file_d
>                         enum lto_section_type section_type,
>                         const char *name,
>                         const char *data,
> -                       size_t len)
> +                       size_t len, bool decompress)
>  {
>    const size_t header_length = sizeof (struct lto_data_header);
>    const char *real_data = data - header_length;
> @@ -189,9 +200,7 @@ lto_free_section_data (struct lto_file_d
>
>    gcc_assert (free_section_f);
>
> -  /* FIXME lto: WPA mode does not write compressed sections, so for now
> -     suppress uncompression mapping if flag_ltrans.  */
> -  if (flag_ltrans)
> +  if (flag_ltrans && !decompress)
>      {
>        (free_section_f) (file_data, section_type, name, data, len);
>        return;
> @@ -203,6 +212,17 @@ lto_free_section_data (struct lto_file_d
>    free (CONST_CAST (char *, real_data));
>  }
>
> +/* Free data allocated by lto_get_raw_section_data.  */
> +
> +void
> +lto_free_raw_section_data (struct lto_file_decl_data *file_data,
> +                           enum lto_section_type section_type,
> +                           const char *name,
> +                           const char *data,
> +                           size_t len)
> +{
> +  (free_section_f) (file_data, section_type, name, data, len);
> +}
>
>  /* Load a section of type SECTION_TYPE from FILE_DATA, parse the
>     header and then return an input block pointing to the section.  The
> Index: varpool.c
> ===================================================================
> --- varpool.c (revision 231546)
> +++ varpool.c (working copy)
> @@ -296,9 +303,11 @@ varpool_node::get_constructor (void)
>
>    /* We may have renamed the declaration, e.g., a static function.  */
>    name = lto_get_decl_name_mapping (file_data, name);
> +  struct lto_in_decl_state *decl_state
> +         = lto_get_function_in_decl_state (file_data, decl);
>
>    data = lto_get_section_data (file_data, LTO_section_function_body,
> -                               name, &len);
> +                               name, &len, decl_state->compressed);
>    if (!data)
>      fatal_error (input_location, "%s: section %s is missing",
>                   file_data->file_name,
> @@ -308,7 +317,7 @@ varpool_node::get_constructor (void)
>    gcc_assert (DECL_INITIAL (decl) != error_mark_node);
>    lto_stats.num_function_bodies++;
>    lto_free_section_data (file_data, LTO_section_function_body, name,
> -                         data, len);
> +                         data, len, decl_state->compressed);
>    lto_free_function_in_decl_state_for_node (this);
>    timevar_pop (TV_IPA_LTO_CTORS_IN);
>    return DECL_INITIAL (decl);
> Index: lto-streamer-out.c
> ===================================================================
> --- lto-streamer-out.c (revision 231546)
> +++ lto-streamer-out.c (working copy)
> @@ -2191,22 +2224,23 @@ copy_function_or_variable (struct symtab
>    struct lto_in_decl_state *in_state;
>    struct lto_out_decl_state *out_state = lto_get_out_decl_state ();
>
> -  lto_begin_section (section_name, !flag_wpa);
> +  lto_begin_section (section_name, false);
>    free (section_name);
>
>    /* We may have renamed the declaration, e.g., a static function.  */
>    name = lto_get_decl_name_mapping (file_data, name);
>
> -  data = lto_get_section_data (file_data, LTO_section_function_body,
> -                               name, &len);
> +  data = lto_get_raw_section_data (file_data, LTO_section_function_body,
> +                                   name, &len);
>    gcc_assert (data);
>
>    /* Do a bit copy of the function body.  */
> -  lto_write_data (data, len);
> +  lto_write_raw_data (data, len);
>
>    /* Copy decls.  */
>    in_state =
>      lto_get_function_in_decl_state (node->lto_file_data, function);
> +  out_state->compressed = in_state->compressed;
>    gcc_assert (in_state);
>
>    for (i = 0; i < LTO_N_DECL_STREAMS; i++)
> @@ -2224,8 +2258,8 @@ copy_function_or_variable (struct symtab
>          encoder->trees.safe_push ((*trees)[j]);
>      }
>
> -  lto_free_section_data (file_data, LTO_section_function_body, name,
> -                         data, len);
> +  lto_free_raw_section_data (file_data, LTO_section_function_body, name,
> +                             data, len);
>    lto_end_section ();
>  }
>
> @@ -2431,6 +2465,7 @@ lto_output_decl_state_refs (struct outpu
>    decl = (state->fn_decl) ? state->fn_decl : void_type_node;
>    streamer_tree_cache_lookup (ob->writer_cache, decl, &ref);
>    gcc_assert (ref != (unsigned)-1);
> +  ref = ref * 2 + (state->compressed ? 1 : 0);
>    lto_write_data (&ref, sizeof (uint32_t));
>
>    for (i = 0; i < LTO_N_DECL_STREAMS; i++)
> Index: lto/lto-symtab.c
> ===================================================================
> --- lto/lto-symtab.c (revision 231548)
> +++ lto/lto-symtab.c (working copy)
> @@ -883,6 +883,11 @@ lto_symtab_merge_symbols_1 (symtab_node
>            else
>              {
>                DECL_INITIAL (e->decl) = error_mark_node;
> +              if (e->lto_file_data)
> +                {
> +                  lto_free_function_in_decl_state_for_node (e);
> +                  e->lto_file_data = NULL;
> +                }
>                symtab->call_varpool_removal_hooks (dyn_cast<varpool_node *> (e));
>              }
>            e->remove_all_references ();
> Index: lto/lto.c
> ===================================================================
> --- lto/lto.c (revision 231546)
> +++ lto/lto.c (working copy)
> @@ -234,6 +234,8 @@ lto_read_in_decl_state (struct data_in *
>    uint32_t i, j;
>
>    ix = *data++;
> +  state->compressed = ix & 1;
> +  ix /= 2;
>    decl = streamer_tree_cache_get_tree (data_in->reader_cache, ix);
>    if (!VAR_OR_FUNCTION_DECL_P (decl))
>      {
> Index: lto-streamer.h
> ===================================================================
> --- lto-streamer.h (revision 231546)
> +++ lto-streamer.h (working copy)
> @@ -504,6 +505,9 @@ struct GTY((for_user)) lto_in_decl_state
>    /* If this in-decl state is associated with a function. FN_DECL
>       point to the FUNCTION_DECL. */
>    tree fn_decl;
> +
> +  /* True if decl state is compressed.  */
> +  bool compressed;
>  };
>
>  typedef struct lto_in_decl_state *lto_in_decl_state_ptr;
> @@ -537,6 +541,9 @@ struct lto_out_decl_state
>    /* If this out-decl state belongs to a function, fn_decl points to that
>       function.  Otherwise, it is NULL. */
>    tree fn_decl;
> +
> +  /* True if decl state is compressed.  */
> +  bool compressed;
>  };
>
>  typedef struct lto_out_decl_state *lto_out_decl_state_ptr;
> @@ -761,10 +768,18 @@ extern void lto_set_in_hooks (struct lto
>  extern struct lto_file_decl_data **lto_get_file_decl_data (void);
>  extern const char *lto_get_section_data (struct lto_file_decl_data *,
>                                           enum lto_section_type,
> -                                         const char *, size_t *);
> +                                         const char *, size_t *,
> +                                         bool decompress = false);
> +extern const char *lto_get_raw_section_data (struct lto_file_decl_data *,
> +                                             enum lto_section_type,
> +                                             const char *, size_t *);
>  extern void lto_free_section_data (struct lto_file_decl_data *,
> -                                   enum lto_section_type,
> -                                   const char *, const char *, size_t);
> +                                   enum lto_section_type,
> +                                   const char *, const char *, size_t,
> +                                   bool decompress = false);
> +extern void lto_free_raw_section_data (struct lto_file_decl_data *,
> +                                       enum lto_section_type,
> +                                       const char *, const char *, size_t);
>  extern htab_t lto_create_renaming_table (void);
>  extern void lto_record_renamed_decl (struct lto_file_decl_data *,
>                                       const char *, const char *);
> @@ -785,6 +800,7 @@ extern void lto_value_range_error (const
>  extern void lto_begin_section (const char *, bool);
>  extern void lto_end_section (void);
>  extern void lto_write_data (const void *, unsigned int);
> +extern void lto_write_raw_data (const void *, unsigned int);
>  extern void lto_write_stream (struct lto_output_stream *);
>  extern bool lto_output_decl_index (struct lto_output_stream *,
>                                     struct lto_tree_ref_encoder *,
> Index: lto-section-out.c
> ===================================================================
> --- lto-section-out.c (revision 231546)
> +++ lto-section-out.c (working copy)
> @@ -66,9 +66,6 @@ lto_begin_section (const char *name, boo
>  {
>    lang_hooks.lto.begin_section (name);
>
> -  /* FIXME lto: for now, suppress compression if the lang_hook that appends
> -     data is anything other than assembler output.  The effect here is that
> -     we get compression of IL only in non-ltrans object files.  */
>    gcc_assert (compression_stream == NULL);
>    if (compress)
>      compression_stream = lto_start_compression (lto_append_data, NULL);
> @@ -99,6 +96,14 @@ lto_write_data (const void *data, unsign
>      lang_hooks.lto.append_data ((const char *)data, size, NULL);
>  }
>
> +/* Write SIZE bytes starting at DATA to the assembler.  */
> +
> +void
> +lto_write_raw_data (const void *data, unsigned int size)
> +{
> +  lang_hooks.lto.append_data ((const char *)data, size, NULL);
> +}
> +
>  /* Write all of the chars in OBS to the assembler.  Recycle the blocks
>     in obs as this is being done.  */
>
> @@ -123,10 +128,6 @@ lto_write_stream (struct lto_output_stre
>        if (!next_block)
>          num_chars -= obs->left_in_block;
>
> -      /* FIXME lto: WPA mode uses an ELF function as a lang_hook to append
> -         output data.  This hook is not happy with the way that compression
> -         blocks up output differently to the way it's blocked here.  So for
> -         now, we don't compress WPA output.  */
>        if (compression_stream)
>          lto_compress_block (compression_stream, base, num_chars);
>        else
> @@ -295,6 +296,9 @@ lto_new_out_decl_state (void)
>    for (i = 0; i < LTO_N_DECL_STREAMS; i++)
>      lto_init_tree_ref_encoder (&state->streams[i]);
>
> +  /* At WPA time we do not compress sections by default.  */
> +  state->compressed = !flag_wpa;
> +
>    return state;
>  }
>
>

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
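
For reference, the lto_header alternative from the reply above could look roughly like the sketch below.  The three field names and types are the ones proposed in the reply; everything else (the _sketch suffix, the flag constant, the two helpers) is hypothetical and does not exist in GCC.  It only illustrates that the three fields fit in four bytes and that a compressed bit can then travel in the header of every section that has one.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: field names follow the reply above, the rest is made up.
struct lto_header_sketch
{
  int16_t major_version;
  int8_t minor_version;
  int8_t flags;
};

// Hypothetical flag bit: the section payload is compressed.
const int8_t LTO_SKETCH_FLAG_COMPRESSED = 1;

// The three fields pack into four bytes with no padding on common ABIs.
static_assert (sizeof (lto_header_sketch) == 4,
               "header sketch is expected to stay at four bytes");

// Build a header; the minor version now has to fit into eight bits.
inline lto_header_sketch
make_header_sketch (int major, int minor, bool compressed)
{
  assert (minor >= 0 && minor <= INT8_MAX);
  lto_header_sketch h;
  h.major_version = static_cast<int16_t> (major);
  h.minor_version = static_cast<int8_t> (minor);
  h.flags = compressed ? LTO_SKETCH_FLAG_COMPRESSED : 0;
  return h;
}

// Reader side: test the flag instead of flag_wpa / flag_ltrans.
inline bool
section_compressed_p (const lto_header_sketch &h)
{
  return (h.flags & LTO_SKETCH_FLAG_COMPRESSED) != 0;
}
```

Sections without any header (lto.opts, as noted above) would still need separate handling under this scheme.
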