> > CPU: AMD64 family10, speed 2100 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 750000
> > samples  %        app name                 symbol name
> > 45047    11.7420  lto1                     inflate_fast
> 
> It might be worth changing LTO section layout to include a header
> that specifies whether a section is compressed or not so we can
> allow mixed compressed/uncompressed sections in the LTRANS files
> and avoid decompressing the function sections.

Yes, but this profile shows only decl streaming; function bodies do not
really show up in it.  I guess the only way to cut this down is to use LZO,
which is faster on the decompression side, and/or to reduce the amount of
data we stream to the .o files.
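
A rough sketch of the per-section header idea quoted above, just to make the
layout concrete.  The one-byte tag, the enum and the helper names are made up
for illustration; this is not the current LTO section format:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical one-byte tag stored at the start of every section.
enum section_encoding : uint8_t { SECTION_RAW = 0, SECTION_ZLIB = 1 };

// Non-owning view of a parsed section: the tag plus the bytes after it.
struct section_view {
  section_encoding encoding;
  const uint8_t *payload;
  size_t payload_len;
};

// Writer side: prepend the tag to the (already encoded) payload.
std::vector<uint8_t> wrap_section(section_encoding enc,
                                  const std::vector<uint8_t> &payload) {
  std::vector<uint8_t> out;
  out.reserve(payload.size() + 1);
  out.push_back(enc);
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}

// Reader side: split the tag from the payload without touching the payload.
section_view parse_section(const uint8_t *data, size_t len) {
  assert(len >= 1 && data[0] <= SECTION_ZLIB);
  return {static_cast<section_encoding>(data[0]), data + 1, len - 1};
}

// Only sections tagged as compressed need to go through inflate.
bool needs_inflate(const section_view &s) {
  return s.encoding == SECTION_ZLIB;
}
```

With something like this, function sections could be written as SECTION_RAW
and read without going through inflate at all.
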
> 
> > 34224     8.9209  lto1                     streamer_read_uhwi(lto_input_block*)
> > 24630     6.4201  lto1                     compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
> > 23205     6.0487  lto1                     pointer_map_insert(pointer_map_t*, void const*)
> > 20829     5.4293  lto1                     unpack_value_fields(data_in*, bitpack_d*, tree_node*)
> > 13545     3.5307  lto1                     ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
> > 12841     3.3472  libc-2.11.1.so           memset
> > 11840     3.0862  lto1                     htab_find_slot_with_hash
> > 11397     2.9708  lto1                     streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int, unsigned int*, bool)
> > 11086     2.8897  lto1                     lto_input_tree(lto_input_block*, data_in*)
> > 10522     2.7427  lto1                     lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
> > 8853      2.3076  lto1                     unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
> > 8539      2.2258  lto1                     hash_table<tree_scc_hasher, xcallocator>::find_slot_with_hash(tree_scc const*, unsigned int, insert_option)
> > 7987      2.0819  lto1                     adler32
> > 7743      2.0183  lto1                     streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
> > 
> > Can't we free the pointer map in streamer after every SCC?
> 
> You mean on read-in?  We even can do without the pointer-map there at all.
> 
> We can experiment with that as a followup.

I believe it was needed for one of the cleanups (to update the map), but I
guess one can easily just run the fixup on the segment of the array
corresponding to the new SCC.
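
A minimal sketch of what running the fixup on only the new SCC's slice of the
cache array could look like, with no pointer map involved.  The node and
reader_cache types and the canonical field are hypothetical stand-ins, not
the real tree_node or streamer_tree_cache_d:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a streamed tree node.  `canonical` is non-null
// when the node was unified with an already existing node.
struct node {
  int id;
  node *canonical;
};

// Hypothetical reader cache: nodes are stored in read-in order, so each SCC
// occupies one contiguous segment [from, from + len) of the array.
struct reader_cache {
  std::vector<node *> nodes;
};

// Redirect the entries of a single SCC segment to their canonical nodes.
// Entries outside the segment are not touched.
void fixup_scc_segment(reader_cache &cache, size_t from, size_t len) {
  assert(from <= cache.nodes.size() && len <= cache.nodes.size() - from);
  for (size_t i = from; i < from + len; ++i)
    if (node *c = cache.nodes[i]->canonical)
      cache.nodes[i] = c;
}
```

The cost is then linear in the size of the SCC just read rather than in the
size of the whole cache.
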
> > The longest running ltrans add another 400 seconds.
> >  combiner                :  16.16 ( 4%) usr   0.08 ( 1%) sys  16.53 ( 4%) wall  205251 kB ( 6%) ggc
> >  integrated RA           :  47.97 (12%) usr   0.21 ( 3%) sys  48.39 (12%) wall  391655 kB (12%) ggc
> >  LRA hard reg assignment : 158.64 (39%) usr   0.02 ( 0%) sys 158.74 (38%) wall       0 kB ( 0%) ggc
> >  TOTAL                 : 404.51             8.39           414.01            3215235 kB
> 
> Otherwise it looks pretty good.

Indeed. We are getting closer to the numbers I measured on the same machine
in 2010, when Firefox was half its current size.

Thanks for all the hard work!
Honza
