On Sat, 15 Jun 2013, Jan Hubicka wrote:

> > 
> > I've managed to fix nearly all reported missed merged types for cc1.
> > Remaining are those we'll never be able to merge (merging would
> > change the SCC shape) and those that eventually end up referring
> > to a TYPE_STUB_DECL with a make_anon_name () IDENTIFIER_NODE.
> > For the latter we should find a middle-end solution as a followup
> > in case it really matters.
> > 
> > WPA statistics for stage2 cc1 are
> > 
> > [WPA] read 2495082 SCCs of average size 2.380088
> > [WPA] 5938514 tree bodies read in total
> > [WPA] tree SCC table: size 524287, 260253 elements, collision ratio: 0.804380
> > [WPA] tree SCC max chain length 11 (size 1)
> > [WPA] Compared 429412 SCCs, 7039 collisions (0.016392)
> > [WPA] Merged 426111 SCCs
> > [WPA] Merged 3313709 tree bodies
> > [WPA] Merged 225079 types
> > [WPA] 162844 types prevailed (488124 associated trees)
> > [WPA] Old merging code merges an additional 22412 types of which 21492 are in the same SCC with their prevailing variant (345831 and 323276 associated trees)
> > 
> > which shows there are 920 such TYPE_STUB_DECL issues and 21492
> > merges the old code did that destroyed SCCs.
> > 
> > Compared to the old code which only unified types and some selected
> > trees (INTEGER_CSTs), the new code can immediately ggc_free the
> > unified SCCs after they have been read, which results in 55% of
> > all tree bodies input into the WPA stage being freed (rather than hoping
> > for secondary GC walk effects as the old code did); 58% of
> > all types are recycled.
> > 
> > Compile time is at least on par with the old code now, and disk usage
> > grows by a moderate 10% due to the streaming format change.
> 
> On Firefox we now get
> [WPA] read 43144472 SCCs of average size 2.270524
> [WPA] 97960575 tree bodies read in total
> [WPA] tree SCC table: size 8388593, 3936571 elements, collision ratio: 0.727773
> [WPA] tree SCC max chain length 88 (size 1)
> [WPA] Compared 19030240 SCCs, 337719 collisions (0.017746)
> [WPA] Merged 18957101 SCCs
> [WPA] Merged 58202930 tree bodies
> [WPA] Merged 11800337 types
> [WPA] 4506307 types prevailed (13699881 associated trees)
> [WPA] Old merging code merges an additional 2174796 types of which 141104 are in the same SCC with their prevailing variant (12811826 and 6367853 associated trees)
> [WPA] GIMPLE canonical type table: size 131071, 77871 elements, 4506442 searches, 1130903 collisions (ratio: 0.250953)
> [WPA] GIMPLE canonical type hash table: size 8388593, 4506386 elements, 15712947 searches, 12879021 collisions (ratio: 0.819644)
> 
> and about 5GB of GGC memory after merging; overall the footprint is still
> around 10GB.
> It is a notable improvement over the old code, however, where we needed 16GB.
> 
> [LTRANS] read 319710 SCCs of average size 6.184039
> [LTRANS] 1977099 tree bodies read in total
> [LTRANS] GIMPLE canonical type table: size 16381, 9569 elements, 473131 searches, 24899 collisions (ratio: 0.052626)
> [LTRANS] GIMPLE canonical type hash table: size 1048573, 473076 elements, 1611909 searches, 1340396 collisions (ratio: 0.831558)
> 
> CPU: AMD64 family10, speed 2100 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 750000
> samples  %        app name                 symbol name
> 45047    11.7420  lto1                     inflate_fast

It might be worth changing LTO section layout to include a header
that specifies whether a section is compressed or not so we can
allow mixed compressed/uncompressed sections in the LTRANS files
and avoid decompressing the function sections.
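
To make the idea concrete, here is a minimal sketch of what such a
per-section header could look like on the reader side.  Everything in it
is an assumption for illustration only: the 5-byte layout, the flag bit
and the names sec_view and sec_parse are hypothetical and are not the
existing LTO section format.  The reader inspects one flag and only calls
into the decompressor when the flag is set, so uncompressed function
sections could be used in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not GCC's actual LTO section format:
   byte 0      flags (bit 0 set => payload is compressed)
   bytes 1-4   payload length, little-endian  */
enum { SEC_HEADER_SIZE = 5, SEC_FLAG_COMPRESSED = 1 };

struct sec_view
{
  const unsigned char *payload; /* first byte after the header */
  uint32_t len;                 /* payload length from the header */
  int compressed;               /* nonzero if the caller must inflate */
};

/* Parse the header at BUF (SIZE bytes available).  Returns 0 on success,
   -1 if the buffer is too short for the header or the declared payload.  */
static int
sec_parse (const unsigned char *buf, size_t size, struct sec_view *out)
{
  assert (buf != NULL && out != NULL);
  if (size < SEC_HEADER_SIZE)
    return -1;

  uint32_t len = (uint32_t) buf[1]
                 | ((uint32_t) buf[2] << 8)
                 | ((uint32_t) buf[3] << 16)
                 | ((uint32_t) buf[4] << 24);
  if (len > size - SEC_HEADER_SIZE)
    return -1;

  out->payload = buf + SEC_HEADER_SIZE;
  out->len = len;
  out->compressed = (buf[0] & SEC_FLAG_COMPRESSED) != 0;
  return 0;
}
```

A caller would check the compressed field of the returned view and skip
the inflate step entirely when it is zero.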

> 34224     8.9209  lto1                     streamer_read_uhwi(lto_input_block*)
> 24630     6.4201  lto1                     compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
> 23205     6.0487  lto1                     pointer_map_insert(pointer_map_t*, void const*)
> 20829     5.4293  lto1                     unpack_value_fields(data_in*, bitpack_d*, tree_node*)
> 13545     3.5307  lto1                     ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
> 12841     3.3472  libc-2.11.1.so           memset
> 11840     3.0862  lto1                     htab_find_slot_with_hash
> 11397     2.9708  lto1                     streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int, unsigned int*, bool)
> 11086     2.8897  lto1                     lto_input_tree(lto_input_block*, data_in*)
> 10522     2.7427  lto1                     lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
> 8853      2.3076  lto1                     unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
> 8539      2.2258  lto1                     hash_table<tree_scc_hasher, xcallocator>::find_slot_with_hash(tree_scc const*, unsigned int, insert_option)
> 7987      2.0819  lto1                     adler32
> 7743      2.0183  lto1                     streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
> 
> Can't we free the pointer map in streamer after every SCC?

You mean on read-in?  We can even do without the pointer-map there at all.

We can experiment with that as a followup.

>  phase stream in         : 244.05 (47%) usr   7.14 (25%) sys 252.74 (46%) wall 3478752 kB (93%) ggc
>  phase stream out        : 222.50 (43%) usr  21.22 (73%) sys 243.97 (44%) wall    7160 kB ( 0%) ggc
>  garbage collection      :  12.88 ( 2%) usr   0.00 ( 0%) sys  12.88 ( 2%) wall       0 kB ( 0%) ggc
>  ipa lto decl in         : 177.88 (34%) usr   4.87 (17%) sys 184.28 (34%) wall 3887482 kB (103%) ggc
>  ipa lto decl out        : 199.66 (39%) usr  11.30 (39%) sys 211.05 (38%) wall       0 kB ( 0%) ggc
>  ipa inlining heuristics :  29.49 ( 6%) usr   0.70 ( 2%) sys  30.21 ( 6%) wall 1353230 kB (36%) ggc
>  ipa lto decl merge      :  26.57 ( 5%) usr   0.00 ( 0%) sys  26.58 ( 5%) wall    8269 kB ( 0%) ggc
>  ipa lto cgraph merge    :  12.78 ( 2%) usr   0.04 ( 0%) sys  12.82 ( 2%) wall  142112 kB ( 4%) ggc
>  TOTAL                 : 517.75            29.13           548.71            3758407 kB
> 
> The longest-running ltrans adds another 400 seconds.
>  combiner                :  16.16 ( 4%) usr   0.08 ( 1%) sys  16.53 ( 4%) wall  205251 kB ( 6%) ggc
>  integrated RA           :  47.97 (12%) usr   0.21 ( 3%) sys  48.39 (12%) wall  391655 kB (12%) ggc
>  LRA hard reg assignment : 158.64 (39%) usr   0.02 ( 0%) sys 158.74 (38%) wall       0 kB ( 0%) ggc
>  TOTAL                 : 404.51             8.39           414.01            3215235 kB

Otherwise it looks pretty good.

Richard.
