On Sat, 15 Jun 2013, Jan Hubicka wrote:

> >
> > I've managed to fix nearly all reported missed merged types for cc1.
> > Remaining are those we'll never be able to merge (merging would
> > change the SCC shape) and those that eventually end up referring
> > to a TYPE_STUB_DECL with a make_anon_name () IDENTIFIER_NODE.
> > For the latter we should find a middle-end solution as a followup
> > in case it really matters.
> >
> > WPA statistics for stage2 cc1 are
> >
> > [WPA] read 2495082 SCCs of average size 2.380088
> > [WPA] 5938514 tree bodies read in total
> > [WPA] tree SCC table: size 524287, 260253 elements, collision ratio: 0.804380
> > [WPA] tree SCC max chain length 11 (size 1)
> > [WPA] Compared 429412 SCCs, 7039 collisions (0.016392)
> > [WPA] Merged 426111 SCCs
> > [WPA] Merged 3313709 tree bodies
> > [WPA] Merged 225079 types
> > [WPA] 162844 types prevailed (488124 associated trees)
> > [WPA] Old merging code merges an additional 22412 types of which 21492 are in the same SCC with their prevailing variant (345831 and 323276 associated trees)
> >
> > which shows there are 920 such TYPE_STUB_DECL issues and 21492
> > merges the old code did that destroyed SCCs.
> >
> > Compared to the old code which only unified types and some selected
> > trees (INTEGER_CSTs), the new code can immediately ggc_free the
> > unified SCCs after they have been read which results in 55% of
> > all tree bodies input into WPA stage to be freed (rather than hoping
> > on secondary GC walk effects as the old code relied on), 58% of
> > all types are recycled.
> >
> > Compile-time is at least on-par with the old code now and disk-usage
> > grows by a moderate 10% due to the streaming format change.
> >
> On Firefox we now get
> [WPA] read 43144472 SCCs of average size 2.270524
> [WPA] 97960575 tree bodies read in total
> [WPA] tree SCC table: size 8388593, 3936571 elements, collision ratio: 0.727773
> [WPA] tree SCC max chain length 88 (size 1)
> [WPA] Compared 19030240 SCCs, 337719 collisions (0.017746)
> [WPA] Merged 18957101 SCCs
> [WPA] Merged 58202930 tree bodies
> [WPA] Merged 11800337 types
> [WPA] 4506307 types prevailed (13699881 associated trees)
> [WPA] Old merging code merges an additional 2174796 types of which 141104 are in the same SCC with their prevailing variant (12811826 and 6367853 associated trees)
> [WPA] GIMPLE canonical type table: size 131071, 77871 elements, 4506442 searches, 1130903 collisions (ratio: 0.250953)
> [WPA] GIMPLE canonical type hash table: size 8388593, 4506386 elements, 15712947 searches, 12879021 collisions (ratio: 0.819644)
>
> and about 5GB of GGC memory after merging; overall the footprint is still
> around 10GB.
> It is a notable improvement over the old code, however, where we needed 16GB.
>
> [LTRANS] read 319710 SCCs of average size 6.184039
> [LTRANS] 1977099 tree bodies read in total
> [LTRANS] GIMPLE canonical type table: size 16381, 9569 elements, 473131 searches, 24899 collisions (ratio: 0.052626)
> [LTRANS] GIMPLE canonical type hash table: size 1048573, 473076 elements, 1611909 searches, 1340396 collisions (ratio: 0.831558)
>
> CPU: AMD64 family10, speed 2100 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 750000
> samples  %        app name         symbol name
> 45047    11.7420  lto1             inflate_fast

It might be worth changing LTO section layout to include a header
that specifies whether a section is compressed or not so we can allow
mixed compressed/uncompressed sections in the LTRANS files and avoid
decompressing the function sections.

> 34224    8.9209   lto1             streamer_read_uhwi(lto_input_block*)
> 24630    6.4201   lto1             compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
> 23205    6.0487   lto1             pointer_map_insert(pointer_map_t*, void const*)
> 20829    5.4293   lto1             unpack_value_fields(data_in*, bitpack_d*, tree_node*)
> 13545    3.5307   lto1             ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
> 12841    3.3472   libc-2.11.1.so   memset
> 11840    3.0862   lto1             htab_find_slot_with_hash
> 11397    2.9708   lto1             streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int, unsigned int*, bool)
> 11086    2.8897   lto1             lto_input_tree(lto_input_block*, data_in*)
> 10522    2.7427   lto1             lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
> 8853     2.3076   lto1             unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
> 8539     2.2258   lto1             hash_table<tree_scc_hasher, xcallocator>::find_slot_with_hash(tree_scc const*, unsigned int, insert_option)
> 7987     2.0819   lto1             adler32
> 7743     2.0183   lto1             streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
>
> Can't we free the pointer map in streamer after every SCC?

You mean on read-in?  We even can do without the pointer-map there
at all.  We can experiment with that as a followup.

> phase stream in         : 244.05 (47%) usr 7.14 (25%) sys 252.74 (46%) wall 3478752 kB (93%) ggc
> phase stream out        : 222.50 (43%) usr 21.22 (73%) sys 243.97 (44%) wall 7160 kB ( 0%) ggc
> garbage collection      : 12.88 ( 2%) usr 0.00 ( 0%) sys 12.88 ( 2%) wall 0 kB ( 0%) ggc
> ipa lto decl in         : 177.88 (34%) usr 4.87 (17%) sys 184.28 (34%) wall 3887482 kB (103%) ggc
> ipa lto decl out        : 199.66 (39%) usr 11.30 (39%) sys 211.05 (38%) wall 0 kB ( 0%) ggc
> ipa inlining heuristics : 29.49 ( 6%) usr 0.70 ( 2%) sys 30.21 ( 6%) wall 1353230 kB (36%) ggc
> ipa lto decl merge      : 26.57 ( 5%) usr 0.00 ( 0%) sys 26.58 ( 5%) wall 8269 kB ( 0%) ggc
> ipa lto cgraph merge    : 12.78 ( 2%) usr 0.04 ( 0%) sys 12.82 ( 2%) wall 142112 kB ( 4%) ggc
> TOTAL                   : 517.75 29.13 548.71 3758407 kB
>
> The longest running ltrans add another 400 seconds.
> combiner                : 16.16 ( 4%) usr 0.08 ( 1%) sys 16.53 ( 4%) wall 205251 kB ( 6%) ggc
> integrated RA           : 47.97 (12%) usr 0.21 ( 3%) sys 48.39 (12%) wall 391655 kB (12%) ggc
> LRA hard reg assignment : 158.64 (39%) usr 0.02 ( 0%) sys 158.74 (38%) wall 0 kB ( 0%) ggc
> TOTAL                   : 404.51 8.39 414.01 3215235 kB

Otherwise it looks pretty good.

Richard.