[Bug middle-end/64928] [4.8/4.9/5 Regression] Inordinate cpu time and memory usage in phase opt and generate with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #16 from Richard Biener rguenth at gcc dot gnu.org ---
Created attachment 34974
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34974&action=edit
Patch to limit coalescing amount

The committed patch improves peak memory usage from 7.6GB to 5.8GB for the small testcase. The attached patch reduces the memory used by SSA coalescing further (to ~300MB) by simply doing less coalescing. Unfortunately the generated RTL puts a bigger load on CSE/DF, so we need 7.6GB again (one could eventually find an optimal --param max-out-of-ssa-coalesce-names, but that is probably highly testcase-specific).

In theory you can iterate on coalescing piecewise as well, but the overhead of doing this might be too big (basically up to computing live/conflict for each coalesce pair separately, taking into account previous coalesces).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #18 from Richard Biener rguenth at gcc dot gnu.org ---
(In reply to Richard Biener from comment #17)
> Created attachment 34975 [details]
> do not compute live/conflict for abnormal coalesces
>
> This is the other idea of simply not computing live/conflict for abnormal
> coalesces we know to always succeed. This shrinks the following
> live/conflict problem for the regular coalesces by unifying some partitions.
>
> Doesn't help this particular testcase much.

But it fixes PR63155 ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #15 from Richard Biener rguenth at gcc dot gnu.org ---
Author: rguenth
Date: Fri Mar 6 12:34:28 2015
New Revision: 221237

URL: https://gcc.gnu.org/viewcvs?rev=221237&root=gcc&view=rev
Log:
2015-03-06  Richard Biener  rguent...@suse.de

	PR middle-end/64928
	* tree-ssa-live.h (struct tree_live_info_d): Add livein_obstack
	and liveout_obstack members.
	(calculate_live_on_exit): Remove.
	(calculate_live_ranges): Change declaration.
	* tree-ssa-live.c (liveness_bitmap_obstack): Remove global var.
	(new_tree_live_info): Adjust.
	(calculate_live_ranges): Delete livein when not wanted.
	(calculate_live_ranges): Do not initialize liveness_bitmap_obstack.
	Deal with partly deleted live info.
	(loe_visit_block): Remove temporary bitmap by using
	bitmap_ior_and_compl_into.
	(live_worklist): Adjust accordingly.
	(calculate_live_on_exit): Make static.
	* tree-ssa-coalesce.c (coalesce_ssa_name): Tell calculate_live_ranges
	we do not need livein.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-coalesce.c
    trunk/gcc/tree-ssa-live.c
    trunk/gcc/tree-ssa-live.h
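
The loe_visit_block entry above mentions dropping a temporary bitmap in favour of bitmap_ior_and_compl_into. As a rough illustration of why a fused update of the form A |= (B & ~C) needs no temporary, here is a minimal sketch on plain word arrays. The function name bitset_ior_and_compl_into and the dense uint64_t layout are assumptions made for this example only; GCC's own bitmaps are a different, sparse data structure, and this is not the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dense-bitset helper (not GCC's bitmap implementation):
   compute dst |= (src & ~mask) word by word, without allocating a
   temporary set.  Returns true if any bit of dst changed, which is what
   a worklist-driven dataflow solver needs to decide whether to revisit
   a block.  */
bool
bitset_ior_and_compl_into (uint64_t *dst, const uint64_t *src,
                           const uint64_t *mask, size_t nwords)
{
  bool changed = false;

  assert (dst != NULL && src != NULL && mask != NULL);
  for (size_t i = 0; i < nwords; i++)
    {
      /* Fuse the and-complement and the ior into one expression.  */
      uint64_t merged = dst[i] | (src[i] & ~mask[i]);
      if (merged != dst[i])
        {
          dst[i] = merged;
          changed = true;
        }
    }
  return changed;
}
```

The two-step form (tmp = src & ~mask, then dst |= tmp) produces the same result but has to allocate and free tmp on every visit, which adds up when the function is called once per block per iteration.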
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #17 from Richard Biener rguenth at gcc dot gnu.org ---
Created attachment 34975
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34975&action=edit
do not compute live/conflict for abnormal coalesces

This is the other idea of simply not computing live/conflict for abnormal coalesces we know to always succeed. This shrinks the following live/conflict problem for the regular coalesces by unifying some partitions.

Doesn't help this particular testcase much.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #14 from Richard Biener rguenth at gcc dot gnu.org ---
Note that if we fix out-of-SSA coalescing (patch in testing) then RTL CSE explodes via DF.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Steven Bosscher steven at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |steven at gcc dot gnu.org

--- Comment #12 from Steven Bosscher steven at gcc dot gnu.org ---
(In reply to Richard Biener from comment #9)
> It seems that loop invariant motion is responsible for most of the
> abnormals, thus -fno-tree-loop-im restores performance.
>
> The loop LIM detects is of style
>
> <bb 6>: (header)
>   # ___fp_3(ab) = PHI <___fp_41(4), ___fp_5(21)>
>   # ___r1_7(ab) = PHI <___r1_42(4), ___r1_9(21)>
>   # ___r2_11(ab) = PHI <___r2_43(4), ___r3_17(21)>
>   # ___r3_19(ab) = PHI <___r3_44(4), ___r3_23(21)>
>   # ___r4_25 = PHI <___r4_45(4), ___r4_26(21)>
>   # gotovar.17_29 = PHI <_51(4), _69(21)>
>   goto gotovar.17_29;

Perhaps disable LIM (and maybe PRE) if the CFG has a large edge/bb ratio (i.e. dense CFG)? There's probably no benefit in such cases anyway.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #13 from Jeffrey A. Law law at redhat dot com ---
I think we've done similar things for Brad's large testcases in the past. You want to look at both the edge/bb density and the overall size, i.e., a high density doesn't really hurt if the total CFG is small.

See is_too_expensive in gcse.c for the current heuristics to avoid trying global opts on these kinds of testcases.
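
To make the "density and overall size" point concrete, here is a minimal sketch of such a gate. It is not a copy of is_too_expensive in gcse.c: the function name cfg_too_expensive and all three thresholds are hypothetical values picked for illustration, and a real check would read the block and edge counts from the function's CFG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All three limits are made-up illustration values, not GCC's.  */
#define MIN_BLOCKS_TO_CARE  1000   /* below this, never bail out */
#define MAX_EDGES_PER_BLOCK 4      /* tolerated edge/bb density */
#define EDGE_SLACK          20000  /* absolute headroom before density counts */

/* Hypothetical gate for an expensive global optimization: return true
   when the CFG is both large and dense.  A small CFG is always accepted,
   however dense, because its absolute cost stays bounded.  */
bool
cfg_too_expensive (size_t n_blocks, size_t n_edges)
{
  assert (n_blocks > 0);
  if (n_blocks < MIN_BLOCKS_TO_CARE)
    return false;
  return n_edges > EDGE_SLACK + n_blocks * MAX_EDGES_PER_BLOCK;
}
```

With these numbers a 100-block function with 5000 edges still gets optimized, while a 50000-block function is skipped once it exceeds 220000 edges. A pass such as LIM or PRE could consult a check of this shape before doing its work, which is the kind of cut-off suggested in comment #12.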
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #11 from Richard Biener rguenth at gcc dot gnu.org ---
Ok, so it's already calculate_live_ranges that takes much memory. I have a small patch to improve that somewhat. But what we really need is to get the must-coalesce stuff coalesced before both the live and the conflict computation. That is, map must-coalesce SSA vars to the same partition. That loses the SSA corruption testing, though, so it might be much more controversial (silent wrong-code instead of ICE).

Unfortunately in the testcase there are only 2750 must-coalesces but 109493 partitions participating in the coalescing (so at least 5 want coalesces).

The good news is of course that we can simply choose to _not_ coalesce that many variables, but say only the important ones.
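
One way to picture "map must-coalesce SSA vars to the same partition" is a union-find structure that is filled in before liveness and conflicts are computed, so that every later bitmap is indexed by partition representative rather than by SSA name. The sketch below is only that picture: the pmap_* names, the fixed MAX_NAMES capacity and the integer name ids are assumptions for this example and do not correspond to GCC's var_map or partition code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NAMES 16  /* hypothetical fixed capacity for the sketch */

/* Hypothetical partition map: parent[i] == i marks a representative.  */
struct pmap
{
  int parent[MAX_NAMES];
};

/* Start with every SSA name in its own partition.  */
void
pmap_init (struct pmap *map)
{
  for (int i = 0; i < MAX_NAMES; i++)
    map->parent[i] = i;
}

/* Return the representative of NAME's partition, halving the path on
   the way so later lookups get cheaper.  */
int
pmap_find (struct pmap *map, int name)
{
  assert (name >= 0 && name < MAX_NAMES);
  while (map->parent[name] != name)
    {
      map->parent[name] = map->parent[map->parent[name]];
      name = map->parent[name];
    }
  return name;
}

/* Record a must-coalesce pair: merge the partitions of A and B and
   return the surviving representative.  */
int
pmap_union (struct pmap *map, int a, int b)
{
  int ra = pmap_find (map, a);
  int rb = pmap_find (map, b);

  if (ra != rb)
    map->parent[rb] = ra;
  return ra;
}
```

After feeding in all must-coalesce pairs, the live and conflict problems only need one bit per distinct representative. As the comment notes, merging this early means a conflict between two must-coalesce names is never observed, so corrupted SSA would no longer be caught at this point.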
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Jeffrey A. Law law at redhat dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com

--- Comment #10 from Jeffrey A. Law law at redhat dot com ---
Might want to look at 65076 as well where phase opt and generate is taking 89% of the compile time. Might be a better testcase to work with.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener rguenth at gcc dot gnu.org ---
It seems that loop invariant motion is responsible for most of the abnormals, thus -fno-tree-loop-im restores performance.

The loop LIM detects is of style

<bb 6>: (header)
  # ___fp_3(ab) = PHI <___fp_41(4), ___fp_5(21)>
  # ___r1_7(ab) = PHI <___r1_42(4), ___r1_9(21)>
  # ___r2_11(ab) = PHI <___r2_43(4), ___r3_17(21)>
  # ___r3_19(ab) = PHI <___r3_44(4), ___r3_23(21)>
  # ___r4_25 = PHI <___r4_45(4), ___r4_26(21)>
  # gotovar.17_29 = PHI <_51(4), _69(21)>
  goto gotovar.17_29;

...

<bb 21>: (latch)
  _67 = ___pc_1 + 15;
  _68 = (void * *) _67;
  _69 = *_68;
  PROF_edge_counter_142 = __gcov0.___H_object_2d__3e_u8vector[14];
  PROF_edge_counter_143 = PROF_edge_counter_142 + 1;
  __gcov0.___H_object_2d__3e_u8vector[14] = PROF_edge_counter_143;
  goto <bb 6>;

Not sure if we should artificially limit such loops. LIM doesn't account for the (compile-time) cost of needing very many PHIs when rewriting the store-motion vars into SSA form (but it could in theory estimate that cost by taking into account the CFG structure of the loop).

Let's see if we can first generate a smaller testcase to illustrate the issue.

Mine for now.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-02-09
     Ever confirmed|0                           |1

--- Comment #8 from Richard Biener rguenth at gcc dot gnu.org ---
Ok, so it seems the memory is used by out-of-SSA:

#5  0x00c9eebc in coalesce_ssa_name ()
    at /space/rguenther/src/svn/gcc-4_9-branch/gcc/tree-ssa-coalesce.c:1330
1330      graph = build_ssa_conflict_graph (liveinfo);
(gdb) p *cl->list.htab
$10 = {entries = 0x2b19b30, size = 524287, n_elements = 77146, n_deleted = 0,
  searches = 122189, collisions = 6508, size_prime_index = 16}

where we malloc(!) 77146 entries of size 12. But the real problem is of course the conflict graph, with 76063 bitmaps eating up around 1GB of memory for the first testcase (and function ___H__23__23_u8vector_2d__3e_object).

That's likely caused by the change to more aggressively coalesce anonymous SSA names.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |memory-hog
          Component|other                       |middle-end
      Known to work|                            |4.4.7
   Target Milestone|---                         |4.8.5
            Summary|Inordinate cpu time and     |[4.8/4.9/5 Regression]
                   |memory usage in phase opt   |Inordinate cpu time and
                   |and generate with           |memory usage in phase opt
                   |-ftest-coverage             |and generate with
                   |-fprofile-arcs              |-ftest-coverage
                   |                            |-fprofile-arcs

--- Comment #7 from Richard Biener rguenth at gcc dot gnu.org ---
Given the description, I suppose that non-profiling/coverage mode is fine.