[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #25 from GCC Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:677122c9df1b55a791a54426269f7a8ce794f947

commit r15-7384-g677122c9df1b55a791a54426269f7a8ce794f947
Author: Richard Biener
Date:   Wed Feb 5 13:17:47 2025 +0100

    rtl-optimization/117922 - disable fold-mem-offsets for highly connected CFG

    The PR shows fold-mem-offsets taking ages and a lot of memory computing
    DU/UD chains, as that requires the RD problem.  The issue is not so much
    the memory required for the pruned sets but the high CFG connectivity
    (and the fact that the CFG is cyclic), which makes solving the dataflow
    problem expensive.  The following adds the same limit as the one imposed
    by GCSE and CPROP.

            PR rtl-optimization/117922
            * fold-mem-offsets.cc (pass_fold_mem_offsets::execute): Do
            nothing for a highly connected CFG.
--- Comment #24 from Richard Biener ---
(In reply to Paolo Bonzini from comment #22)
> Yes, I suggested the MD problem because it did solve gambit performance
> issues with fwprop for good, but RTL-SSA is certainly applicable.
>
> Also, once all RD uses are converted to RTL-SSA it should really be removed
> from df.c as a bad idea (and probably also MD since it's unused *sheds a
> tear*).

Btw, all of UD/DU_CHAINS should be converted.

For the testcase at hand all 61152 blocks are in a single big cycle, so
optimizing the RD dataflow iteration looks difficult; it's just very cache
unfriendly.

df_worklist_dataflow_doublequeue: n_basic_blocks 61152 n_edges 48183769 count 378895 ( 6.2)

It's the no-longer-factored computed goto that blows up the number of edges
and thus the computational complexity here.  See the bug about bb-reorder
where we suggested keeping the actual CFG representation factored but
duplicating the computed goto instruction.
--- Comment #23 from Richard Sandiford ---
FWIW, running locally on x86 with fold_mem_offsets disabled (admittedly with
rtl checking), I see:

 combiner       : 0.91 (  0%)    21M (  0%)
 late combiner  : 1.31 (  0%)  1329k (  0%)

and:

 forward prop   : 1.41 (  0%)  1028k (  0%)

This includes two late-combine runs (one before and one after RA) and two
fwprop runs.  So the time and memory overhead seem reasonable for this
particular testcase.  That obviously doesn't mean that it's free of scaling
problems elsewhere, of course.
--- Comment #22 from Paolo Bonzini ---
Yes, I suggested the MD problem because it did solve gambit performance
issues with fwprop for good, but RTL-SSA is certainly applicable.

Also, once all RD uses are converted to RTL-SSA it should really be removed
from df.c as a bad idea (and probably also MD since it's unused *sheds a
tear*).
--- Comment #21 from Richard Sandiford ---
I should have said, but: another reason for suggesting rtl-ssa is that it is
usually self-updating.  My long-term hope is that we'll be able to maintain
it between passes, when we have consecutive passes that both use it.  That
isn't useful at the moment, because of the sparse usage, but it would become
more useful if more passes used the framework.

So in the long term, converting to rtl-ssa would mean that we might not need
to build the whole IR for this pass individually.  In contrast, doing local
DF would mean committing to extra work for this pass in particular.
--- Comment #20 from Richard Biener ---
For the time being you could go the ext-dce workaround route I added in
r15-6793-g03faac50791380 and disable the pass when computing the DF problems
is likely to get out of hand.

I agree that avoiding whole-function DF when it's not actually needed would
be good.  RTL-SSA might be OK, but I suspect it has similar worst cases when
operating on a whole function.  I suppose building it more locally might be
possible as well (I think a "local" DF would be possible, too, but it
requires some engineering here).
--- Comment #19 from Richard Sandiford ---
Mind if I take this and try converting it to rtl-ssa?  I think that would be
more forward-looking than adding more custom (dominator walk) code to the
pass itself.  Also, the switch to rtl-ssa did seem to make fwprop.c faster;
see https://gcc.gnu.org/pipermail/gcc-patches/2020-November/558922.html for
details.
--- Comment #18 from Paolo Bonzini ---
> Can someone help me find what Paolo referenced as "the multiple definitions
> DF problem that was introduced for fwprop in 2009"?

Yes, the patch is at
https://gcc.gnu.org/pipermail/gcc-patches/2009-June/263754.html

It's composed of two parts:

1) a cheap dataflow problem that records registers that have multiple
   definitions where the CFG joins; the problem instructs the pass to
   completely ignore those registers;

2) a dominator walk to note use-def relations where there can be only one
   definition.

The basic algorithm starts at function build_single_def_use_links() in
fwprop.c.  You can find the old version of the file in git; just ask if you
have more questions.
--- Comment #17 from Christoph Müllner ---
I reproduced the slow-down with a recent master on a 5950X:
* no fold-mem-offsets: 4m58.226s
* fold-mem-offsets:    11m19.311s (+127%)

More details from -ftime-report:
* no fold-mem-offsets:
  df reaching defs :   9.34 (  3%)  0 (  0%)
* fold-mem-offsets:
  df reaching defs : 381.40 ( 55%)  0 (  0%)

A look at the detailed time report (-ftime-report -ftime-report-details)
shows:

 Time variable                           wall               GGC
 [...]
 phase opt and generate          : 682.81 ( 99%)  6175M ( 97%)
 [...]
 callgraph functions expansion   : 646.99 ( 94%)  5695M ( 89%)
 [...]
 fold mem offsets                :   1.73 (  0%)   679k (  0%)
 `- CFG verifier                 :   2.10 (  0%)      0 (  0%)
 `- df use-def / def-use chains  :   2.32 (  0%)      0 (  0%)
 `- df reaching defs             : 370.68 ( 54%)      0 (  0%)
 `- verify RTL sharing           :   0.05 (  0%)      0 (  0%)
 [...]
 TOTAL                           : 690.06         6365M

I read this as "fold mem offsets utilizes 0% of memory", so there is no
issue with the memory footprint.  To confirm this, `time -v` was used:
* no fold-mem-offsets: Maximum resident set size (kbytes): 15563684
* fold-mem-offsets:    Maximum resident set size (kbytes): 15564364

I looked at the pass, and a few things could be cleaned up in the pass
itself (e.g., redundant calls).  However, that won't change anything in the
observed performance.  The time-consuming part is the UD+DU DF analysis for
the whole function.  Even if the pass did nothing but the analysis and then
returned immediately, we would end up with the same run time (confirmed by
measurement).

The pass operates at BB granularity, so DF analysis of the whole function
provides more information than needed.  When going through the
documentation, I came across df_set_blocks(), which I expected to reduce the
problem significantly.  So I moved the df_analyze() call into the
FOR_ALL_BB_FN() loop, right after a call to df_set_blocks(), with the intent
of having only a single block in the set per iteration.  However, that
triggered a few ICEs in DF, and once they were bypassed, it ended up in
practical non-termination (i.e., the calls to df_analyze() do not get
significantly cheaper with df_set_blocks()).

My conclusion: this can only be fixed by not using DF analysis and
implementing a pass-specific analysis.  So far, I have not found a good
solution for this.  But I haven't looked at all the suggestions in detail.

Can someone help me find what Paolo referenced as "the multiple definitions
DF problem that was introduced for fwprop in 2009"?
ptomsich at gcc dot gnu.org changed:

           What    |Removed                       |Added
 ---------------------------------------------------------------------------
                 CC|                              |ptomsich at gcc dot gnu.org
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |cmuellner at gcc dot gnu.org
Richard Biener changed:

           What    |Removed |Added
 --------------------------------------------------------
                 CC|        |philipp.tomsich at vrull dot eu

--- Comment #16 from Richard Biener ---
Please, somebody from vrull, address this bug.
Paolo Bonzini changed:

           What    |Removed |Added
 ----------------------------------------------
                 CC|        |bonzini at gnu dot org

--- Comment #15 from Paolo Bonzini ---
fold-mem-offsets seems quite similar to fwprop... perhaps it could use the
multiple definitions DF problem that was introduced for fwprop in 2009?
It's still there, though unused, and it's a lot easier to retrofit than the
RTL-SSA framework that fwprop now uses.
--- Comment #14 from rguenther at suse dot de ---
On Fri, 6 Dec 2024, ebotcazou at gcc dot gnu.org wrote:

> --- Comment #13 from Eric Botcazou ---
> > One issue in fold-mem-offsets is that it seems to do per-BB operation but
> > relies on global DF and looks at uses/defs also not within a BB?  If it's
> > supposed to be "global" then a better order than FOR_ALL_BB_FN might
> > improve things.
>
> It also seems to do the same work multiple times: for each "root" in a
> given BB, it goes up to its def in the BB and then goes down to all the
> uses of this def, but I presume that this could be done only once per def?

Yes.  Trying to understand how it works, I also believe the overall setup is
overly complex; something like get_single_def_in_bb shouldn't be necessary.
Walking from interesting register defs to interesting memory uses within a
BB should be better.
Eric Botcazou changed:

           What    |Removed |Added
 ----------------------------------------------
                 CC|        |ebotcazou at gcc dot gnu.org

--- Comment #13 from Eric Botcazou ---
> One issue in fold-mem-offsets is that it seems to do per-BB operation but
> relies on global DF and looks at uses/defs also not within a BB?  If it's
> supposed to be "global" then a better order than FOR_ALL_BB_FN might
> improve things.

It also seems to do the same work multiple times: for each "root" in a given
BB, it goes up to its def in the BB and then goes down to all the uses of
this def, but I presume that this could be done only once per def?
--- Comment #12 from Richard Biener ---
I'm bisecting with -fno-fold-mem-offsets now.
--- Comment #11 from Richard Biener ---
I'll note that with release checking (less garbage collection), even with
-fno-fold-mem-offsets I still see 20GB memory use and

 if-conversion 2              :  3.93 (  1%)    55k (  0%)
 `- loop init                 :  0.19 (  0%)   1512 (  0%)
 `- dominance computation     :  1.00 (  0%)      0 (  0%)
 `- df live regs              : 16.95 (  6%)      0 (  0%)
 `- df live&initialized regs  : 16.42 (  6%)      0 (  0%)
 hard reg cprop               :  4.00 (  2%)   9072 (  0%)
 `- df live regs              : 11.01 (  4%)      0 (  0%)
 `- df reg dead/unused notes  :  0.11 (  0%)   278k (  0%)
 `- df live&initialized regs  : 11.01 (  4%)      0 (  0%)

also

 reorder blocks               : 13.78 (  5%)  4663M ( 80%)
 ext dce                      : 13.48 (  5%)      0 (  0%)

so this needs more bisection (and split-out bugs).  GCC 14 managed with 8GB
memory and half the compile time (with -fno-fold-mem-offsets).  In
particular, the reorder blocks GC memory figure above sticks out, and
if-conversion 2 took just 0.1s with GCC 14.
--- Comment #10 from rguenther at suse dot de ---
On Fri, 6 Dec 2024, pinskia at gcc dot gnu.org wrote:

> --- Comment #9 from Andrew Pinski ---
> (In reply to Andrew Pinski from comment #6)
> > -O2 finally finished for me (--enable-checking=yes but -fno-checking):
> > df reaching defs          : 2245.15 ( 47%)      0 (  0%)
> > df live regs              :  457.94 ( 10%)      0 (  0%)
> > df live&initialized regs  :  221.75 (  5%)      0 (  0%)
> > scheduling                :  468.28 ( 10%)    13M (  0%)
>
> With -fno-fold-mem-offsets:
> df reaching defs          :  22.39 (  1%)      0 (  0%)
> df live regs              : 461.52 ( 18%)      0 (  0%)
> df live&initialized regs  : 223.56 (  9%)      0 (  0%)
> scheduling                : 467.40 ( 18%)    13M (  0%)
>
> So it is better but still pretty bad otherwise.

-ftime-report -ftime-report-details will tell you which pass did those DF
ops (iff the pass has a timevar...)
--- Comment #9 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #6)
> -O2 finally finished for me (--enable-checking=yes but -fno-checking):
> df reaching defs          : 2245.15 ( 47%)      0 (  0%)
> df live regs              :  457.94 ( 10%)      0 (  0%)
> df live&initialized regs  :  221.75 (  5%)      0 (  0%)
> scheduling                :  468.28 ( 10%)    13M (  0%)

With -fno-fold-mem-offsets:
 df reaching defs          :  22.39 (  1%)      0 (  0%)
 df live regs              : 461.52 ( 18%)      0 (  0%)
 df live&initialized regs  : 223.56 (  9%)      0 (  0%)
 scheduling                : 467.40 ( 18%)    13M (  0%)

So it is better but still pretty bad otherwise.
--- Comment #8 from GCC Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:8772f37e45e9401c9a361548e00c9691424e75e0

commit r15-5956-g8772f37e45e9401c9a361548e00c9691424e75e0
Author: Richard Biener
Date:   Fri Dec 6 08:08:55 2024 +0100

    rtl-optimization/117922 - add timevar for fold-mem-offsets

    The new fold-mem-offsets RTL pass takes a significant amount of time
    and memory.  Add a timevar for it.

            PR rtl-optimization/117922
            * timevar.def (TV_FOLD_MEM_OFFSETS): New.
            * fold-mem-offsets.cc (pass_data_fold_mem): Use
            TV_FOLD_MEM_OFFSETS.
--- Comment #7 from Richard Biener ---
One issue in fold-mem-offsets is that it seems to do per-BB operation but
relies on global DF and looks at uses/defs also not within a BB?  If it's
supposed to be "global" then a better order than FOR_ALL_BB_FN might improve
things.

I'll also note that it uses too many bitmaps in general and relies on UD and
DU chains, which are slow and expensive to compute (RD), instead of using
RTL-SSA or doing most of the work in a simple forward-scan way rather than
with full-blown DF.
--- Comment #6 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > For me targeting aarch64, the peak memory is over 26G and it is taking
> > over an hour.

-O2 finally finished for me (--enable-checking=yes but -fno-checking):
 df reaching defs          : 2245.15 ( 47%)      0 (  0%)
 df live regs              :  457.94 ( 10%)      0 (  0%)
 df live&initialized regs  :  221.75 (  5%)      0 (  0%)
 scheduling                :  468.28 ( 10%)    13M (  0%)
--- Comment #5 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #4)
> For me targeting aarch64, the peak memory is over 26G and it is taking over
> an hour.

 65.86%  cc1  [.] _Z15bitmap_ior_intoP11bitmap_headPKS_
 19.52%  cc1  [.] _ZL14bitmap_elt_iorP11bitmap_headP14bitmap_elementS2_PKS1_S4_b
  6.35%  cc1  [.] _ZL29df_worklist_propagate_forwardP8dataflowjPjP11bitmap_headS3_P17simple_bitmap_defR3vecIi7va_heap6vl_ptrEi

Looks like cprop_hardreg for aarch64, but I could be wrong.
--- Comment #4 from Andrew Pinski ---
For me targeting aarch64, the peak memory is over 26G and it is taking over
an hour.
Richard Biener changed:

           What    |Removed           |Added
 -----------------------------------------------------------------
          Component |tree-optimization |rtl-optimization
                 CC |                  |law at gcc dot gnu.org,
                    |                  |rguenth at gcc dot gnu.org,
                    |                  |rsandifo at gcc dot gnu.org,
                    |                  |tsamismanolis at gmail dot com
           Keywords |needs-bisection   |
           Priority |P3                |P1

--- Comment #3 from Richard Biener ---
I see we're running pass_fold_mem_offsets on x86, which takes most of the
time but should be useless on that target?  (And it misses a timevar.)

Using -fno-fold-mem-offsets gets us to

 TOTAL : 92.59 11.83 104.46 1027M
 92.59user 11.86system 1:44.49elapsed 99%CPU (0avgtext+0avgdata 8124868maxresident)k

GCC 14.2 is

 TOTAL : 95.59 15.26 110.88 1028M
 95.59user 15.30system 1:50.93elapsed 99%CPU (0avgtext+0avgdata 8117996maxresident)k

so the culprit is identified.