[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-02-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #25 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:677122c9df1b55a791a54426269f7a8ce794f947

commit r15-7384-g677122c9df1b55a791a54426269f7a8ce794f947
Author: Richard Biener 
Date:   Wed Feb 5 13:17:47 2025 +0100

rtl-optimization/117922 - disable fold-mem-offsets for highly connected CFG

The PR shows fold-mem-offsets taking ages and a lot of memory computing
DU/UD chains as that requires the RD problem.  The issue is not so much
the memory required for the pruned sets but the high CFG connectivity
(and that the CFG is cyclic) which makes solving the dataflow problem
expensive.

The following adds the same limit as the one imposed by GCSE and CPROP.

PR rtl-optimization/117922
* fold-mem-offsets.cc (pass_fold_mem_offsets::execute):
Do nothing for a highly connected CFG.
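
For reference, the GCSE/CPROP limit mentioned above is a simple edge-density
test.  A minimal sketch of that kind of early-out at the top of
pass_fold_mem_offsets::execute (the exact threshold, diagnostics and placement
in the actual commit may differ):

  unsigned int
  pass_fold_mem_offsets::execute (function *fn)
  {
    /* Computing DU/UD chains needs the reaching-definitions problem, which
       is too expensive on a highly connected CFG, so bail out early the way
       GCSE/CPROP do.  The constants here are illustrative only.  */
    if (n_edges_for_fn (fn) > 20000 + n_basic_blocks_for_fn (fn) * 4)
      return 0;

    /* ... the existing pass body ... */
  }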

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-02-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #24 from Richard Biener  ---
(In reply to Paolo Bonzini from comment #22)
> Yes, I suggested the MD problem because it did solve gambit performance
> issues with fwprop for good, but RTL-SSA is certainly applicable.
> 
> Also, once all RD uses are converted to RTL-SSA it should really be removed
> from df.c as a bad idea (and probably also MD since it's unused *sheds a
> tear*).

Btw, all of UD/DU_CHAINS should be converted.

For the testcase at hand all 61152 blocks are in a single big cycle, so
optimizing the RD dataflow iteration looks difficult; it's just very
cache-unfriendly.

df_worklist_dataflow_doublequeue: n_basic_blocks 61152 n_edges 48183769 count 378895 (  6.2)

It's the no-longer-factored computed goto that blows up the number of edges
and thus the computational complexity here.  See the bug about bb-reorder,
where we suggested keeping the actual CFG representation factored but
duplicating the computed goto instruction.
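
For scale: 48,183,769 edges over 61,152 blocks is roughly 788 edges per block,
whereas a CFG that keeps the computed goto factored through a single dispatch
block would contribute on the order of one edge per indirect jump plus one per
label.  A density limit of the kind sketched in the commit above (roughly
edges > 20000 + 4 * blocks, i.e. about 264,608 here) is exceeded by roughly two
orders of magnitude.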

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #23 from Richard Sandiford  ---
FWIW, running locally on x86 with fold_mem_offsets disabled (admittedly with
rtl checking), I see:

 combiner   :   0.91 (  0%)    21M (  0%)
 late combiner  :   1.31 (  0%)  1329k (  0%)

and:

 forward prop   :   1.41 (  0%)  1028k (  0%)

This includes two late-combine runs (one before and one after RA) and two
fwprop runs.

So the time and memory overhead seem reasonable for this particular testcase.
That doesn't mean it's free of scaling problems elsewhere, of course.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread bonzini at gnu dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #22 from Paolo Bonzini  ---
Yes, I suggested the MD problem because it did solve gambit performance issues
with fwprop for good, but RTL-SSA is certainly applicable.

Also, once all RD uses are converted to RTL-SSA it should really be removed
from df.c as a bad idea (and probably also MD since it's unused *sheds a
tear*).

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #21 from Richard Sandiford  ---
I should have said: another reason for suggesting rtl-ssa is that it is
usually self-updating.  My long-term hope is that we'll be able to maintain it
between passes, when we have consecutive passes that both use it.  That isn't
useful at the moment, because of the sparse usage, but it would become more
useful if more passes used the framework.

So in the long term, converting to rtl-ssa would mean that we might not need to
build the whole IR for this pass individually.  In contrast, doing local DF
means that we would be committing to doing extra work for this pass in
particular.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #20 from Richard Biener  ---
For the time being you could follow the ext-dce workaround I added in
r15-6793-g03faac50791380 and disable the pass when computing the DF problems
would likely get out of hand.

I agree that avoiding whole function DF when it's not actually needed would
be good.  RTL-SSA might be OK, but I suspect it has similar worst cases
when operating on a whole function.  I suppose building it more locally
might be possible as well (I think a "local" DF would be possible, too,
but it requires some engineering here).

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #19 from Richard Sandiford  ---
Mind if I take this and try converting it to rtl-ssa?  I think that would be
more forward-looking, rather than adding more custom (dominator walk) code to
the pass itself.  Also, the switch to rtl-ssa did seem to make fwprop.c faster:
see https://gcc.gnu.org/pipermail/gcc-patches/2020-November/558922.html for
details.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread bonzini at gnu dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #18 from Paolo Bonzini  ---
> Can someone help me find what Paolo referenced as "the multiple definitions
> DF problem that was introduced for fwprop in 2009"?

Yes, the patch is at
https://gcc.gnu.org/pipermail/gcc-patches/2009-June/263754.html

It's composed of two parts:

1) a cheap dataflow problem that records registers that have multiple
definitions where the CFG joins.  The problem instructs the pass to ignore
those registers completely.

2) a dominator walk to note use-def relations where there can be only one
definition.

The basic algorithm starts at function build_single_def_use_links() in
fwprop.c.  You can find the old version of the file in git; just ask if you
have more questions.
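
For concreteness, a self-contained toy sketch of that two-part idea
(hypothetical code, not the 2009 fwprop implementation; the dominator walk is
stood in for by a per-block scan seeded from the block-entry state, and all
types and names are made up):

// Phase 1 is the cheap dataflow problem: per register, does a unique
// definition or do multiple definitions reach each block entry?
// Phase 2 records use->def links only where the definition is unique.

#include <cstdio>
#include <vector>

static const int NONE = -1;   // no def reaches here
static const int MANY = -2;   // more than one def reaches here

struct Insn { std::vector<int> uses, defs; int id; };
struct Block { std::vector<Insn> insns; std::vector<int> preds; };

static int
meet (int a, int b)
{
  if (a == NONE) return b;
  if (b == NONE) return a;
  return a == b ? a : MANY;
}

int
main ()
{
  const int nregs = 4;
  // Diamond CFG: B0 -> {B1, B2} -> B3.
  std::vector<Block> cfg (4);
  cfg[1].preds = {0}; cfg[2].preds = {0}; cfg[3].preds = {1, 2};
  cfg[0].insns = {{{}, {1}, 0}};        // insn 0: defines r1
  cfg[1].insns = {{{}, {2}, 1}};        // insn 1: defines r2
  cfg[2].insns = {{{}, {2}, 2}};        // insn 2: defines r2 (again)
  cfg[3].insns = {{{1, 2}, {}, 3}};     // insn 3: uses r1 and r2

  // Phase 1: iterate the per-register block-entry states to a fixed point.
  std::vector<std::vector<int>> in (cfg.size (), std::vector<int> (nregs, NONE));
  for (bool changed = true; changed; )
    {
      changed = false;
      for (size_t b = 0; b < cfg.size (); b++)
        for (int r = 0; r < nregs; r++)
          {
            int v = NONE;
            for (int p : cfg[b].preds)
              {
                int out = in[p][r];           // pred's entry state ...
                for (const Insn &i : cfg[p].insns)
                  for (int d : i.defs)
                    if (d == r)
                      out = i.id;             // ... killed by defs in the pred
                v = meet (v, out);
              }
            if (v != in[b][r]) { in[b][r] = v; changed = true; }
          }
    }

  // Phase 2: scan each block and link uses to their unique reaching def.
  for (size_t b = 0; b < cfg.size (); b++)
    {
      std::vector<int> state = in[b];
      for (const Insn &i : cfg[b].insns)
        {
          for (int u : i.uses)
            {
              if (state[u] >= 0)
                printf ("use of r%d in insn %d <- def in insn %d\n",
                        u, i.id, state[u]);
              else if (state[u] == MANY)
                printf ("use of r%d in insn %d ignored (multiple defs)\n",
                        u, i.id);
            }
          for (int d : i.defs)
            state[d] = i.id;
        }
    }
  return 0;
}

Registers whose entry state degenerates to "many" are simply ignored, which is
what keeps the problem much cheaper than full reaching definitions.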

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-29 Thread cmuellner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #17 from Christoph Müllner  ---
I reproduced the slow-down with a recent master on a 5950X:
* no-fold-mem-offsets: 4m58.226s
* fold-mem-offsets: 11m19.311s (+127%)

More details from -ftime-report:
* no-fold-mem-offsets: df reaching defs   :   9.34 (  3%) 0  (  0%)
* fold-mem-offsets: df reaching defs  : 381.40 ( 55%) 0  (  0%)

A look at the detailed time report (-ftime-report -ftime-report-details) shows:

Time variable  wall   GGC
[...]
 phase opt and generate : 682.81 ( 99%)  6175M ( 97%)
 [...]
 callgraph functions expansion  : 646.99 ( 94%)  5695M ( 89%)
[...]
 fold mem offsets   :   1.73 (  0%)   679k (  0%)
 `- CFG verifier:   2.10 (  0%) 0  (  0%)
 `- df use-def / def-use chains :   2.32 (  0%) 0  (  0%)
 `- df reaching defs: 370.68 ( 54%) 0  (  0%)
 `- verify RTL sharing  :   0.05 (  0%) 0  (  0%)
[...]
 TOTAL  : 690.06 6365M

I read this as "fold mem offsets utilizes 0% of memory", so there is no issue
with the memory footprint.

To confirm this, `time -v` was used:
* no-fold-mem-offsets: Maximum resident set size (kbytes): 15563684
* fold-mem-offsets: Maximum resident set size (kbytes): 15564364

I looked at the pass, and a few things could be cleaned up in it
(e.g., redundant calls), but that won't change the observed performance.
The time-consuming part is the UD+DU DF analysis for the whole function:
even if the pass does nothing but the analysis and then returns 0 immediately,
we end up with the same run time (confirmed by measurement).

The pass operates at BB granularity, so DF analysis of the whole function
provides more information than needed.  When going through the documentation,
I came across df_set_blocks(), which I expected to reduce the problem
significantly.  So I moved the df_analyze() call into the FOR_ALL_BB_FN()
loop, right after a call to df_set_blocks(), intending to have only a single
block in the set per iteration.  However, that triggered a few ICEs in DF, and
once those were bypassed, it ended up in practical non-termination
(i.e. df_set_blocks() does not make the df_analyze() calls significantly
cheaper).
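
For reference, the attempted restructuring was roughly of this shape
(illustrative sketch only; as described above, it ICEs and does not make the
analysis cheaper):

  /* Restrict DF to the single block being processed.  */
  basic_block bb;
  FOR_ALL_BB_FN (bb, fn)
    {
      auto_bitmap blocks;
      bitmap_set_bit (blocks, bb->index);
      df_set_blocks (blocks);
      df_analyze ();
      /* ... per-block fold-mem-offsets work ... */
    }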

My conclusion:
This can only be fixed by not using DF analysis and implementing a
pass-specific analysis.

So far, I have not found a good solution for this. But I haven't looked at all
the suggestions in detail. Can someone help me find what Paolo referenced as
"the multiple definitions DF problem that was introduced for fwprop in 2009"?

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-10 Thread ptomsich at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

ptomsich at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ptomsich at gcc dot gnu.org
 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |cmuellner at gcc dot gnu.org

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2025-01-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

Richard Biener  changed:

   What|Removed |Added

 CC||philipp.tomsich at vrull dot eu

--- Comment #16 from Richard Biener  ---
Please somebody from vrull address this bug.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-09 Thread bonzini at gnu dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

Paolo Bonzini  changed:

   What|Removed |Added

 CC||bonzini at gnu dot org

--- Comment #15 from Paolo Bonzini  ---
fold-mem-offsets seems quite similar to fwprop... perhaps it could use the
multiple-definitions DF problem that was introduced for fwprop in 2009?
It's still there, though unused, and it's a lot easier to retrofit than the
RTL-SSA framework that fwprop now uses.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #14 from rguenther at suse dot de  ---
On Fri, 6 Dec 2024, ebotcazou at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922
> 
> Eric Botcazou  changed:
> 
>What|Removed |Added
> 
>  CC||ebotcazou at gcc dot gnu.org
> 
> --- Comment #13 from Eric Botcazou  ---
> > One issue in fold-mem-offsets is that it seems to do per-BB operation but
> > relies on global DF and looks at uses/defs also not within a BB?  If it's
> > supposed to be "global" then a better order than FOR_ALL_BB_FN might improve
> > things.
> 
> It also seems to do the same work multiple times: for each "root" in a given
> BB, it goes up to its def in the BB and then goes down to all the uses of this
> def, but I presume that this could be done only once per def?

Yes.  Trying to understand how it works, I also believe the overall
setup is overly complex - something like get_single_def_in_bb
shouldn't be necessary.  Walking from the interesting reg defs in a BB
to the interesting memory uses should be better.
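
As a toy illustration of that direction, a single forward scan over one block
that tracks "reg = base + const" definitions and folds the accumulated constant
into a later memory access (hypothetical standalone code, not the pass itself;
it ignores liveness of the intermediate registers and everything global):

#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct AddInsn { int dest, base, offset; };   // dest = base + offset
struct MemInsn { int addr_reg, disp; };       // load [addr_reg + disp]

int
main ()
{
  // One basic block:  r2 = r1 + 8;  r3 = r2 + 16;  load [r3 + 4]
  std::vector<AddInsn> adds = { {2, 1, 8}, {3, 2, 16} };
  MemInsn mem = {3, 4};

  // Forward scan: for each reg, the last known (ultimate base, total offset).
  std::map<int, std::pair<int, int>> known;
  for (const AddInsn &a : adds)
    {
      int base = a.base, off = a.offset;
      auto it = known.find (a.base);
      if (it != known.end ())                 // fold through earlier adds
        {
          base = it->second.first;
          off += it->second.second;
        }
      known[a.dest] = std::make_pair (base, off);
    }

  auto it = known.find (mem.addr_reg);
  if (it != known.end ())
    printf ("load [r%d + %d]  ->  load [r%d + %d]\n",
            mem.addr_reg, mem.disp,
            it->second.first, mem.disp + it->second.second);
  return 0;
}

This prints "load [r3 + 4]  ->  load [r1 + 28]"; the real pass would
additionally have to prove that the intermediate adds become dead and adjust
or remove them.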

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-06 Thread ebotcazou at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

Eric Botcazou  changed:

   What|Removed |Added

 CC||ebotcazou at gcc dot gnu.org

--- Comment #13 from Eric Botcazou  ---
> One issue in fold-mem-offsets is that it seems to do per-BB operation but
> relies on global DF and looks at uses/defs also not within a BB?  If it's
> supposed to be "global" then a better order than FOR_ALL_BB_FN might improve
> things.

It also seems to do the same work multiple times: for each "root" in a given
BB, it goes up to its def in the BB and then goes down to all the uses of this
def, but I presume that this could be done only once per def?

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #12 from Richard Biener  ---
I'm bisecting with -fno-fold-mem-offsets now.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #11 from Richard Biener  ---
I'll note that with release checking (less garbage collection) even with
-fno-fold-mem-offsets I still see 20GB memory use and

 if-conversion 2:   3.93 (  1%)    55k (  0%)
 `- loop init   :   0.19 (  0%)  1512  (  0%)
 `- dominance computation   :   1.00 (  0%) 0  (  0%)
 `- df live regs:  16.95 (  6%) 0  (  0%)
 `- df live&initialized regs:  16.42 (  6%) 0  (  0%)

 hard reg cprop :   4.00 (  2%)  9072  (  0%)
 `- df live regs:  11.01 (  4%) 0  (  0%)
 `- df reg dead/unused notes:   0.11 (  0%)   278k (  0%)
 `- df live&initialized regs:  11.01 (  4%) 0  (  0%)

also

 reorder blocks :  13.78 (  5%)  4663M ( 80%)
 ext dce:  13.48 (  5%) 0  (  0%)

so this needs more bisection (and splitting out separate bugs).  GCC 14
managed with 8GB memory and half the compile time (with -fno-fold-mem-offsets).
In particular, the reorder blocks GC memory figure above sticks out, and
if-conversion 2 took just 0.1s with GCC 14.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #10 from rguenther at suse dot de  ---
On Fri, 6 Dec 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922
> 
> --- Comment #9 from Andrew Pinski  ---
> (In reply to Andrew Pinski from comment #6)
> > (In reply to Andrew Pinski from comment #5)
> > > (In reply to Andrew Pinski from comment #4)
> > > > For me targeting aarch64, the peak memory is over 26G and it is taking
> > > > over an hour.
> > 
> > -O2 Finally finished for me:
> > (--enable-checking=yes but -fno-checking):
> >  df reaching defs   :2245.15 ( 47%) 0  (  0%)
> >  df live regs   : 457.94 ( 10%) 0  (  0%)
> >  df live&initialized regs   : 221.75 (  5%) 0  (  0%)
> >  scheduling : 468.28 ( 10%)    13M (  0%)
> 
> With -fno-fold-mem-offsets:
>  df reaching defs   :  22.39 (  1%) 0  (  0%)
>  df live regs   : 461.52 ( 18%) 0  (  0%)
>  df live&initialized regs   : 223.56 (  9%) 0  (  0%)
>  scheduling : 467.40 ( 18%)    13M (  0%)
> 
> So it is better but still pretty bad otherwise.

-ftime-report -ftime-report-details will tell you which pass did
those DF ops (iff the pass has a timevar...)

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #9 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > (In reply to Andrew Pinski from comment #4)
> > > For me targeting aarch64, the peak memory is over 26G and it is taking
> > > over an hour.
> 
> -O2 Finally finished for me:
> (--enable-checking=yes but -fno-checking):
>  df reaching defs   :2245.15 ( 47%) 0  (  0%)
>  df live regs   : 457.94 ( 10%) 0  (  0%)
>  df live&initialized regs   : 221.75 (  5%) 0  (  0%)
>  scheduling : 468.28 ( 10%)    13M (  0%)

With -fno-fold-mem-offsets:
 df reaching defs   :  22.39 (  1%) 0  (  0%)
 df live regs   : 461.52 ( 18%) 0  (  0%)
 df live&initialized regs   : 223.56 (  9%) 0  (  0%)
 scheduling : 467.40 ( 18%)    13M (  0%)

So it is better but still pretty bad otherwise.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #8 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:8772f37e45e9401c9a361548e00c9691424e75e0

commit r15-5956-g8772f37e45e9401c9a361548e00c9691424e75e0
Author: Richard Biener 
Date:   Fri Dec 6 08:08:55 2024 +0100

rtl-optimization/117922 - add timevar for fold-mem-offsets

The new fold-mem-offsets RTL pass takes a significant amount of time
and memory.  Add a timevar for it.

PR rtl-optimization/117922
* timevar.def (TV_FOLD_MEM_OFFSETS): New.
* fold-mem-offsets.cc (pass_data_fold_mem): Use
TV_FOLD_MEM_OFFSETS.
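
The change is roughly the following (sketch; the exact placement of the
DEFTIMEVAR entry and the surrounding pass_data fields may differ):

  /* timevar.def */
  DEFTIMEVAR (TV_FOLD_MEM_OFFSETS, "fold mem offsets")

  /* fold-mem-offsets.cc: in pass_data_fold_mem, the tv_id field now names
     the new timevar instead of having no dedicated one.  */
  TV_FOLD_MEM_OFFSETS, /* tv_id */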

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #7 from Richard Biener  ---
One issue in fold-mem-offsets is that it seems to do per-BB operation but
relies on global DF and looks at uses/defs also not within a BB?  If it's
supposed to be "global" then a better order than FOR_ALL_BB_FN might improve
things.

I'll also note it uses too many bitmaps in general and relies on UD and DU
chains, which are slow and expensive to compute (they require the RD problem),
instead of using RTL-SSA or doing most of the work in a simple forward scan
rather than full-blown DF.
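
For context, this is roughly how a pass pulls in those chains and, with them,
the expensive reaching-definitions (RD) problem (sketch; the exact flags used
in fold-mem-offsets.cc may differ):

  /* Requesting DU/UD chains makes df_analyze solve the whole-function
     reaching-definitions problem -- the "df reaching defs" item in
     -ftime-report.  */
  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
  df_analyze ();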

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #6 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > For me targeting aarch64, the peak memory is over 26G and it is taking over
> > an hour.

-O2 Finally finished for me:
(--enable-checking=yes but -fno-checking):
 df reaching defs   :2245.15 ( 47%) 0  (  0%)
 df live regs   : 457.94 ( 10%) 0  (  0%)
 df live&initialized regs   : 221.75 (  5%) 0  (  0%)
 scheduling : 468.28 ( 10%)    13M (  0%)

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #5 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #4)
> For me targetting aarch64, the peak memory is over 26G and it is taking over
> an hour.

  65.86%  cc1   [.] _Z15bitmap_ior_intoP11bitmap_headPKS_
  19.52%  cc1   [.] _ZL14bitmap_elt_iorP11bitmap_headP14bitmap_elementS2_PKS1_S4_b
   6.35%  cc1   [.] _ZL29df_worklist_propagate_forwardP8dataflowjPjP11bitmap_headS3_P17simple_bitmap_defR3vecIi7va_heap6vl_ptrEi

Looks like cprop_hardreg for aarch64 but I could be wrong.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

--- Comment #4 from Andrew Pinski  ---
For me targeting aarch64, the peak memory is over 26G and it is taking over an
hour.

[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

2024-12-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922

Richard Biener  changed:

   What|Removed |Added

  Component|tree-optimization   |rtl-optimization
 CC||law at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org,
   ||tsamismanolis at gmail dot com
   Keywords|needs-bisection |
   Priority|P3  |P1

--- Comment #3 from Richard Biener  ---
I see we're running pass_fold_mem_offsets on x86, which takes most of the time
but should be useless on that target? (And it lacks a timevar.)

Using -fno-fold-mem-offsets gets us to

 TOTAL  :  92.59  11.83  104.46   1027M
92.59user 11.86system 1:44.49elapsed 99%CPU (0avgtext+0avgdata 8124868maxresident)k

GCC 14.2 is

 TOTAL  :  95.59  15.26  110.88   1028M
95.59user 15.30system 1:50.93elapsed 99%CPU (0avgtext+0avgdata 8117996maxresident)k

so the culprit is identified.