[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-01-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467
Bug 117467 depends on bug 116758, which changed state.

Bug 116758 Summary: [15 Regression] 25-40% binary size increase and up to 177% 
compile time increase for SPEC CPU wrf with Ofast since r15-3529-g506417dbc8b1cb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116758

           What    |Removed                     |Added

             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-01-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #13 from Richard Biener  ---
This is somewhat mitigated now but the actual inefficiency is still there.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-01-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #12 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:03faac507913803de76eab04fd74e754c70aa8c4

commit r15-6793-g03faac507913803de76eab04fd74e754c70aa8c4
Author: Richard Biener 
Date:   Fri Jan 10 12:30:29 2025 +0100

rtl-optimization/117467 - limit ext-dce memory use

The following puts in a hard limit on ext-dce because it might end
up requiring memory on the order of the number of basic blocks
times the number of pseudo registers.  The limiting follows what
GCSE based passes do and thus I re-use --param max-gcse-memory here.

This doesn't in any way address the implementation issues of the pass,
but it reduces the memory-use when compiling the
module_first_rk_step_part1.F90 TU from 521.wrf_r from 25GB to 1GB.

PR rtl-optimization/117467
PR rtl-optimization/117934
* ext-dce.cc (ext_dce_execute): Do nothing if a memory
allocation estimate exceeds what is allowed by
--param max-gcse-memory.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2025-01-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #11 from Richard Biener  ---
One issue with the dataflow problem is that it doesn't fit what
df_simple_dataflow expects - using

static bool ext_dce_rd_confluence_n (edge) { return true; }

will cause _all_ blocks to be iterated all the time.  The idea is that
ext_dce_rd_transfer_n would compute live_in from live_out (rather than
mangling both into its single 'livein' bitmap), and that
ext_dce_rd_confluence_n would compute live_out from the successors' live_in,
returning true only if live_out changed.
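
A minimal sketch of the scheme described above, written as a toy backward
liveness solver rather than against GCC's df framework.  `Block`,
`make_block`, `confluence_n`, `transfer` and `solve` are hypothetical names,
and `std::vector<bool>` stands in for GCC bitmaps.  What it tries to show is
that a predecessor is re-queued only when the confluence step reports that
its live_out actually changed.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

using RegSet = std::vector<bool>;   // toy stand-in for a GCC bitmap

struct Block {
  std::vector<int> succs, preds;    // CFG edges by block index
  RegSet use, def;                  // local sets, computed once
  RegSet live_in, live_out;         // dataflow solution, kept separate
};

// Hypothetical helper: a block with all sets empty for n_regs registers.
Block make_block(std::size_t n_regs) {
  Block b;
  b.use.assign(n_regs, false);
  b.def.assign(n_regs, false);
  b.live_in.assign(n_regs, false);
  b.live_out.assign(n_regs, false);
  return b;
}

// Confluence: merge a successor's live_in into b.live_out.
// Returns true only if live_out changed.
bool confluence_n(Block &b, const Block &succ) {
  bool changed = false;
  for (std::size_t r = 0; r < b.live_out.size(); ++r)
    if (succ.live_in[r] && !b.live_out[r]) {
      b.live_out[r] = true;
      changed = true;
    }
  return changed;
}

// Transfer: live_in = use | (live_out & ~def).
// Returns true only if live_in changed.
bool transfer(Block &b) {
  bool changed = false;
  for (std::size_t r = 0; r < b.live_in.size(); ++r) {
    bool v = b.use[r] || (b.live_out[r] && !b.def[r]);
    if (v != b.live_in[r]) {
      b.live_in[r] = v;
      changed = true;
    }
  }
  return changed;
}

// Worklist solver; returns how many times a transfer function was run.
int solve(std::vector<Block> &cfg) {
  std::deque<int> work;
  std::vector<bool> queued(cfg.size(), true);
  for (int i = static_cast<int>(cfg.size()) - 1; i >= 0; --i)
    work.push_back(i);              // visit every block at least once
  int evaluations = 0;
  while (!work.empty()) {
    int i = work.front();
    work.pop_front();
    queued[i] = false;
    ++evaluations;
    if (!transfer(cfg[i]))
      continue;                     // live_in unchanged: nothing to propagate
    for (int p : cfg[i].preds)
      if (confluence_n(cfg[p], cfg[i]) && !queued[p]) {
        queued[p] = true;
        work.push_back(p);
      }
  }
  return evaluations;
}
```

On a straight-line chain of three blocks, each block is evaluated exactly
once; a confluence function that unconditionally returns true gives up that
property.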

That this bug is now deferred to stage4 makes a proper complete rewrite (sic!)
hardly possible.

Most of the problem looks like computing LR, but we're intermingling this
with using LR as it evolves throughout the BB with defs to do the actual
ext-dce.  Why's this not simply using DF LR and doing a _single_ backward
walk performing the ext-dce?!

As said, I think this pass needs to be re-done from scratch, eventually
just killed off again for now (not to mention it's the triple duplicate
of similar functionality elsewhere...).

Alternatively it looks like memory should grow linearly with max_reg_num * 4 *
n_basic_blocks_for_fn, so disabling the pass when this becomes large is
necessary.  OTOH I hardly can see how this would get us to 25GB, so something
else might be broken here.  For module_first_rk_step_part1.fppized.f90 we
have max_reg_num == 262610 and last_basic_block is 32042, with full 'livein'
this would amount to around 8GB of bitmap memory (4 bits per reg, 50%
overhead).
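
The arithmetic behind the "around 8GB" figure can be sketched as follows.
The function name is made up, and reading "50% overhead" as "half of the
allocated memory is bookkeeping" (a factor of two on the payload) is an
assumption; it is the reading that lands near 8GB.

```cpp
#include <cstdint>

// Rough size in bytes of a full per-block 'livein' bitmap.
// bits_per_reg = 4 follows the comment above; overhead_factor = 2 is an
// assumed interpretation of "50% overhead", not a measured value.
constexpr std::uint64_t
livein_bytes_estimate(std::uint64_t n_regs, std::uint64_t n_blocks,
                      std::uint64_t bits_per_reg = 4,
                      std::uint64_t overhead_factor = 2)
{
  return n_regs * bits_per_reg * n_blocks * overhead_factor / 8;
}

// Numbers quoted for module_first_rk_step_part1.fppized.f90.
constexpr std::uint64_t wrf_estimate = livein_bytes_estimate(262610, 32042);
static_assert(wrf_estimate == 8414549620ULL, "about 8.4e9 bytes");
```

Under these assumptions the cost works out to one byte per pseudo register
per basic block, which is well short of the observed 25GB.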

I have a patch limiting us based on this like we do limit GCSE based passes.
We already do

Warning: const/copy propagation disabled: 36613 basic blocks and 247372
registers; increase '--param max-gcse-memory' above 1105827
[-Wdisabled-optimization]
module_first_rk_step_part1.fppized.f90:1315:36: Warning: PRE disabled: 36613
basic blocks and 247372 registers; increase '--param max-gcse-memory' above
1105827 [-Wdisabled-optimization]
module_first_rk_step_part1.fppized.f90:1315:36: Warning: const/copy propagation
disabled: 36613 basic blocks and 247372 registers; increase '--param
max-gcse-memory' above 1105827 [-Wdisabled-optimization]
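
The threshold printed in these warnings can be reproduced with a small
calculation.  This is a sketch, assuming one dense bitmap of 64-bit words per
basic block and a limit expressed in kB; the function names are illustrative
and not taken from the GCC sources.

```cpp
#include <cstdint>

// 64-bit words needed for a dense bitmap of n_bits bits (assumed word size).
constexpr std::uint64_t words_for_bits(std::uint64_t n_bits)
{
  return (n_bits + 63) / 64;
}

// Estimated memory in kB: one register bitmap per basic block.
constexpr std::uint64_t
gcse_memory_request_kb(std::uint64_t n_blocks, std::uint64_t n_regs)
{
  return n_blocks * words_for_bits(n_regs) * sizeof(std::uint64_t) / 1024;
}

// A pass would be skipped when the estimate exceeds the configured limit.
constexpr bool
too_expensive(std::uint64_t n_blocks, std::uint64_t n_regs,
              std::uint64_t limit_kb)
{
  return gcse_memory_request_kb(n_blocks, n_regs) > limit_kb;
}

// 36613 basic blocks and 247372 registers, as in the warnings above.
static_assert(gcse_memory_request_kb(36613, 247372) == 1105827,
              "matches the value printed in the warning");
```

With a limit of exactly 1105827 the check no longer fires, which is consistent
with the warning asking for a value "above" that number to be interpreted as
"at least".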

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-12-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #10 from GCC Commits  ---
The master branch has been updated by Andrew Macleod :

https://gcc.gnu.org/g:c7fd6c4369ef1a009b40c1787ea9d2dad2cf449f

commit r15-6000-gc7fd6c4369ef1a009b40c1787ea9d2dad2cf449f
Author: Andrew MacLeod 
Date:   Sat Nov 23 14:05:54 2024 -0500

Only add inferred ranges if they change the value.

Do not add an inferred range if it is already incorporated in the
current range of an SSA_NAME.

PR tree-optimization/117467
* gimple-range-infer.cc (infer_range_manager::add_ranges): Check
range_of_expr to see if the inferred range is needed.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-12-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #9 from GCC Commits  ---
The master branch has been updated by Andrew Macleod :

https://gcc.gnu.org/g:48eda34624fe5de050ae5ee38a360155ab188c39

commit r15-5998-g48eda34624fe5de050ae5ee38a360155ab188c39
Author: Andrew MacLeod 
Date:   Mon Nov 25 09:50:33 2024 -0500

Do not calculate an entry range for invariant names.

If an SSA_NAME is invariant, do not calculate an on_entry value.

PR tree-optimization/117467
* gimple-range-cache.cc (ranger_cache::entry_range): Do not
invoke range_from_dom for invariant ssa-names.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-14 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

Sam James  changed:

           What    |Removed                     |Added

             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |law at gcc dot gnu.org

--- Comment #8 from Sam James  ---
Assigning based on
https://inbox.sourceware.org/gcc-patches/6017e9f1-0e5d-4261-97e5-238442bb4...@gmail.com/.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #7 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:7a07de2c60b3c513b6aef206e9b55b3ffefe8b39

commit r15-5008-g7a07de2c60b3c513b6aef206e9b55b3ffefe8b39
Author: Richard Biener 
Date:   Thu Nov 7 09:23:03 2024 +0100

rtl-optimization/117467 - 33% compile-time in rest of compilation

ext-dce uses TV_NONE, that's not OK for a pass taking 33% compile-time.
The following adds a timevar to it for proper blaming.

PR rtl-optimization/117467
* timevar.def (TV_EXT_DCE): New.
* ext-dce.cc (pass_data_ext_dce): Use TV_EXT_DCE.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #6 from Andrew Pinski  ---
(In reply to Richard Biener from comment #5)
> So confirmed the 25GB memory use is ext-dce, with -fno-ext-dce memory use is
> down to 3GB.  The time report then shows VRP as offender:
> 
>  tree VRP               :  76.20 ( 23%)   125M (  4%)
>  dominator optimization :  28.30 (  8%)    84M (  3%)
> 
> given 25GB memory use is going to thrash most machines this is P1.
> 
> The testcase is quite small but has lots of calls with lots of arguments
> that might or might not invoke fortran array copying, it's probably
> difficult to reduce sensibly (it has lots of module USEs).  Jeff and Andrew
> should have access to SPEC, so I won't spend time trying at this point.

Note I think PR 116758 is recording the ranger/DOM/VRP side of things too.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

Richard Biener  changed:

           What    |Removed                     |Added

   Last reconfirmed|                            |2024-11-07
           Priority|P3                          |P1
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #5 from Richard Biener  ---
So confirmed the 25GB memory use is ext-dce, with -fno-ext-dce memory use is
down to 3GB.  The time report then shows VRP as offender:

 tree VRP               :  76.20 ( 23%)   125M (  4%)
 dominator optimization :  28.30 (  8%)    84M (  3%)

given 25GB memory use is going to thrash most machines this is P1.

The testcase is quite small but has lots of calls with lots of arguments
that might or might not invoke fortran array copying, it's probably
difficult to reduce sensibly (it has lots of module USEs).  Jeff and Andrew
should have access to SPEC, so I won't spend time trying at this point.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

Richard Biener  changed:

           What    |Removed                     |Added

                 CC|                            |amacleod at redhat dot com

--- Comment #4 from Richard Biener  ---
-   28.13%    28.04%    590285  f951  f951  [.] bitmap_bit_p(bitmap_head const*, int)
   + 8.28% _start
   - 2.79% gimple_simplify_PLUS_EXPR(gimple_match_op*, gimple**, tree_node* (*)(tree_node*), code_helper, tree_node*,
      - 2.78% gimple_resimplify2(gimple**, gimple_match_op*, tree_node* (*)(tree_node*))
         - gimple_simplify_MULT_EXPR(gimple_match_op*, gimple**, tree_node* (*)(tree_node*), code_helper, tree_node*,
            - 2.26% pta_valueize(tree_node*)
                 range_query::value_of_expr(tree_node*, gimple*)
               + gimple_ranger::range_of_expr(vrange&, tree_node*, gimple*)
   + 2.36% gimple_ranger::range_of_stmt(vrange&, gimple*, tree_node*)
   + 2.23% gimple_simplify_POINTER_PLUS_EXPR(gimple_match_op*, gimple**, tree_node* (*)(tree_node*), code_helper, tree
   + 2.07% fold_using_range::fold_stmt(vrange&, gimple*, fur_source&, tree_node*)
   + 2.03% execute_ranger_vrp(function*, bool)

this seems in the end all related to prange and pta_valueize?

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #3 from Richard Biener  ---
(In reply to Richard Biener from comment #2)
>   - 32.11% (anonymous namespace)::pass_ext_dce::execute(function*)
>      - ext_dce_execute()
>         - 32.10% df_worklist_dataflow(dataflow*, bitmap_head*, int*, int)
>            - 32.08% ext_dce_rd_transfer_n(int)
>               + 14.75% ext_dce_process_uses(rtx_insn*, rtx_def*, bitmap_head*, bool)
>               + 8.18% bitmap_ior_into(bitmap_head*, bitmap_head const*)
>               + 4.49% ext_dce_process_sets(rtx_insn*, rtx_def*, bitmap_head*)
>                 3.34% bitmap_copy(bitmap_head*, bitmap_head const*)
>                 1.31% bitmap_equal_p(bitmap_head const*, bitmap_head const*)
> 
> likely (unverified) also the source of 25GB memory use.
> 
> The DF problem seems seriously unoptimized - it lacks a separate "local"
> compute step (the ext_dce_process_sets part that populates live_tmp
> _per insn_!).

That is, usually the transfer function is the IOR of input and appropriate
IOR/whatever of the (cached!) local compute result.
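
As an illustration of that split, here is a small sketch.  It is not the
ext-dce code: `Insn`, `LocalSets`, `compute_local` and `transfer` are made-up
names, and registers are limited to a 64-bit mask for brevity.  The local sets
are computed once per block, and each dataflow iteration then only combines
them with the incoming set.

```cpp
#include <cstdint>
#include <vector>

// One instruction, described by the registers it reads and writes (bit masks).
struct Insn {
  std::uint64_t uses;
  std::uint64_t defs;
};

// Cached per-block summary: upward-exposed uses and all definitions.
struct LocalSets {
  std::uint64_t gen = 0;
  std::uint64_t kill = 0;
};

// Local compute step: a single backward walk over the block's instructions.
// This runs once per block, not once per dataflow iteration.
LocalSets compute_local(const std::vector<Insn> &insns) {
  LocalSets s;
  for (auto it = insns.rbegin(); it != insns.rend(); ++it) {
    s.gen = (s.gen & ~it->defs) | it->uses;
    s.kill |= it->defs;
  }
  return s;
}

// Transfer function: IOR of the cached local result with the filtered input.
inline std::uint64_t transfer(const LocalSets &s, std::uint64_t live_out) {
  return s.gen | (live_out & ~s.kill);
}
```

For a block containing `r0 = r1 + r2; r3 = r0`, the summary has r1 and r2 in
`gen` and r0 and r3 in `kill`, so re-evaluating the block during iteration
costs a few bitwise operations instead of another walk over its instructions.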

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

Richard Biener  changed:

           What    |Removed                     |Added

                 CC|                            |law at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
  - 32.11% (anonymous namespace)::pass_ext_dce::execute(function*)
     - ext_dce_execute()
        - 32.10% df_worklist_dataflow(dataflow*, bitmap_head*, int*, int)
           - 32.08% ext_dce_rd_transfer_n(int)
              + 14.75% ext_dce_process_uses(rtx_insn*, rtx_def*, bitmap_head*, bool)
              + 8.18% bitmap_ior_into(bitmap_head*, bitmap_head const*)
              + 4.49% ext_dce_process_sets(rtx_insn*, rtx_def*, bitmap_head*)
                3.34% bitmap_copy(bitmap_head*, bitmap_head const*)
                1.31% bitmap_equal_p(bitmap_head const*, bitmap_head const*)

likely (unverified) also the source of 25GB memory use.

The DF problem seems seriously unoptimized - it lacks a separate "local"
compute step (the ext_dce_process_sets part that populates live_tmp
_per insn_!).

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

--- Comment #1 from Richard Biener  ---
Samples: 2M of event 'cycles:Pu', Event count (approx.): 2183019518772
Overhead   Samples  Command  Shared Object  Symbol
  29.34%    627170  f951     f951           [.] bitmap_bit_p(bitmap_head const*, int)
  10.68%    231516  f951     f951           [.] bitmap_set_bit(bitmap_head*, int)
   5.62%    122003  f951     f951           [.] bitmap_set_range(bitmap_head*, unsigned int, unsigned int) [
   5.23%    113260  f951     f951           [.] bitmap_list_insert_element_after(bitmap_head*, bitmap_elemen
   4.23%     90654  f951     f951           [.] df_count_refs(bool, bool, bool)
   4.03%     87072  f951     f951           [.] bitmap_ior_into(bitmap_head*, bitmap_head const*)
   3.53%     77003  f951     f951           [.] bitmap_copy(bitmap_head*, bitmap_head const*)
   2.78%     59162  f951     f951           [.] bitmap_and_compl_into(bitmap_head*, bitmap_head const*)
   1.60%     34193  f951     f951           [.] lra_remat()

yay.

  + 32.11% (anonymous namespace)::pass_ext_dce::execute(function*)  

is the "rest of compilation", fixing that.

[Bug tree-optimization/117467] [15 Regression] 521.wrf_r again explodes memory/compile-time wise

2024-11-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117467

Richard Biener  changed:

           What    |Removed                     |Added

   Target Milestone|---                         |15.0
           Keywords|                            |compile-time-hog,
                   |                            |memory-hog