[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-10-02 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #58 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:4ba4165d66b18d7c5b8af02ecdf38bfa0690c106

commit r15-4017-g4ba4165d66b18d7c5b8af02ecdf38bfa0690c106
Author: Richard Biener 
Date:   Thu Sep 26 11:43:21 2024 +0200

tree-optimization/114855 - profile prediction slowness

The testcase in PR114855 shows profile prediction to evaluate
the same SSA def via expr_expected_value for each condition or
switch in a function.  The following patch caches the expected
value (and probability/predictor) for each visited SSA def,
also protecting against recursion and thus obsoleting the visited
bitmap.  This reduces the time spent in branch prediction from
1.2s to 0.3s, though callgrind (which is how I noticed this) seems
comparatively much happier about the change than this number
suggests.

PR tree-optimization/114855
* predict.cc (ssa_expected_value): New global.
(expr_expected_value): Do not take bitmap.
(expr_expected_value_1): Likewise.  Use ssa_expected_value
to cache results for a SSA def.
(tree_predict_by_opcode): Adjust.
(tree_estimate_probability): Manage ssa_expected_value.
(tree_guess_outgoing_edge_probabilities): Likewise.
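
For reference, a minimal self-contained sketch of the caching pattern the
commit describes - memoizing per-SSA-def results, with the map entry itself
acting as the recursion guard that obsoletes the visited bitmap.  This is
generic C++ with illustrative names, not the actual predict.cc code:

#include <optional>
#include <unordered_map>

using ssa_def = int;   // stand-in for an SSA definition
using expected = int;  // stand-in for the predicted value

// Absent key = never visited; nullopt = visit in progress (recursion)
// or "no expected value".
static std::unordered_map<ssa_def, std::optional<expected>> cache;

static std::optional<expected> compute_expected_value (ssa_def def);

static std::optional<expected>
expected_value (ssa_def def)
{
  auto [it, inserted] = cache.emplace (def, std::nullopt);
  if (!inserted)
    return it->second;   // cache hit; a cycle also lands here and stops
  auto val = compute_expected_value (def);  // may recurse via operands
  cache[def] = val;      // re-lookup: recursion may have rehashed the map
  return val;
}

// Trivial stand-in for evaluating a def from its operands.
static std::optional<expected>
compute_expected_value (ssa_def def)
{
  return def % 2 ? std::optional<expected> (def) : std::nullopt;
}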

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-27 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #57 from Andrew Macleod  ---
FWIW, on my last run, enabling early VRP sped up my -O1 compilation by a fair
amount: total compilation dropped from 480 seconds to 380 seconds.

It took 2.5 seconds to run, and I'm going to guess it might have cleaned the
code up somewhat.

Might be worth looking at a Fast VRP pass at -O1?  I'm not sure how/where to
add NEXT_PASS (pass_fast_vrp); to run it at -O1, but it can easily be
experimented with via

-fenable-tree-evrp --param=vrp-block-limit=1

which will always run Fast VRP as the EVRP pass.
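
For reference, passes are scheduled via NEXT_PASS entries in gcc/passes.def;
a hypothetical fragment (exact placement, and gating it to -O1 in the pass's
gate function, are precisely the open questions above) would be:

	NEXT_PASS (pass_fast_vrp);  /* hypothetical placement */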

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #56 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:942bbb2357656019caa3f8ebd2d23b09222f039a

commit r15-3896-g942bbb2357656019caa3f8ebd2d23b09222f039a
Author: Richard Biener 
Date:   Wed Sep 25 10:38:12 2024 +0200

tree-optimization/114855 - speed up dom_oracle::register_transitives

dom_oracle::register_transitives contains an unbound dominator walk
which for the testcase in PR114855 dominates the profile.  The following
fixes the unbound work done by assigning a constant work budget to the
loop, bounding the number of dominators visited but also the number of
relations processed.  This gets both dom_oracle::register_transitives and
get_immediate_dominator off the profile.

I'll note that we're still doing an unbound dominator walk via
equiv_set in find_equiv_dom at the start of the function and when
we register a relation that also looks up the same way.  At least
for the testcase at hand this isn't an issue.

I've also amended the guard to register_transitives with the
per-basic-block limit for the number of relations registered not
being exhausted.

PR tree-optimization/114855
* params.opt (--param transitive-relations-work-bound): New.
* doc/invoke.texi (--param transitive-relations-work-bound):
Document.
* value-relation.cc (dom_oracle::register_transitives):
Assign an overall work budget, bounding the dominator walk and
the number of relations processed.
(dom_oracle::record): Only register_transitives when the
number of already registered relations does not yet exceed
the per-BB limit.
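
The shape of the budgeting, as a self-contained sketch in generic C++
(illustrative types; the real code lives in value-relation.cc and the
budget is --param transitive-relations-work-bound):

#include <vector>

struct block
{
  block *idom;                 // immediate dominator, null at the root
  std::vector<int> relations;  // stand-in for relations recorded in BLOCK
};

// Walk up the dominator chain, charging each dominator visited and each
// relation inspected against one shared work budget.
static void
register_transitives_sketch (block *bb, int work_bound)
{
  int budget = work_bound;
  for (block *dom = bb->idom; dom && budget > 0; dom = dom->idom)
    {
      --budget;                // charge the dominator visit
      for (int rel : dom->relations)
        {
          if (--budget <= 0)
            return;            // budget exhausted: stop the walk
          (void) rel;          // real code would combine REL here
        }
    }
}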

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #55 from Andrew Macleod  ---
(In reply to Richard Biener from comment #50)
> (In reply to Richard Biener from comment #4)
> > Trunk at -O1:
> > 
> > dominator optimization     : 495.14 ( 82%)   0.20 (  5%) 495.44 ( 81%)   113M (  5%)
> 
> Compared to that we're now at the following state with -O1 (everything >=
> 4%):
> 
>  callgraph ipa passes   :  17.23 ( 10%)
>  df live regs   :   6.76 (  4%)
>  dominator optimization :  89.76 ( 50%)
>  backwards jump threading   :   7.94 (  4%)
>  TOTAL  : 180.77
> 
> So it's still DOM aka forward threading eating most of the time. 
> -fno-thread-jumps improves compile-time to 77s, DOM then still takes 25s
> (33%)
> (top offenders are then dom_oracle::register_transitives, bitmap_set_bit
> and wide_int_storage copying).  I noticed the unbound dominator traversal
> in register_transitives already.

There's already a creation flag to disable registering transitives if we
decide we don't want them due to their expense.  Fast VRP always disables
them, but regular VRP never does.  We switch to fast VRP when the size of
the CFG goes beyond a threshold.

The threader uses a normal ranger, so it isn't getting the benefit of this
flag.  It would be simple enough to disable transitives in ranger if the CFG
is beyond a specified size, and we could use the same --param VRP uses to
switch algorithms.

Something like this ought to remove that hotspot:

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 4b94e0db7aa..f665925f630 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -1003,7 +1003,8 @@ ranger_cache::ranger_cache (int not_executable_flag, bool use_imm_uses)
   m_temporal = new temporal_cache;
 
   // If DOM info is available, spawn an oracle as well.
-  create_relation_oracle ();
+  bool transitives_p = (last_basic_block_for_fn (cfun) < param_vrp_block_limit);
+  create_relation_oracle (transitives_p);
   create_infer_oracle (use_imm_uses);
   create_gori (not_executable_flag, param_vrp_switch_limit);

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #54 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:9b7626383822799d60ea3c62e62e700f16337cd6

commit r15-3860-g9b7626383822799d60ea3c62e62e700f16337cd6
Author: Aldy Hernandez 
Date:   Mon Aug 26 07:33:38 2024 +0200

Remove recursion in simplify_control_stmt_condition_1 [PR114855].

Remove some ad-hoc simplification code in the forward threader, as the
call into the ranger in m_simplifier->simplify() will handle anything we
can do manually in simplify_control_stmt_condition_1.

In PR114855, DOM time is reduced from 120s to 92s (-23%) and overall
compilation time from 235s to 205s (-12%).  The total thread count at -O1
is unchanged for the testcase.

In our bootstrap .ii benchmark suite, I see we thread 3 fewer threads
over all files at -O1.  At -O2, the backward threader picks up one more,
for no difference overall.

PR tree-optimization/114855

gcc/ChangeLog:

* tree-ssa-threadedge.cc: Remove unneeded recursion.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread aldy at quesejoda dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #53 from aldy at quesejoda dot com ---
rguenther at suse dot de via Gcc-bugs  writes:


>> Have you tried the patch in comment 22?  That should reduce the time in
>> DOM by 23%.
>
> I thought that was already applied ...?

No.  I wanted to investigate the 3 missing threads, but I think the
patch can go in as is.  I'll be away for a few weeks, but feel free to
push it.  It was already tested.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #52 from rguenther at suse dot de  ---
On Wed, 25 Sep 2024, aldy at quesejoda dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> 
> --- Comment #51 from aldy at quesejoda dot com ---
> "rguenth at gcc dot gnu.org via Gcc-bugs"  writes:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> >
> > --- Comment #50 from Richard Biener  ---
> > (In reply to Richard Biener from comment #4)
> >> Trunk at -O1:
> >> 
> >> dominator optimization     : 495.14 ( 82%)   0.20 (  5%) 495.44 ( 81%)   113M (  5%)
> >
> > Compared to that we're now at the following state with -O1 (everything >= 
> > 4%):
> >
> >  callgraph ipa passes   :  17.23 ( 10%)
> >  df live regs   :   6.76 (  4%)
> >  dominator optimization :  89.76 ( 50%)
> >  backwards jump threading   :   7.94 (  4%)
> >  TOTAL  : 180.77
> >
> > So it's still DOM aka forward threading eating most of the time. 
> > -fno-thread-jumps improves compile-time to 77s, DOM then still takes 25s 
> > (33%)
> > (top offenders are then dom_oracle::register_transitives, bitmap_set_bit
> > and wide_int_storage copying).  I noticed the unbound dominator traversal
> > in register_transitives already.
> 
> Have you tried the patch in comment 22?  That should reduce the time in
> DOM by 23%.

I thought that was already applied ...?

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread aldy at quesejoda dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #51 from aldy at quesejoda dot com ---
"rguenth at gcc dot gnu.org via Gcc-bugs"  writes:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
>
> --- Comment #50 from Richard Biener  ---
> (In reply to Richard Biener from comment #4)
>> Trunk at -O1:
>> 
>> dominator optimization     : 495.14 ( 82%)   0.20 (  5%) 495.44 ( 81%)   113M (  5%)
>
> Compared to that we're now at the following state with -O1 (everything >= 4%):
>
>  callgraph ipa passes   :  17.23 ( 10%)
>  df live regs   :   6.76 (  4%)
>  dominator optimization :  89.76 ( 50%)
>  backwards jump threading   :   7.94 (  4%)
>  TOTAL  : 180.77
>
> So it's still DOM aka forward threading eating most of the time. 
> -fno-thread-jumps improves compile-time to 77s, DOM then still takes 25s (33%)
> (top offenders are then dom_oracle::register_transitives, bitmap_set_bit
> and wide_int_storage copying).  I noticed the unbound dominator traversal
> in register_transitives already.

Have you tried the patch in comment 22?  That should reduce the time in
DOM by 23%.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #50 from Richard Biener  ---
(In reply to Richard Biener from comment #4)
> Trunk at -O1:
> 
> dominator optimization     : 495.14 ( 82%)   0.20 (  5%) 495.44 ( 81%)   113M (  5%)

Compared to that we're now at the following state with -O1 (everything >= 4%):

 callgraph ipa passes   :  17.23 ( 10%)
 df live regs   :   6.76 (  4%)
 dominator optimization :  89.76 ( 50%)
 backwards jump threading   :   7.94 (  4%)
 TOTAL  : 180.77

So it's still DOM aka forward threading eating most of the time. 
-fno-thread-jumps improves compile-time to 77s, DOM then still takes 25s (33%)
(top offenders are then dom_oracle::register_transitives, bitmap_set_bit
and wide_int_storage copying).  I noticed the unbound dominator traversal
in register_transitives already.

With -O2 we're still running into the backwards threader slowness.  I don't
see a quick way to fix that without also eventually changing what is threaded
and what is not, as a side effect of changing thread materialization order.
So I think a bigger refactoring like Aldy started is necessary.  Eventually
I'll re-investigate a "quick" fix, but at least being able to record
additional metadata per thread path is necessary (so 0001 of Aldy's proposed
series in its current or in slightly altered form).

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #49 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:cc141b56b367b3d81c1b590e22ae174f1e013009

commit r15-3854-gcc141b56b367b3d81c1b590e22ae174f1e013009
Author: Richard Biener 
Date:   Tue Sep 24 14:10:13 2024 +0200

rtl-optimization/114855 - slow add_store_equivs in IRA

For the testcase in PR114855 at -O1, add_store_equivs shows up as the
main sink for bitmap_set_bit because it uses a bitmap to mark all
seen insns by UID, to make sure the forward walk in memref_used_between_p
will find the insn in question.  Given we do have a CFG here, the
function's operation is questionable; given memref_used_between_p
together with the walk of all insns is obviously quadratic in the
worst case, that whole thing should be re-done ... but, for the
testcase, using an sbitmap of size get_max_uid () + 1 gets
bitmap_set_bit off the profile and improves IRA time from 15.58s (8%)
to 3.46s (2%).

Now, given the above quadraticness, I wonder whether we should instead
gate add_store_equivs on optimize > 1 or flag_expensive_optimizations.

PR rtl-optimization/114855
* ira.cc (add_store_equivs): Use sbitmap for tracking
visited insns.
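
The data-structure contrast, sketched self-contained in plain C++
(std::set standing in for the searched set, std::vector<bool> for the
flat sbitmap; GCC types elided):

#include <set>
#include <vector>

// Before: a sorted sparse set -- every insertion pays a lookup.
static void
mark_sparse (std::set<int> &seen, int uid)
{
  seen.insert (uid);
}

// After: a flat bit array sized for the whole UID range (the analogue
// of an sbitmap of size get_max_uid () + 1) -- O(1) set and test,
// cheap because insn UIDs are dense.
static void
mark_flat (std::vector<bool> &seen, int uid)
{
  seen[uid] = true;
}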

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-25 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #48 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:caf3fe7880e62692da45489dc5bcae069c1555c8

commit r15-3852-gcaf3fe7880e62692da45489dc5bcae069c1555c8
Author: Richard Biener 
Date:   Tue Sep 24 11:47:26 2024 +0200

tree-optimization/114855 - slow VRP due to equiv oracle queries

For the testcase in PR114855 VRP takes 320.41s (23%) (after mitigating
backwards threader slowness).  This is mostly due to the bitmap check
in equiv_oracle::find_equiv_dom.  The following turns this bitmap
to tree view, trading the linear search for an O(log N) one, which
improves VRP time to 54.54s (5%).

PR tree-optimization/114855
* value-relation.cc (equiv_oracle::equiv_oracle): Switch
m_equiv_set to tree view.
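
An illustrative analogue of the list-view vs. tree-view trade-off, in
self-contained C++ (a linked list of bit words vs. the same words keyed
in a search tree; GCC's actual tree view switches the representation of
the same bitmap in place):

#include <list>
#include <map>
#include <utility>

// "List view": membership is a linear walk over the elements.
static bool
in_list (const std::list<std::pair<int, unsigned long>> &elts, int index)
{
  for (const auto &e : elts)
    if (e.first == index / 64)
      return (e.second >> (index % 64)) & 1;
  return false;
}

// "Tree view": the same elements in a search tree, making membership
// O(log N) -- the find_equiv_dom fix in a nutshell.
static bool
in_tree (const std::map<int, unsigned long> &elts, int index)
{
  auto it = elts.find (index / 64);
  return it != elts.end () && ((it->second >> (index % 64)) & 1);
}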

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-24 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #47 from Sam James  ---
> gcc.exe (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 5.1.0

GCC 5 is long EOL, unfortunately.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-24 Thread jeremy.rutman at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #46 from jeremy rutman  ---
I don't know if this is relevant, but a certain gcc I was using lately seems
to do fine compiling one of the autogenerated files in question (an AES128
encrypt file), but quits unexpectedly when I try compiling a second (the
corresponding AES128 decrypt).

version: 
gcc.exe (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 5.1.0
Copyright (C) 2015 Free Software Foundation, Inc.


$ C\:/Program\ Files/ANSYS\ Inc/v241/SCADE/contrib/Msys64/mingw64/bin/gcc.exe
-v -pedantic -Wall -Wextra -Wwrite-strings -g3 -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\libraries\SC65\libmathext" -I"..\cryptol_aes"
-I"..\cryptol_aes\dec" -I"..\cryptol_aes\enc" -I"." -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\." -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\include" -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\include\C" -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\include\Ada" -I"C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\lib\Ada" -pedantic -DBUILD_DLL -DSIM -DWIN32 -D_CONSOLE
-D_MBCS -c -ansi -std=c99 -m64 "..\cryptol_aes\dec\aesDecrypt.c" -o
"win64\aesDecrypt.o"
Using built-in specs.
COLLECT_GCC=C:\Program Files\ANSYS
Inc\v241\SCADE\contrib\Msys64\mingw64\bin\gcc.exe
Target: x86_64-w64-mingw32
Configured with: ../../../src/gcc-5.1.0/configure --host=x86_64-w64-mingw32
--build=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --prefix=/mingw64
--with-sysroot=/c/mingw510/x86_64-510-win32-sjlj-rt_v4-rev0/mingw64
--with-gxx-include-dir=/mingw64/x86_64-w64-mingw32/include/c++ --enable-shared
--enable-static --enable-targets=all --enable-multilib
--enable-languages=ada,c,c++,fortran,objc,obj-c++,lto
--enable-libstdcxx-time=yes --enable-threads=win32 --enable-libgomp
--enable-libatomic --enable-lto --enable-graphite --enable-checking=release
--enable-fully-dynamic-string --enable-version-specific-runtime-libs
--enable-sjlj-exceptions --disable-isl-version-check --disable-libstdcxx-pch
--disable-libstdcxx-debug --enable-bootstrap --disable-rpath
--disable-win32-registry --disable-nls --disable-werror --disable-symvers
--with-gnu-as --with-gnu-ld --with-arch-32=i686 --with-arch-64=nocona
--with-tune-32=generic --with-tune-64=core2 --with-libiconv --with-system-zlib
--with-gmp=/c/mingw510/prerequisites/x86_64-w64-mingw32-static
--with-mpfr=/c/mingw510/prerequisites/x86_64-w64-mingw32-static
--with-mpc=/c/mingw510/prerequisites/x86_64-w64-mingw32-static
--with-isl=/c/mingw510/prerequisites/x86_64-w64-mingw32-static
--with-pkgversion='x86_64-win32-sjlj-rev0, Built by MinGW-W64 project'
--with-bugurl=http://sourceforge.net/projects/mingw-w64 CFLAGS='-O2 -pipe
-I/c/mingw510/x86_64-510-win32-sjlj-rt_v4-rev0/mingw64/opt/include
-I/c/mingw510/prerequisites/x86_64-zlib-static/include
-I/c/mingw510/prerequisites/x86_64-w64-mingw32-static/include' CXXFLAGS='-O2
-pipe -I/c/mingw510/x86_64-510-win32-sjlj-rt_v4-rev0/mingw64/opt/include
-I/c/mingw510/prerequisites/x86_64-zlib-static/include
-I/c/mingw510/prerequisites/x86_64-w64-mingw32-static/include' CPPFLAGS=
LDFLAGS='-pipe -L/c/mingw510/x86_64-510-win32-sjlj-rt_v4-rev0/mingw64/opt/lib
-L/c/mingw510/prerequisites/x86_64-zlib-static/lib
-L/c/mingw510/prerequisites/x86_64-w64-mingw32-static/lib '
Thread model: win32
gcc version 5.1.0 (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project)
COLLECT_GCC_OPTIONS='-v' '-Wall' '-Wextra' '-Wwrite-strings' '-g3' '-I'
'C:\Program Files\ANSYS Inc\v241\SCADE\SCADE\.\libraries\SC65\libmathext' '-I'
'..\cryptol_aes' '-I' '..\cryptol_aes\dec' '-I' '..\cryptol_aes\enc' '-I' '.'
'-I' 'C:\Program Files\ANSYS Inc\v241\SCADE\SCADE\.' '-I' 'C:\Program
Files\ANSYS Inc\v241\SCADE\SCADE\.\include' '-I' 'C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\include\C' '-I' 'C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\include\Ada' '-I' 'C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\lib\Ada' '-Wpedantic' '-D' 'BUILD_DLL' '-D' 'SIM' '-D'
'WIN32' '-D' '_CONSOLE' '-D' '_MBCS' '-c' '-ansi' '-std=c99' '-m64' '-o'
'win64\aesDecrypt.o' '-mtune=core2' '-march=nocona'
 C:/Program Files/ANSYS
Inc/v241/SCADE/contrib/Msys64/mingw64/bin/../libexec/gcc/x86_64-w64-mingw32/5.1.0/cc1.exe
-quiet -v -I C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\libraries\SC65\libmathext -I ..\cryptol_aes -I
..\cryptol_aes\dec -I ..\cryptol_aes\enc -I . -I C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\. -I C:\Program Files\ANSYS Inc\v241\SCADE\SCADE\.\include
-I C:\Program Files\ANSYS Inc\v241\SCADE\SCADE\.\include\C -I C:\Program
Files\ANSYS Inc\v241\SCADE\SCADE\.\include\Ada -I C:\Program Files\ANSYS
Inc\v241\SCADE\SCADE\.\lib\Ada -iprefix C:/Program Files/ANSYS
Inc/v241/SCADE/contrib/Msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/5.1.0/
-dD -U_REENTRANT -D BUILD_DLL -D SIM -D WIN32 -D _CONSOLE -D _MBCS
..\cryptol_aes\dec\aesDecrypt.c -quiet -dumpbase aesDecrypt.c -m64 -mtune=core2
-march=nocona -auxbase-strip win64\aesDecrypt.o -g3 -Wall -Wextra
-Wwrite-strings -Wpedantic -ansi -std=c99 -version -o
C:

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #45 from Richard Biener  ---
I noticed that

get_bitmask_from_range(tree_node*, generic_wide_int const&,
generic_wide_int const&)

is quite high on the profile, accumulating hits on
wide_int_storage::operator=(wide_int_storage const&).

I wonder if irange_bitmask::irange_bitmask can be micro-optimized somehow
for the case of a wi::zero 'value'?  Likewise, whether ::set_unknown
using assignments instead of initializers like ": m_value (..)" makes a
difference.
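
The init-vs-assign distinction in isolation (std::string standing in for
wide_int_storage; a generic illustration, not the irange_bitmask code):

#include <string>

struct with_init
{
  std::string m_value;
  // Member initializer: m_value is copy-constructed from V directly.
  with_init (const std::string &v) : m_value (v) {}
};

struct with_assign
{
  std::string m_value;
  // Body assignment: m_value is default-constructed first, then
  // operator= runs -- the extra step suspected of showing up as
  // wide_int_storage::operator= in the profile.
  with_assign (const std::string &v) { m_value = v; }
};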

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #44 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:9a795b3a5b6a0d8b4b4f38a66ab9782aabead92e

commit r15-3824-g9a795b3a5b6a0d8b4b4f38a66ab9782aabead92e
Author: Richard Biener 
Date:   Tue Sep 24 12:53:11 2024 +0200

tree-optimization/114855 - more update_ssa speedup

The following tackles another source of slow bitmap operations,
namely populating blocks_to_update.  We already have that in
tree view around PHI insertion, but the initial population is
slow too.  There's unfortunately a conditional list-view
requirement in between, and the bitmap API doesn't allow opportunistic
switching but rejects tree -> tree or list -> list transitions.
So the following patch wraps the early population in a tree view
section with possibly one redundant tree -> list -> tree view
transition.

This cuts tree SSA incremental from 228.25s (21%) to 65.05s (7%).

PR tree-optimization/114855
* tree-into-ssa.cc (update_ssa): Use tree view for the
initial population of blocks_to_update.
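
The view-switching constraint, as a self-contained toy (like the real
bitmap API, switching to the view you are already in is rejected; names
illustrative):

#include <cassert>

struct viewed_bitmap
{
  bool tree_view_p = false;
  void tree_view () { assert (!tree_view_p); tree_view_p = true; }
  void list_view () { assert (tree_view_p); tree_view_p = false; }
};

static void
update_ssa_sketch (viewed_bitmap &blocks_to_update, bool need_list_view)
{
  blocks_to_update.tree_view ();     // fast O(log N) population
  // ... populate blocks_to_update ...
  if (need_list_view)
    {
      blocks_to_update.list_view (); // the conditional list-view section
      // ... code requiring list view ...
      blocks_to_update.tree_view (); // possibly redundant round trip
    }
  // ... remaining population ...
  blocks_to_update.list_view ();     // restore the default view
}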

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #43 from Richard Biener  ---
Thanks for the work so far.  It seems back_jt_path_registry::update_cfg
has a "dead" guard (its tracking of visited_starting_edges) against paths
not processed by adjust_paths_after_duplication, so for the purpose of
limiting complexity we can do something like

diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
index c88cc1d6aac..88e7290db61 100644
--- a/gcc/tree-ssa-threadupdate.cc
+++ b/gcc/tree-ssa-threadupdate.cc
@@ -2292,8 +2292,11 @@ back_jt_path_registry::adjust_paths_after_duplication (unsigned curr_path_num)
 {
   vec<jump_thread_edge *> *curr_path = m_paths[curr_path_num];
 
-  /* Iterate through all the other paths and adjust them.  */
-  for (unsigned cand_path_num = 0; cand_path_num < m_paths.length (); )
+  /* Iterate through all the other paths and adjust them.
+     ???  Limit the linear search so we do not run into quadratic
+     behavior (see PR114855).  */
+  for (unsigned cand_path_num = 0;
+       cand_path_num < std::min (m_paths.length (), 1024u);)
     {
       if (cand_path_num == curr_path_num)
 	{

I measured no difference from disabling the merging (didn't try even
larger limits).

As for addressing the problem with "sorting", I think we should see to
ordering the jump threads in optimal order, which should generally be RPO
of the entry edges; in case of multiple paths with the same entry
(the case adjust_paths_after_duplication is concerned with), we should
order those longest-to-shortest.

I believe all other ordering issues should be non-existent, but we can end
up cancelling paths when we get the order wrong for same-entry paths.

Eventually adjusting the path_compare sort function in your patch to
further order paths after e->dest->index and .length () would fix the
observed regressions.

Oddly, your patch-set increases DOM time two-fold?

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #42 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:f9dfe8dea31bf5c56aa7798a0905707faf9e7ec4

commit r15-3818-gf9dfe8dea31bf5c56aa7798a0905707faf9e7ec4
Author: Richard Biener 
Date:   Mon Sep 23 15:41:14 2024 +0200

tree-optimization/114855 - high update_ssa time

Part of the problem in PR114855 is high update_ssa time.  When one fixes
the backward jump threading issue, tree SSA incremental is at
439.91s ( 26%), mostly doing bitmap element searches for
blocks_with_phis_to_rewrite.  The following turns that bitmap to tree
view, noticing that the two-dimensional vector of PHIs it guards is
excessive compared to what we actually save with it - walking all PHI
nodes in a block, something we already do once to initialize stmt flags.
So instead of optimizing that walk we use the stmt flag, saving
allocations and global state that lives throughout the whole
compilation.

This reduces the tree SSA incremental time to 203.13s ( 14%).

The array was added in r0-74758-g2ce798794df8e1 when we still possibly
had a gazillion virtual operands for PR26830; I checked the testcase
still behaves OK.

PR tree-optimization/114855
* tree-into-ssa.cc (phis_to_rewrite): Remove global var.
(mark_phi_for_rewrite): Simplify.
(rewrite_update_phi_arguments): Walk all PHIs, process
those satisfying rewrite_uses_p.
(delete_update_ssa): Simplify.
(update_ssa): Likewise.  Switch blocks_with_phis_to_rewrite
to tree view.
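
The replacement walk, reduced to a self-contained sketch (generic C++,
illustrative types; the real walk uses the GIMPLE PHI iterators and the
rewrite_uses_p stmt flag named in the ChangeLog):

#include <vector>

struct phi
{
  bool rewrite_uses_p;   // the per-stmt flag the patch reuses
};

struct basic_block
{
  std::vector<phi> phis;
};

// Instead of per-block "PHIs to rewrite" vectors (the removed
// phis_to_rewrite), walk every PHI once and let the flag -- already
// initialized by an earlier walk -- select the work.
static void
rewrite_update_phi_arguments_sketch (basic_block &bb)
{
  for (phi &p : bb.phis)
    if (p.rewrite_uses_p)
      {
        // ... rewrite this PHI's arguments ...
      }
}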

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-23 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #41 from Aldy Hernandez  ---
Created attachment 59181
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59181&action=edit
untested patch sorting threadable paths

The performance improvement with this patch is:

** mainline
Time variable                                   usr           sys          wall           GGC
 dominator optimization     : 161.07 (  9%)   0.65 (  8%) 162.00 (  9%)   236M (  9%)
 backwards jump threading   :1386.84 ( 78%)   1.76 ( 22%)1393.97 ( 78%)   519M ( 20%)
 TOTAL                      :1775.41          7.88       1789.76         2636M

** with patchset
Time variable                                   usr           sys          wall           GGC
 dominator optimization     : 284.32 ( 46%)   1.39 ( 16%) 286.54 ( 46%)   447M ( 13%)
 backwards jump threading   :  15.51 (  3%)   0.87 ( 10%)  16.40 (  3%)   510M ( 15%)
 TOTAL                      : 616.15          8.81        626.82         3334M

As discussed in comment 39, it causes a regression in
gcc.dg/tree-ssa/ssa-sink-13.c (and possibly elsewhere) that must be addressed.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-23 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #40 from Aldy Hernandez  ---
For the record, I still think adjust_paths_after_duplication() isn't giving us
much for all the hassle it's causing.

I compared the number of threaded paths with and without it and the difference
is:

* mainline
Number of threads:
 threadfull:31634
 ethread:22253
 thread:12048
 total:  65935

* without adjust_paths_after_duplication:
Number of threads:
 threadfull:30561
 ethread:22253
 thread:11743
 total:  64557


That is, adjust_paths_after_duplication() is just saving us 2% of threaded
paths.

The above is for -O2 -fno-tree-dominator-opts which disables the DOM threader,
so DOM doesn't start picking up the slack from the lack of subpaths in the
backward threader.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-23 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #39 from Aldy Hernandez  ---
I'm going to step away from this PR for a few weeks so I'll do a brain dump on
where I am, just in case someone wants to poke at it some more.

This problem in adjust_paths_after_duplication() boils down to a slow sub-path
search.  The current search is quadratic, but it can easily be remedied by
keeping the paths sorted.  In a patchset I will attach shortly we can
dramatically improve the performance by implementing the above:

** mainline
Time variable                                   usr           sys          wall           GGC
 dominator optimization     : 161.07 (  9%)   0.65 (  8%) 162.00 (  9%)   236M (  9%)
 backwards jump threading   :1386.84 ( 78%)   1.76 ( 22%)1393.97 ( 78%)   519M ( 20%)
 TOTAL                      :1775.41          7.88       1789.76         2636M

** with patchset
Time variable                                   usr           sys          wall           GGC
 dominator optimization     : 284.32 ( 46%)   1.39 ( 16%) 286.54 ( 46%)   447M ( 13%)
 backwards jump threading   :  15.51 (  3%)   0.87 ( 10%)  16.40 (  3%)   510M ( 15%)
 TOTAL                      : 616.15          8.81        626.82         3334M
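
A self-contained sketch of the sorted-lookup idea (illustrative struct,
not the actual jump_thread_edge vectors; the ordering follows comment
#43's suggestion of entry block first, then longest-to-shortest):

#include <algorithm>
#include <climits>
#include <vector>

struct path
{
  int entry_bb;   // index of the path's entry block
  int length;     // number of edges in the path
};

static bool
path_compare (const path &a, const path &b)
{
  if (a.entry_bb != b.entry_bb)
    return a.entry_bb < b.entry_bb;
  return a.length > b.length;   // longest first within one entry block
}

// With m_paths kept sorted by path_compare, all candidates entering
// ENTRY_BB are found by one O(log N) binary search instead of a linear
// scan per duplicated path.
static std::vector<path>::const_iterator
first_path_from (const std::vector<path> &paths, int entry_bb)
{
  return std::lower_bound (paths.begin (), paths.end (),
                           path{entry_bb, INT_MAX}, path_compare);
}

The catch, as comments #36 and #43 discuss, is that
rewire_first_differing_edge mutates path entries, so the order must be
re-established (or fixed-up paths re-placed) after each adjustment.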

The problem though is that the jump threader magic that reorders the basic
blocks once we know what paths to thread is sensitive to the order of the
vector itself.  This affects all threaders, and it's code I'm unfamiliar with.

For example, gcc.dg/tree-ssa/ssa-sink-13.c has the following thread paths
in the queue:

  [1] Registering jump thread: (2, 15) incoming edge;  (4, 5) nocopy; 
  [2] Registering jump thread: (2, 3) incoming edge;  (3, 4) normal (4, 6)
nocopy; 
  [3] Registering jump thread: (4, 6) incoming edge;  (6, 8) nocopy; 
  [4] Registering jump thread: (6, 8) incoming edge;  (8, 9) nocopy; 
  [5] Registering jump thread: (6, 7) incoming edge;  (7, 8) normal (8, 10)
nocopy; 
  [6] Registering jump thread: (8, 10) incoming edge;  (10, 12) nocopy; 
  [7] Registering jump thread: (10, 12) incoming edge;  (12, 13) nocopy; 
  [8] Registering jump thread: (10, 11) incoming edge;  (11, 12) normal (12,
14) nocopy; 

Once we process the first path, the code in back_jt_path_registry::update_cfg()
removes it with m_paths.ordered_remove (0).  The vector code cheaply moves the
path in position 8 to the first position, leaving the vector like this:

  [1] Registering jump thread: (10, 11) incoming edge;  (11, 12) normal (12,
14) nocopy;
  [2] Registering jump thread: (2, 3) incoming edge;  (3, 4) normal (4, 6)
nocopy; 
  [3] Registering jump thread: (4, 6) incoming edge;  (6, 8) nocopy; 
  [4] Registering jump thread: (6, 8) incoming edge;  (8, 9) nocopy; 
  [5] Registering jump thread: (6, 7) incoming edge;  (7, 8) normal (8, 10)
nocopy; 
  [6] Registering jump thread: (8, 10) incoming edge;  (10, 12) nocopy; 
  [7] Registering jump thread: (10, 12) incoming edge;  (12, 13) nocopy; 

This simple change starts a cascade of transformations that ultimately causes
us to spectacularly fail ssa-sink-13.c.

So I have a patchset that could theoretically fix the performance issues in
the backward threader, but we first need to figure out why/if we care about
the ordering of the path vector affecting code generation.  Because I'm afraid
any change we make is going to be affected by this.

I'll post patches next.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-17 Thread aldy at quesejoda dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #38 from aldy at quesejoda dot com ---
On Tue, Sep 17, 2024 at 1:19 PM rguenther at suse dot de <
gcc-bugzi...@gcc.gnu.org> wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
>
> --- Comment #37 from rguenther at suse dot de 
> ---
> On Tue, 17 Sep 2024, aldyh at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> >
> > --- Comment #36 from Aldy Hernandez  ---
> > (In reply to Richard Biener from comment #33)
> > > Can we just sort m_paths after the path entry BB and fix the lookup
> that way?
> >
> > This seemed promising, especially because the
> adjust_paths_after_duplication()
> > gets called from the backwards threader in order, and always starting at
> > position 0, so searching for the first item would be super fast.
> However, this
> > function will rewire the edges in the path
> (rewire_first_differing_edge), thus
> > messing up the sort.
>
> Bah.  Remove those "fixed" candidates from the set (dropping 2nd level
> opportunities that way)?  Alternatively sort m_paths into a better
>

Yeah, that's exactly what I thought.  I doubt we're getting much from 2nd
level subpaths.


> data structure (IIRC we only have splay-tree as binary tree) for
> the purpose of the lookup and re-place fixed up paths (but keep
> m_paths for the purpose of processing things).
>

Yeah, ultimately this is a data structure problem.

I'll poke a bit more tomorrow.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #37 from rguenther at suse dot de  ---
On Tue, 17 Sep 2024, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> 
> --- Comment #36 from Aldy Hernandez  ---
> (In reply to Richard Biener from comment #33)
> > Can we just sort m_paths after the path entry BB and fix the lookup that 
> > way?
> 
> This seemed promising, especially because the adjust_paths_after_duplication()
> gets called from the backwards threader in order, and always starting at
> position 0, so searching for the first item would be super fast.  However, 
> this
> function will rewire the edges in the path (rewire_first_differing_edge), thus
> messing up the sort.

Bah.  Remove those "fixed" candidates from the set (dropping 2nd level
opportunities that way)?  Alternatively sort m_paths into a better
data structure (IIRC we only have splay-tree as binary tree) for
the purpose of the lookup and re-place fixed up paths (but keep
m_paths for the purpose of processing things).

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-17 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #36 from Aldy Hernandez  ---
(In reply to Richard Biener from comment #33)
> Can we just sort m_paths after the path entry BB and fix the lookup that way?

This seemed promising, especially because the adjust_paths_after_duplication()
gets called from the backwards threader in order, and always starting at
position 0, so searching for the first item would be super fast.  However, this
function will rewire the edges in the path (rewire_first_differing_edge), thus
messing up the sort.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-05 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #35 from Aldy Hernandez  ---
(In reply to Andrew Macleod from comment #34)
> Btw, if we simply remove the call to that function, with -O1
> -fenable-tree-thread1 I see:
> 
>  backwards jump threading   :  14.36 (  3%)   0.39 (  5%)  14.68 (  3%)   238M (  9%)
> 
> which is a notable improvement from 1370 seconds :-P  So clearly the
> culprit.
> 
> DOM appears to still be somewhat of an issue.  I've done some experimenting

adjust_paths_after_duplication() only applies to paths made by the backwards
threader (hence the prefix, back_jt_path_registry).  I don't remember why, as
it should apply to any set of paths.  It could be that the forward threader
didn't create such redundant paths, or some other reason lost in time.

> in this area, but nothing I'm willing to commit to yet.  Do you have
> anything outstanding for this, Aldy?
> 
> dominator optimization     : 208.04 ( 43%)   0.95 ( 12%) 209.77 ( 43%)   342M ( 13%)

The patch in comment 22 should reduce DOM significantly:
https://gcc.gnu.org/bugzilla/attachment.cgi?id=59001

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-05 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #34 from Andrew Macleod  ---
Btw, if we simply remove the call to that function, with -O1
-fenable-tree-thread1 I see:

 backwards jump threading   :  14.36 (  3%)   0.39 (  5%)  14.68 (  3%)   238M (  9%)

which is a notable improvement from 1370 seconds :-P  So clearly the culprit.

DOM appears to still be somewhat of an issue.  I've done some experimenting in
this area, but nothing I'm willing to commit to yet.  Do you have anything
outstanding for this, Aldy?

 dominator optimization     : 208.04 ( 43%)   0.95 ( 12%) 209.77 ( 43%)   342M ( 13%)

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #33 from Richard Biener  ---
Can we just sort m_paths after the path entry BB and fix the lookup that way?

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #32 from Richard Biener  ---
(In reply to Aldy Hernandez from comment #31)
> (In reply to rguent...@suse.de from comment #30)
> > On Wed, 4 Sep 2024, amacleod at redhat dot com wrote:
> 
> > > I tried running valgrind, which died, but up to that point it showed
> > > 77% of the time spent in the body of
> > > back_jt_path_registry::adjust_paths_after_duplication ()
> > 
> > Hmm, I can have a look there if you don't beat me to it (though the
> > profiles didn't have this part in AFAIR).
> 
> [Some background before somebody shoots me.]
> 
> This function should really go.  It was leftover from a PR I worked on a
> long time ago, before the backwards threader rewrite.  In working on it I
> realized that we were threading a lot of sub-threads that could be
> simplified.  I suggested this to Jeff, who may remember the details.  There
> is no testcase unfortunately.
> 
> It seemed cheap at the time, since the original backwards threader could
> only thread a handful of things, so the number of threads was always in the
> single digits.
> 
> The function is ugly.  I thought about removing it as part of the threader
> rewrite, but wasn't brave enough because we didn't have a testcase and I
> had forgotten the details.
> 
> Ok, enough excuses.
> 
> Andrew was playing with this last night and saw a reduction from 1700
> seconds to 27 (?? in a subset of the testcase??) by just removing it.
> 
> We either need to rewrite this function, or delete it altogether.  I vote
> for the latter.  It's a micro optimization that is obviously too expensive
> to be worth it.  Perhaps we should benchmark how many times it actually
> triggers in a bootstrap, and what the supposed code size savings are.

I think this functionality should have been implemented during discovery
somehow - you'd have to build a way to search threading paths from
path prefixes.  (quite easy to do for forward threading, just to name
another reason why forward threading is "better")

In the end this would also affect costing of a path (or a pair of paths),
but doing better there, esp. if path A or B individually are not profitable
to thread but the combined thread would be within size/copy limits of
two paths, is a bit difficult.

It might be that the easiest way would be to keep it at transform time, but
instead of walking all paths after each threading, just populate a
<start-of-path, orig-bb> -> copy-bb map, and when threading a new
path look up <start-of-path, orig-bb> for each of the thread's paths and
adjust it there?  You'd need multiple levels as well, I guess, if you
have three paths working on each other.  Maybe the map should be
two levels deep, the first keyed only on start-of-path, yielding a map
for the orig-bb.
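
A self-contained sketch of that two-level map (all names hypothetical):

#include <unordered_map>

using bb_index = int;
using copy_map = std::unordered_map<bb_index, bb_index>;        // orig-bb -> copy-bb
static std::unordered_map<bb_index, copy_map> copies_by_start;  // keyed on start-of-path

// Record, after duplicating ORIG as COPY while threading the path
// entering at START.
static void
record_copy (bb_index start, bb_index orig, bb_index copy)
{
  copies_by_start[start][orig] = copy;
}

// When materializing another path, an O(1)-average lookup replaces the
// linear walk over all registered paths.
static bb_index
lookup_copy (bb_index start, bb_index orig)
{
  auto outer = copies_by_start.find (start);
  if (outer == copies_by_start.end ())
    return orig;   // no duplication recorded; keep the original block
  auto inner = outer->second.find (orig);
  return inner == outer->second.end () ? orig : inner->second;
}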

Note the situation that the start of a path is copied is also not handled,
which basically duplicates a threading opportunity to two paths.

That said, implementing separate threading for each path is probably
sub-optimal and instead the "threading web" could have a better overall
data structure where these combinations/duplications are more obvious
(and can be costed to their actual effect).

That said - I'd not remove the functionality but instead throw memory
at it to make the "lookup" O(log n) instead of a linear walk and perform
the adjustment at the time a thread is code generated (but fill the data
structure after each thread).

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-04 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

Aldy Hernandez  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org

--- Comment #31 from Aldy Hernandez  ---
(In reply to rguent...@suse.de from comment #30)
> On Wed, 4 Sep 2024, amacleod at redhat dot com wrote:

> > I tried running valgrind, which died, but up to that point it showed 77%
> > of the time spent in the body of
> > back_jt_path_registry::adjust_paths_after_duplication ()
> 
> Hmm, I can have a look there if you don't beat me to it (though the
> profiles didn't have this part in AFAIR).

[Some background before somebody shoots me.]

This function should really go.  It was leftover from a PR I worked on a long
time ago, before the backwards threader rewrite.  In working on it I realized
that we were threading a lot of sub-threads that could be simplified.  I
suggested this to Jeff, who may remember the details.  There is no testcase
unfortunately.

It seemed cheap at the time, since the original backwards threader could only
thread a handful of things, so the number of threads was always in the single
digits.

The function is ugly.  I thought about removing it as part of the threader
rewrite, but wasn't brave enough because we didn't have a testcase and I
had forgotten the details.

Ok, enough excuses.

Andrew was playing with this last night and saw a reduction from 1700 seconds
to 27 (?? in a subset of the testcase??) by just removing it.

We either need to rewrite this function, or delete it altogether.  I vote for
the latter.  It's a micro optimization that is obviously too expensive to be
worth it.  Perhaps we should benchmark how many times it actually triggers in a
bootstrap, and what the supposed code size savings are.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-04 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #30 from rguenther at suse dot de  ---
On Wed, 4 Sep 2024, amacleod at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> 
> --- Comment #29 from Andrew Macleod  ---
> Huh. Do we calculate *all* paths ahead of time?

Yes, we first identify all profitable threading candidates in a function
and then perform the threading.

> I tried running valgrind, which died, but up to that point it showed 77%
> of the time spent in the body of
> back_jt_path_registry::adjust_paths_after_duplication ()

Hmm, I can have a look there if you don't beat me to it (though the
profiles didn't have this part in AFAIR).

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-04 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #29 from Andrew Macleod  ---
Huh. Do we calculate *all* paths ahead of time?

I tried running valgrind, which died, but up to that point it showed 77% of
the time spent in the body of
back_jt_path_registry::adjust_paths_after_duplication ().

That function looks kind of quadratic or worse in nature, and when I run it
with GDB and force a stop at a random point, I see:

(gdb) p m_paths->m_vec
$2 = (vec<vec<jump_thread_edge*, va_heap, vl_embed> *, va_heap, vl_embed> *)
0x12f50750
(gdb) p *(m_paths->m_vec)
$3 = {m_vecpfx = {m_alloc = 745039, m_using_auto_storage = 0, m_num = 572595}}
(gdb) p cand_path_num
$4 = 309148
(gdb) p curr_path_num
$5 = 0


Am I reading that right?  572,595 paths, of which we are looking at candidate
#309,148 ??

It *appears* that this is called for path 0, then path 0 is removed, and it's
all done over again with one less path in the vector.

At one point a little later in back_jt_path_registry::update_cfg I see:

p m_num_threaded_edges
$19 = 18669

and the size of the m_paths vector is down to 558637:
(gdb) p *(m_paths.m_vec)
$23 = {m_vecpfx = {m_alloc = 745039, m_using_auto_storage = 0, m_num = 558637}}

back_jt_path_registry::update_cfg() makes a lot of calls to
back_jt_path_registry::duplicate_thread_path, which then invokes
adjust_paths_after_duplication.

All this time is spent here copying and moving things.

Something seems horribly wrong about that.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #28 from Richard Biener  ---
Just to clarify what the "handcuffs" in the backwards threader do and what
they don't.  There is no limit on the number of cond/switch stmts (thus
basic-blocks) we consider as the exit of paths, but for each such exit
there is a limit on the number of paths we consider to it
(param_max_jump_thread_paths) and by that a limit on how many times
we instantiate a path_range_query and call range_of_stmt on that path's
exit statement (cond/switch).

So the number of path_range_query instantiations and range_of_stmt calls
should be linear in the number of basic-blocks with a constant factor
(param_max_jump_thread_paths).

At least that was the intent - placing a counter in
back_threader::find_taken_edge_{switch,cond} should be able to prove that
in case you'd have a testcase you can easily grow/shrink while preserving
its "pattern".

My observation was that instantiation of path_range_query and the
range_of_stmt call does work that is not on the order of the path size
but instead scales at least linearly with the size of the function - for
the CFG/stmt pattern of the testcase at hand, at least.  The identified
unlimited dominator walk has this property but might not be the main
time-sink in the end.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-03 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #27 from Andrew Macleod  ---
(In reply to Aldy Hernandez from comment #26)

> 
> With -O1 -fenable-tree-thread1 (no threadfull):
>  dominator optimization     : 127.76 (  7%)   0.57 (  7%) 128.58 (  7%)   236M (  9%)
>  backwards jump threading   :1375.45 ( 79%)   1.81 ( 22%)1382.33 ( 79%)   519M ( 20%)
>  TOTAL                      :1733.30          8.16       1747.35         2636M
> 

> Richi mentioned that the handcuffs he added weren't being triggered in the
> backwards threader because ranger was going outside the path, but
> -fenable-tree-thread1 adds a light-weight threading pass that shouldn't use
> anything but gori (*).  So, either we're doing something stupid in path
> discovery, or we're looking outside the path, even with a non-resolving
> threader.  Something definitely smells bad.
> 
> (*) Well, technically we use the global ranger to look at relations. 
> Andrew, would looking at relations cause us to look outside of the given
> path?

Sure. If a relation is not resolved within the path, and there might be a
relation determined before the path, then it queries the global oracle starting
at the root of the path. 

The relation oracle does not cache anything on lookups.  It was anticipated
as a lightweight query, and caching results would end up costing more in
memory than it would save in subsequent queries.

So if there are frequent queries that escape the path, and those potential
relations are way up in the CFG, it could do a lot of traversals.

Also, the oracle does not KNOW if there is a relation, it only knows when both
ssa-names HAVE relations, so it has to go look to see if they are related to
each other.

That said, I tried disabling the root oracle, and there was basically no
difference.   So that is probably a red herring.

I also tried disabling relations, i.e., hacked the path oracle to always
return VREL_VARYING and not actually register anything.  The threader only
dropped from 79% to 77% with no relations, so I do not think the relation
oracle is the issue.

It's sure spending a lot of time somewhere.  Still looking... It does appear
to do a lot of the same work over and over.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-02 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #26 from Aldy Hernandez  ---
I think there's something fundamentally wrong in the *backwards* threader that
causes us to blow up, even without fully resolving conditions with a global
ranger.

I tried running at -O1 with -fenable-tree-thread1.  This basically adds just
one (non fully-resolving) backwards threader instance (i.e. not
-fenable-tree-threadfull1, which is what we do at -O2).  This helps analyze
the problem with the backwards threader, bringing down the test arena from
hours to less than 30 minutes:

With -O1 -fenable-tree-thread1 (no threadfull):
 dominator optimization     : 127.76 (  7%)   0.57 (  7%) 128.58 (  7%)   236M (  9%)
 backwards jump threading   :1375.45 ( 79%)   1.81 ( 22%)1382.33 ( 79%)   519M ( 20%)
 TOTAL                      :1733.30          8.16       1747.35         2636M

What's curious is that adding just this one non-resolving backward threader
instance causes us to go from 370 seconds to 1733 seconds.  That's a full 20
minutes for one backward threader pass.

Richi mentioned that the handcuffs he added weren't being triggered in the
backwards threader because ranger was going outside the path, but
-fenable-tree-thread1 adds a light-weight threading pass that shouldn't use
anything but gori (*).  So, either we're doing something stupid in path
discovery, or we're looking outside the path, even with a non-resolving
threader.  Something definitely smells bad.

(*) Well, technically we use the global ranger to look at relations.  Andrew,
would looking at relations cause us to look outside of the given path?

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-02 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #25 from Aldy Hernandez  ---
(In reply to Aldy Hernandez from comment #22)
> Created attachment 59001 [details]
> reduce recursion in forward threader (patch in testing)

Avoiding unnecessary recursion in simplify_control_stmt_condition_1() reduces
the time in DOM by 23%:

Before:
 dominator optimization     : 213.87 ( 51%)   0.41 (  7%) 215.03 ( 51%)   113M (  6%)
 backwards jump threading   :  29.34 (  7%)   0.19 (  3%)  29.73 (  7%)    68M (  3%)
 TOTAL                      : 416.61          5.97        423.92         2041M

After:
 dominator optimization     : 164.61 ( 44%)   0.40 (  7%) 165.65 ( 44%)   113M (  6%)
 backwards jump threading   :  28.18 (  8%)   0.20 (  3%)  28.50 (  8%)    68M (  3%)
 TOTAL                      : 370.40          5.88        377.37         2041M

As mentioned, in our benchmark suite I see a difference of 3 threads at -O1,
and nothing at -O2.  I still haven't been able to pinpoint the exact
difference in threads.  I will do so when I return next week (I'm having
logistical issues with daycare this week).  I wouldn't be surprised if it was
nothing, as the forward threader frequently tries to thread unreachable
paths, which has made comparing the forward / backward threaders difficult.

If we look at dom_jt_simplifier::simplify(), we'll see that we first try with
DOM's internal tables, and then with ranger.  Avoiding the recursion in
simplify_control_stmt_condition_1() (the caller) with the proposed patch
causes us to attempt to simplify far fewer paths, but it will also cause
fewer attempts with DOM's internal tables.  In theory this is OK, because the
statements in question are either integers or pointers (no relations), so
ranger should be able to do this better than DOM:

  /* If the condition has the form (A & B) CMP 0 or (A | B) CMP 0 then
 recurse into the LHS to see if there is a simplification that
 makes this condition always true or always false along the edge
 E.  */
  if ((cond_code == EQ_EXPR || cond_code == NE_EXPR)
  && TREE_CODE (op0) == SSA_NAME
  && integer_zerop (op1))
{
...
}

In other words, the patch also reduces the recursion into DOM's tables, but I
think this is OK, because DOM's lookup_avail_expr() shouldn't be getting
anything ranger can't already get with pointers and integers.  I still want to
check those 3 threads (not in this testcase, but in our internal benchmarks) to
make sure.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-02 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #24 from Aldy Hernandez  ---
(In reply to Richard Biener from comment #13)
> Most of the -O1 dom time is spent in threading using path ranger to simplify
> the JT conditions.  That in turn does (for each threading from scratch?)
> GORI computes with most of the time spent in range_from_dom in the
> gori_compute::may_recompute_p cycle (and there doing bitmap operations).
> That compute has a 'depth' --param but it looks like range_from_dom
> doesn't and we have a very deep dominator tree for this testcase.

The path solver does look outside of the path, but only for incoming ranges to
the path (range_on_entry).  It uses a ranger, but it is shared among all paths
and queries to it should be O(1) as they should all be cached in the ranger. 
This is in DOM:

  /* Recursively walk the dominator tree optimizing statements.  */
  gimple_ranger *ranger = enable_ranger (fun);
  path_range_query path_query (*ranger, /*resolve=*/true);

We try really hard in path_range_query to not query anything outside of the
path, not even with the shared ranger, unless m_resolve == true.  But even so,
things should be cached.

Curiously, I tried disabling the full ranger resolver in DOM threading with:

diff --git a/gcc/tree-ssa-dom.cc b/gcc/tree-ssa-dom.cc
index 221fe6db321..7279d359606 100644
--- a/gcc/tree-ssa-dom.cc
+++ b/gcc/tree-ssa-dom.cc
@@ -926,7 +926,7 @@ pass_dominator::execute (function *fun)

   /* Recursively walk the dominator tree optimizing statements.  */
   gimple_ranger *ranger = enable_ranger (fun);
-  path_range_query path_query (*ranger);
+  path_range_query path_query (*ranger, /*resolve=*/false);
   dom_jt_simplifier simplifier (avail_exprs_stack, ranger, &path_query);
   dom_jt_state state (const_and_copies, avail_exprs_stack);
   jump_threader threader (&simplifier, &state);

But this only reduced the time from 164 secs to 85 secs.  Intuitively it
seems logical that if we're doing full resolving of things outside of the
path (PHI edges, and a lot more work) we could be 2x as slow.  But we're
still slow.

 dominator optimization     :  85.59 ( 30%)   0.33 (  6%)  86.15 ( 30%)   113M (  6%)
 backwards jump threading   :  29.58 ( 10%)   0.23 (  4%)  29.97 ( 10%)    68M (  3%)
 TOTAL                      : 284.35          5.84        291.01         2041M

Perhaps Andrew can comment on the cache and on the range_from_dom() impact.
If that's what's happening, we shouldn't be looking outside of the path, at
least not in a way that drastically impacts compile time.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-26 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #23 from Aldy Hernandez  ---
(In reply to Aldy Hernandez from comment #22)
> Created attachment 59001 [details]
> reduce recursion in forward threader (patch in testing)
> 
> As suggested by Richard in PR116166.

Should've been more verbose in the description:

This patch removes some ad-hoc simplification code in the forward threader,
as the call into the ranger in m_simplifier->simplify() will handle anything
we can do manually in simplify_control_stmt_condition_1.

In PR114855, DOM time is reduced from 120s to 92s (-23%) and overall
compilation time from 235s to 205s (-12%).  The total thread count at -O1 is
unchanged for the testcase.

In our bootstrap .ii benchmark suite, I see we get 3 fewer threads across
all files at -O1.  At -O2, the backward threader picks up one more, for no
overall difference.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-26 Thread aldyh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #22 from Aldy Hernandez  ---
Created attachment 59001
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59001&action=edit
reduce recursion in forward threader (patch in testing)

As suggested by Richard in PR116166.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #21 from Richard Biener  ---
(In reply to Andrew Macleod from comment #20)
> I did an -O2 run after those patches. 
> 
> Highlights:
> 
>  tree SSA incremental     :   117.74 (  1%)   0.63 (  3%)    120.37 (  1%)  1049M ( 24%)
>  dominator optimization   :   680.49 (  5%)   0.65 (  4%)    686.40 (  5%)   170M (  4%)
>  backwards jump threading : 11253.23 ( 90%)   9.15 ( 49%)  11332.34 ( 90%)  1332M ( 30%)
>  TOTAL                    : 12520.81         18.54         12622.54         4459M
> 
> I suspect most of the DOM time is the repeated edge recalculations, but I
> don't actually know that.  I do know that most of the 200 seconds at -O1
> is, so it seems a reasonable guess.
> 
> The backwards threader I can't comment on, but the total time would be
> reasonable if that were fixed.

The backwards threader has some limits in place, but it eventually
re-evaluates all ranges on the path for each block considered, and as long
as that range evaluation isn't bounded in how far it looks, it will run
away.  The main "source" of slowness is back_threader::find_taken_edge_cond,
which instantiates a path_range_query and evaluates range_of_stmt on the
conditional that ends the path.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-10 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #20 from Andrew Macleod  ---
I did an -O2 run after those patches. 

Highlights:

 tree SSA incremental     :   117.74 (  1%)   0.63 (  3%)    120.37 (  1%)  1049M ( 24%)
 dominator optimization   :   680.49 (  5%)   0.65 (  4%)    686.40 (  5%)   170M (  4%)
 backwards jump threading : 11253.23 ( 90%)   9.15 ( 49%)  11332.34 ( 90%)  1332M ( 30%)
 TOTAL                    : 12520.81         18.54         12622.54         4459M

I suspect most of the DOM time is the repeated edge recalculations, but I
don't actually know that.  I do know that most of the 200 seconds at -O1 is,
so it seems a reasonable guess.

The backwards threader I can't comment on, but the total time would be
reasonable if that were fixed.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-09 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #19 from Andrew Macleod  ---
(In reply to GCC Commits from comment #17)
> The master branch has been updated by Andrew Macleod :
> 
> https://gcc.gnu.org/g:9e4da946c4263a4c89d5fc365b3c97ae244c5018
> 
> commit r15-2858-g9e4da946c4263a4c89d5fc365b3c97ae244c5018
> Author: Andrew MacLeod 
> Date:   Thu Aug 8 16:37:28 2024 -0400
> 
> Adjust ranger's recomputation depth based on the number of BBs.
> 
> As the number of blocks increases, recomputations can become more
> expensive.  Adjust the depth limit to avoid excessive compile time.
> 
> PR tree-optimization/114855
> * gimple-range-gori.cc (gori_compute::gori_compute): Adjust
> ranger_recompute_depth limit based on the number of BBs.
> (gori_compute::may_recompute_p): Use previously calculated value.
> * gimple-range-gori.h (gori_compute::m_recompute_depth): New.

With the second change, on my machine at -O1 we're down to:
 callgraph ipa passes     :  47.83 ( 11%)   1.64 ( 25%)   49.65 ( 12%)   338M ( 17%)
 df live regs             :  12.19 (  3%)   0.19 (  3%)   12.41 (  3%)      0  (  0%)
 dominator optimization   : 214.20 ( 51%)   0.58 (  9%)  215.56 ( 51%)   113M (  6%)
 backwards jump threading :  27.58 (  7%)   0.20 (  3%)   27.95 (  7%)    68M (  3%)
 tree CCP                 :   5.46 (  1%)   0.15 (  2%)    5.72 (  1%)    14M (  1%)
 tree FRE                 :   5.48 (  1%)   0.20 (  3%)    5.69 (  1%)  2063k (  0%)
 dominance computation    :   8.54 (  2%)   0.18 (  3%)    8.79 (  2%)      0  (  0%)
 integrated RA            :  28.48 (  7%)   0.03 (  0%)   28.61 (  7%)    29M (  1%)
 LRA non-specific         :  12.43 (  3%)   0.01 (  0%)   12.48 (  3%)  8977k (  0%)
 TOTAL                    : 416.00          6.52          424.24         2041M

So better.  Still have some DOM work to do.  That is mostly edge range
recalculations in the path ranger and GORI.

-O2 is a different beast :-P

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-09 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

Sam James  changed:

   What|Removed |Added

   See Also||https://github.com/GaloisInc/cryptol-compiler/issues/42

--- Comment #18 from Sam James  ---
(In reply to Richard Biener from comment #15)
> The testcase is a bit unwieldly for developing a fix - I wonder if it's
> possible to auto-generate smaller testcases with the same structure?

Jeremy, could you give us some recipe for it?

That said, I tried to look a bit; the file says "Automatically generated by SBV".

It looks like it comes from an SMT-solver-based code generator:
https://hackage.haskell.org/package/sbv-10.11/docs/Documentation-SBV-Examples-CodeGeneration-AddSub.html
(the example looks kind of similar to the original testcase too, just way
smaller).

That led me to
https://github.com/GaloisInc/cryptol-compiler/issues/42#issuecomment-2076405286.
I don't know enough about Haskell stuff to try to extract a generator from
that, though.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #17 from GCC Commits  ---
The master branch has been updated by Andrew Macleod :

https://gcc.gnu.org/g:9e4da946c4263a4c89d5fc365b3c97ae244c5018

commit r15-2858-g9e4da946c4263a4c89d5fc365b3c97ae244c5018
Author: Andrew MacLeod 
Date:   Thu Aug 8 16:37:28 2024 -0400

Adjust ranger's recomputation depth based on the number of BBs.

As the number of blocks increases, recomputations can become more
expensive.  Adjust the depth limit to avoid excessive compile time.

PR tree-optimization/114855
* gimple-range-gori.cc (gori_compute::gori_compute): Adjust
ranger_recompute_depth limit based on the number of BBs.
(gori_compute::may_recompute_p): Use previously calculated value.
* gimple-range-gori.h (gori_compute::m_recompute_depth): New.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #16 from GCC Commits  ---
The master branch has been updated by Andrew Macleod :

https://gcc.gnu.org/g:5ce3874b3c2fdd76f506005cb1171a732af7c807

commit r15-2857-g5ce3874b3c2fdd76f506005cb1171a732af7c807
Author: Andrew MacLeod 
Date:   Thu Aug 8 16:34:15 2024 -0400

Limit equivalency processing in ranger's cache.

When the number of blocks exceeds VRP's sparse threshold, do not query all
equivalencies during cache filling.  This can be expensive for unknown
benefit.

PR tree-optimization/114855
* gimple-range-cache.cc (ranger_cache::fill_block_cache): Do not
process equivalencies if the number of blocks is too high.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-08-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #15 from Richard Biener  ---
The testcase is a bit unwieldy for developing a fix; I wonder if it's
possible to auto-generate smaller testcases with the same structure?
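
For what it's worth, a throwaway generator emitting the same shape (long
chains of ternary selects like the ones quoted in comment #2) could look
like this untested sketch; the operand choices i/2, i/3 and i%256 are
arbitrary and only ensure every name is defined before use:

  #include <cstdio>
  #include <cstdlib>

  /* Emit N chained ternary selects in the style of the SBV output.  */
  int
  main (int argc, char **argv)
  {
    int n = argc > 1 ? atoi (argv[1]) : 1000;
    printf ("typedef unsigned char SWord8;\n"
            "SWord8 f (SWord8 s0, SWord8 s1, SWord8 s2)\n{\n");
    for (int i = 3; i < n; i++)
      printf ("  const SWord8 s%d = s%d ? s%d : %d;\n",
              i, i / 2, i / 3, i % 256);
    printf ("  return s%d;\n}\n", n - 1);
    return 0;
  }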

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-07-25 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #14 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:634eae5ec3f3af2c4f6221d3ed2cf78d7f5c47f0

commit r15-2312-g634eae5ec3f3af2c4f6221d3ed2cf78d7f5c47f0
Author: Sam James 
Date:   Tue Jul 23 15:06:10 2024 +0100

doc: Document -O1 as the preferred level for large machine-generated code

At -O1, the intention is that we compile things in a "reasonable" amount
of time (ditto memory use). In particular, we try to especially avoid
optimizations which scale poorly on pathological cases, as is the case
for large machine-generated code.

Recommend -O1 for large machine-generated code, as has been informally
done on bugs for a while now.

This applies (broadly speaking) both to large machine-generated functions
and, to a lesser extent, to repetitive small-but-still-not-tiny functions
from a generator program.

gcc/ChangeLog:
PR middle-end/114855
* doc/invoke.texi (Optimize options): Mention machine-generated
code for -O1.

[Bug middle-end/114855] ICE: Segfault

2024-06-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #13 from Richard Biener  ---
Most of the -O1 DOM time is spent in threading, using the path ranger to
simplify the jump-threading conditions.  That in turn does GORI computes
(from scratch for each threading attempt?), with most of the time spent in
range_from_dom in the gori_compute::may_recompute_p cycle (doing bitmap
operations there).  That compute has a 'depth' --param, but it looks like
range_from_dom doesn't, and we have a very deep dominator tree for this
testcase.

What's also oddly expensive (visible through the wide_int_storage::operator=
profile) is irange_bitmask::intersect, I suspect

  // If we have two known bits that are incompatible, the resulting
  // bit is undefined.  It is unclear whether we should set the entire
  // range to UNDEFINED, or just a subset of it.  For now, set the
  // entire bitmask to unknown (VARYING).
  if (wi::bit_and (~(m_mask | src.m_mask),
   m_value ^ src.m_value) != 0)
{

is quite expensive to evaluate.

It might make sense to implement a wi::not_ior_and_xor_nonzero_p special
case for this (unfortunately wide-int doesn't use expression templates).
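
For illustration, a sketch of the kind of fused evaluation meant here,
written on raw 64-bit blocks rather than the real wide-int representation
(the name mirrors the suggestion above; the signature is made up):

  #include <cstdint>

  /* Hypothetical fused predicate: true iff (~(a | b) & (c ^ d)) != 0,
     computed one block at a time so no intermediate wide values are
     materialized.  */
  static bool
  not_ior_and_xor_nonzero_p (const uint64_t *a, const uint64_t *b,
                             const uint64_t *c, const uint64_t *d,
                             unsigned len)
  {
    for (unsigned i = 0; i < len; i++)
      if (~(a[i] | b[i]) & (c[i] ^ d[i]))
        return true;  /* incompatible known bits in this block */
    return false;
  }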

Limiting range_from_dom like the following improves compile-time at -O1
from 600s to 200s (tested on the gcc-14 branch).  This should probably
re-use an existing ranger --param that's related or add a new one.
Note that -O -fno-thread-jumps compiles in 30s.  IIRC the path ranger uses
its own cache that it wipes between queries; I don't know how this interacts
with GORI (it hopefully shouldn't recompute things, but I don't know).

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index a33b7a73872..47117db0648 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -1668,6 +1668,7 @@ ranger_cache::range_from_dom (vrange &r, tree name, basic_block start_bb,
   else
 bb = get_immediate_dominator (CDI_DOMINATORS, start_bb);

+  unsigned depth = 10;
   // Search until a value is found, pushing blocks which may need calculating.
   for ( ; bb; prev_bb = bb, bb = get_immediate_dominator (CDI_DOMINATORS, bb))
 {
@@ -1709,6 +1710,9 @@ ranger_cache::range_from_dom (vrange &r, tree name, basic_block start_bb,

   if (m_on_entry.get_bb_range (r, name, bb))
break;
+
+  if (--depth == 0)
+   break;
 }

   if (DEBUG_RANGE_CACHE)

I think we're running into several "layers" of (limited or unlimited)
recursions that compose to O (N^M) behavior here.  In other places of
the compiler we impose a global work limit to avoid this and to allow
one layer to use up the work fully when the others do not need deep
recursion.  Of course that only works if the work can be fairly
distributed.
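
A shared budget could be as simple as the following (illustrative sketch
only, not tested; the counter placement and size would need tuning):

  /* One work budget threaded through the cooperating layers: a layer
     that recurses deeply uses up work the other layers then cannot.  */
  struct work_budget
  {
    int remaining;
    explicit work_budget (int n) : remaining (n) {}
    bool spend (int n = 1) { remaining -= n; return remaining >= 0; }
  };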

Note instead of limiting the depth of the DOM walk above you could
also limit the number of blocks added to m_workback.
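
Something along these lines, perhaps, replacing the depth counter in the
hack above (untested sketch; work_limit is a made-up bound, start_limit is
the existing local in range_from_dom):

+      // Bound the number of blocks queued for computation rather than
+      // the dominator walk depth.
+      if (m_workback.length () - start_limit > work_limit)
+       break;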

We hit the join block handling often for the testcase; I think that when
one of the pred_bb->preds has BB as src we can avoid adding to m_workback,
since we know there's no edge range on that edge and thus resolve_dom would
union with VARYING?  Thus

@@ -1693,8 +1694,8 @@ ranger_cache::range_from_dom (vrange &r, tree name, basic_block start_bb,
  edge_iterator ei;
  bool all_dom = true;
  FOR_EACH_EDGE (e, ei, prev_bb->preds)
-   if (e->src != bb
-   && !dominated_by_p (CDI_DOMINATORS, e->src, bb))
+   if (e->src == bb
+   || !dominated_by_p (CDI_DOMINATORS, e->src, bb))
  {
all_dom = false;
break;

though doing this doesn't help the testcase.  But I see that resolve_dom
eventually recurses to range_from_dom, which in this case doesn't stop at
the immediate dominator of prev_bb but only, eventually, at the definition
of 'name'.  For the testcase we always have only two incoming edges, but in
theory this leads to quadraticness?

Trying to limit this with a hack (not sure when else the stack isn't empty
upon recursion) like the following doesn't help though (in addition to the
above changes); instead it results in a slight slowdown.

@@ -1632,6 +1632,8 @@ ranger_cache::resolve_dom (vrange &r, tree name, basic_block bb)
   m_on_entry.set_bb_range (name, bb, r);
 }

+static vec<basic_block> rfd_limit = vNULL;
+
 // Get the range of NAME from dominators of BB and return it in R.  Search the
 // dominator tree based on MODE.

@@ -1657,6 +1659,8 @@ ranger_cache::range_from_dom (vrange &r, tree name, basic_block start_bb,
   // Range on entry to the DEF block should not be queried.
   gcc_checking_assert (start_bb != def_bb);
   unsigned start_limit = m_workback.length ();
+  if (!rfd_limit.is_empty ())
+def_bb = get_immediate_dominator (CDI_DOMINATORS, rfd_limit.last ());

   // Default value is global range.
   get_global_range (r, name);
@@ -1736,7 +1744,11 @@ ranger_cache::range_from_dom (vrange &r, tree name, basic_block start_bb,
  // RFD_FILL, then the cache cant be stored to

[Bug middle-end/114855] ICE: Segfault

2024-06-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #12 from Richard Biener  ---
At -O1 we have

Samples: 2M of event 'cycles:u', Event count (approx.): 2983686432518   
Overhead   Samples  Command  Shared Object  Symbol
  19.77%    467950  cc1      cc1            [.] bitmap_bit_p
  12.31%    300919  cc1      cc1            [.] wide_int_storage::operator=
   6.79%    158610  cc1      cc1            [.] gori_compute::may_recompute_p
   4.84%    113100  cc1      cc1            [.] ranger_cache::range_from_dom
   3.79%     88582  cc1      cc1            [.] bitmap_set_bit
   3.24%     75772  cc1      cc1            [.] block_range_cache::get_bb_range
   2.40%     56058  cc1      cc1            [.] get_immediate_dominator
   2.37%     55493  cc1      cc1            [.] gori_map::exports
   2.15%     50244  cc1      cc1            [.] gori_map::is_export_p
   1.87%     45710  cc1      cc1            [.] wide_int_storage::wide_int_storage
   1.73%     40436  cc1      cc1            [.] infer_range_manager::has_range_p
   1.70%     39586  cc1      cc1            [.] gimple_has_side_effects
   1.17%     28642  cc1      cc1            [.] irange_storage::get_irange
   1.13%     27004  cc1      cc1            [.] back_jt_path_registry::adjust_paths_after_duplication

so it's DOM's jump threader that takes the time.  Using -O1 -fno-thread-jumps
this improves a lot to:

Samples: 362K of event 'cycles:u', Event count (approx.): 441041461405  
Overhead   Samples  Command  Shared Object  Symbol
  22.44%     78191  cc1      cc1            [.] wide_int_storage::operator=
  11.02%     38451  cc1      cc1            [.] bitmap_bit_p
   3.55%     12318  cc1      cc1            [.] dom_oracle::register_transitives
   3.45%     12016  cc1      cc1            [.] wide_int_storage::wide_int_storage

I'm going to try to collect a callgrind profile for -O1.

[Bug middle-end/114855] ICE: Segfault

2024-06-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #11 from Richard Biener  ---
Btw, a question to the reporter: I suppose the files are machine-generated.
Are you able to create a file of smaller size?  This one has ~200000 lines;
something with 20000 or 2000 lines would be perfect.

[Bug middle-end/114855] ICE: Segfault

2024-06-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #10 from Richard Biener  ---
Created attachment 58505
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58505&action=edit
preprocessed testcase

[Bug middle-end/114855] ICE: Segfault

2024-06-22 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener  ---
Note looking at -O1 is the most important thing, as we tell people to use -O1
for autogenerated code.  There I suppose comment#4 still applies and likely
this is ranger as well.  Maybe DOM's ranger use can be tuned down at -O1
(when -ftree-vrp is off).

[Bug middle-end/114855] ICE: Segfault

2024-05-09 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #8 from Andrew Macleod  ---
(In reply to Andrew Macleod from comment #7)
> Looks like the primary culprits now are:
> 
> dominator optimization   :  666.73 (  7%)   0.77 (  2%)   671.76 (  7%)   170M (  4%)
> backwards jump threading : 7848.77 ( 85%)  21.04 ( 65%)  7920.05 ( 85%)  1332M ( 29%)
> 
> TOTAL                    : 9250.99         32.58         9341.40         4619M

If I turn off threading, then VRP pops up with 400 seconds, so I took a look at VRP.

The biggest problem is that this testcase has on the order of 400,000 basic
blocks, with a pattern of a block of code followed by a lot of CFG diamonds
using a number of different SSA names from within the block, over and over.
When we are calculating/storing imports and exports for every block, then
utilizing that info to try to find outgoing ranges that maybe we can use, it
simply adds up.

For VRP, we currently utilize different cache models depending on the number
of blocks.  I'm wondering if this might be a good testcase for actually
using a different VRP when the number of blocks is excessive.  I wrote the
fast VRP pass last year, which currently isn't being used.  I'm going to
experiment with turning it on for CFGs above a threshold (100,000 BBs?),
enabling the lower-overhead fast VRP instead for all VRP passes.
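
The gate could be a simple block-count check, something like this
(hypothetical sketch, not actual pass code; the threshold is a placeholder):

  // Choose the low-overhead fast VRP when the CFG is huge.
  static bool
  use_fast_vrp_p (function *fn)
  {
    const int threshold = 100000;  /* placeholder block-count cutoff */
    return n_basic_blocks_for_fn (fn) > threshold;
  }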

The threading issue probably needs to have some knobs added or tweaked for
such very large CFGs.  There would be a LOT of threading opportunities in
the code I saw, so I can see why it would be so busy.  I saw a lot of
branches to branches using the same SSA_NAME.

[Bug middle-end/114855] ICE: Segfault

2024-05-03 Thread amacleod at redhat dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #7 from Andrew Macleod  ---
Looks like the primary culprits now are:

 dominator optimization   :  666.73 (  7%)   0.77 (  2%)   671.76 (  7%)   170M (  4%)
 backwards jump threading : 7848.77 ( 85%)  21.04 ( 65%)  7920.05 ( 85%)  1332M ( 29%)

 TOTAL                    : 9250.99         32.58         9341.40         4619M

[Bug middle-end/114855] ICE: Segfault

2024-04-30 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

Richard Biener  changed:

   What|Removed |Added

 CC||aldyh at gcc dot gnu.org

--- Comment #6 from Richard Biener  ---
I've backported the fix for the recursion issue; only the memory/compile-time
hog issue should remain on the branch.  Comment #14 now also applies to the
GCC 13 branch.

[Bug middle-end/114855] ICE: Segfault

2024-04-26 Thread jeremy.rutman at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #5 from jeremy rutman  ---
Using gcc 14.0.1 20240117 (experimental) [master r14-8187-gb00be6f1576] I was
able to compile when not using any flags:

$ /usr/lib/gcc-snapshot/bin/cc -c aesDecrypt.c -o aesDecrypt.o

But when using the flags as before 

$  /usr/lib/gcc-snapshot/bin/cc -Wall -O3 -DNDEBUG -fomit-frame-pointer -c
aesDecrypt.c -o aesDecrypt.o

the compile kept going for at least one hour on my machine before I aborted.

[Bug middle-end/114855] ICE: Segfault

2024-04-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

Richard Biener  changed:

   What|Removed |Added

 CC||amacleod at redhat dot com
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-04-26

--- Comment #4 from Richard Biener  ---
Trunk at -O1:

dominator optimization : 495.14 ( 82%)   0.20 (  5%)  495.44 ( 81%)   113M (  5%)

I can confirm the segfault with the 13.2 release.  It segfaults in

#0  0x009a8603 in (anonymous
namespace)::pass_waccess::check_dangling_stores (this=this@entry=0x2866fc0,
bb=0x75277480, stores=..., bbs=...)
at /space/rguenther/src/gcc-13-branch/gcc/gimple-ssa-warn-access.cc:4535

with too deep recursion.  That was fixed by r14-4308-gf194c684a28a5d for
PR111600 and could be backported, leaving the compile-time hog.  I'll do
that next week.

[Bug middle-end/114855] ICE: Segfault

2024-04-25 Thread jeremy.rutman at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #3 from jeremy rutman  ---
For what it's worth, clang is able to compile the code in question. 

Ubuntu clang version 18.1.3 (1)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

[Bug middle-end/114855] ICE: Segfault

2024-04-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #2 from Andrew Pinski  ---
The code basically does a bunch of:

  const SWord8 s599 = s557 ? s595 : s598;
  const SWord8 s600 = s561 ? 14 : 246;
  const SWord8 s601 = s561 ? 3 : 72;
  const SWord8 s602 = s559 ? s600 : s601;
  const SWord8 s603 = s561 ? 102 : 181;
  const SWord8 s604 = s561 ? 62 : 112;
  const SWord8 s605 = s559 ? s603 : s604;
  const SWord8 s606 = s557 ? s602 : s605;
  const SWord8 s607 = s555 ? s599 : s606;
  const SWord8 s608 = s561 ? 138 : 139;


Continuously.

[Bug middle-end/114855] ICE: Segfault

2024-04-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #1 from Andrew Pinski  ---
Worth noting: on the trunk most of the compile time seems to be in the ranger
code ...