[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #38 from Richard Biener  ---
Created attachment 50354
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50354=edit
SVG of the CFG at LIM

This is a SVG of the CFG as created by dot at the point of the first LIM pass.

The CFG isn't too special and I guess a switch instead of the computed goto
would present us with the same issues.

I suppose putting a hard limit on the number of stores to move and then
ordering candidates based on their importance (execution frequency) is the
way to go.

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-10 Thread lucier at math dot purdue.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #37 from lucier at math dot purdue.edu ---
Created attachment 50352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50352=edit
Smaller parameterized test file

This file is generated from a single copy of the fibonacci function, and is
simplified a bit otherwise.  I believe it has two computed gotos.

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #36 from Richard Biener  ---
So the issue is still the same - one thing I noticed is that store-motion also
adds a flag for each counter update to avoid introducing store-data-races.
-fallow-store-data-races mitigates that part and speeds up the compilation
quite a bit.  In case there are threads involved you'd want
-fprofile-update=atomic
which then causes store-motion to give up and the compile-time is great
overall.

The original trigger of the regression is likely the marking of the profile
counters as to not be aliased - we might want to introduce another flag to
tell that store-data-races for the particular decl are not a consideration
(maybe even have some user-visible attribute for this).

Otherwise re-confirmed (I stripped options down to -O -fPIC -fprofile-arcs
-ftest-coverage):

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-2.o1-fib-2.i
1.84user 0.05system 0:01.90elapsed 99%CPU (0avgtext+0avgdata
160764maxresident)k
0inputs+0outputs (0major+58129minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-3.o1-fib-3.i 
10.15user 0.17system 0:10.32elapsed 99%CPU (0avgtext+0avgdata
726688maxresident)k
0inputs+0outputs (0major+265008minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-4.o1-fib-4.i 
43.60user 1.06system 0:44.68elapsed 99%CPU (0avgtext+0avgdata
6107260maxresident)k
0inputs+0outputs (0major+1765217minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i 
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
Command exited with non-zero status 1
143.09user 3.93system 2:28.29elapsed 99%CPU (0avgtext+0avgdata
24636148maxresident)k
37504inputs+0outputs (31major+6133278minor)pagefaults 0swaps

on the last which runs OOM adding -fallow-store-data-races does

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fallow-store-data-races
123.06user 0.45system 2:03.59elapsed 99%CPU (0avgtext+0avgdata
100maxresident)k
57304inputs+0outputs (68major+535127minor)pagefaults 0swaps

and -fprofile-update=atomic

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fprofile-update=atomic 
0.61user 0.02system 0:00.63elapsed 100%CPU (0avgtext+0avgdata
73236maxresident)k
72inputs+0outputs (0major+18284minor)pagefaults 0swaps

and -fno-tree-loop-im

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fno-tree-loop-im  
1.06user 0.01system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 90672maxresident)k
0inputs+0outputs (0major+24331minor)pagefaults 0swaps

I still wonder if you can produce an even smaller testcase where visualizing
the CFG is possible.  Unfortunately the source is mechanically generated
and following it is hard.  Like a testcase that retains the basic structure
but ends up with just a few (2, less than 10) computed gotos?

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-09 Thread lucier at math dot purdue.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #35 from lucier at math dot purdue.edu ---
Created attachment 50345
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50345=edit
Parametrized input files for test coverage testing.

These are the .i files that go with my previous comment.

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-09 Thread lucier at math dot purdue.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #34 from lucier at math dot purdue.edu ---
I decided to approach this a bit more methodically by generating a series of
synthetic programs, each twice as long as the previous, and to measure the
compilation time.  I'll attach the associated .i files here.

Each .i file was generated from a Scheme file with 2^k copies, k=1,..,5, of a
simple recursive definition of the fibonacci function, suitably renamed.  So
these are not large files by my standards.

The short summary is that CPU time seems to grow quadraticly with the length of
the code.  The required memory grows very quickly, too---I killed the
compilation with k=5 (so 32 copies of fibonacci function) because the
computation filled 32GB of RAM and 32GB of swap.

Perhaps this parameterized input files might be of help.

Brad

I downloaded the git sources for gcc:

heine:~/programs/gcc/gcc-mainline> git log
commit 7eef9a66018e23677058fec421229e3fa435a1a3 (HEAD -> master, origin/master,
origin/HEAD)
Author: Joel Brobecker 
Date:   Mon Mar 8 23:59:37 2021 -0300

I configured and built gcc with

heine:~/programs/gcc/gcc-mainline> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/pkgs/gcc-mainline/bin/gcc
COLLECT_LTO_WRAPPER=/pkgs/gcc-mainline/libexec/gcc/x86_64-pc-linux-gnu/11.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../../gcc-mainline/configure --prefix=/pkgs/gcc-mainline
--enable-languages=c --enable-checking=release
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.1 20210309 (experimental) (GCC) 

The program names are fib-1.c to fib-5.c, fib-k.c contains 2^k copies of
fibonacci.

/pkgs/gcc-mainline/bin/gcc -march=native -D___CAN_IMPORT_CLIB_DYNAMICALLY  -O1 
   -Wno-unused -Wno-write-strings -Wdisabled-optimization -fwrapv
-fno-strict-aliasing -fno-trapping-math -fno-math-errno -fschedule-insns2
-fomit-frame-pointer -fPIC -fno-common -mpc64   -rdynamic -shared 
-D___SINGLE_HOST -D___DYNAMIC
-I"/home/lucier/programs/gambit/gambit-profiled/include" -o 'fib-1.o1' -Q
-fprofile-arcs -ftest-coverage -save-temps   'fib-1.c' 

Time variable   usr   sys  wall
  GGC
 phase setup:   0.02 (100%)   0.00 (  0%)   0.03 (100%)
 5039k (100%)
 TOTAL  :   0.02  0.00  0.03   
 5049k
 btowc wctob mbrlen ___H_fib_2d_1 ___setup_mod ___init_mod
___LNK_fib_2d_1_2e_o1
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> {heap 1240k}  {heap 1240k} 
{heap 1240k}  {heap 1240k}  {heap 2468k}
 {heap 2468k}  {heap 2468k}  {heap
2468k}Streaming LTO
  {heap 2468k}  {heap 2468k}  {heap
2468k}  {heap 2468k}  {heap 2468k}  {heap 2468k}
 {heap 2468k}  {heap 2468k}  {heap
2468k}  {heap 2468k}Assembling functions:
  {heap 2468k} ___setup_mod ___init_mod ___H_fib_2d_1
___LNK_fib_2d_1_2e_o1 _sub_I_00100_0 _sub_D_00100_1
Time variable   usr   sys  wall
  GGC
 phase setup:   0.00 (  0%)   0.00 (  0%)   0.01 (  1%)
 1519k (  6%)
 phase parsing  :   0.06 (  8%)   0.01 ( 20%)   0.08 ( 10%)
 2072k (  8%)
 phase opt and generate :   0.67 ( 92%)   0.04 ( 80%)   0.70 ( 89%)
   22M ( 86%)
 dump files :   0.01 (  1%)   0.00 (  0%)   0.00 (  0%)
0  (  0%)
 callgraph functions expansion  :   0.66 ( 90%)   0.03 ( 60%)   0.69 ( 87%)
   21M ( 82%)
 callgraph ipa passes   :   0.01 (  1%)   0.00 (  0%)   0.01 (  1%)
  570k (  2%)
 cfg cleanup:   0.00 (  0%)   0.00 (  0%)   0.04 (  5%)
   64  (  0%)
 trivially dead code:   0.00 (  0%)   0.01 ( 20%)   0.00 (  0%)
0  (  0%)
 df live regs   :   0.01 (  1%)   0.00 (  0%)   0.02 (  3%)
0  (  0%)
 df live regs   :   0.02 (  3%)   0.00 (  0%)   0.02 (  3%)
0  (  0%)
 df reg dead/unused notes   :   0.02 (  3%)   0.00 (  0%)   0.01 (  1%)
  305k (  1%)
 alias analysis :   0.01 (  1%)   0.00 (  0%)   0.01 (  1%)
 1482k (  6%)
 alias stmt walking :   0.02 (  3%)   0.01 ( 20%)   0.02 (  3%)
 7280  (  0%)
 rebuild jump labels:   0.01 (  1%)   0.00 (  0%)   0.00 (  0%)
0  (  0%)
 preprocessing  :   0.02 (  3%)   0.00 (  0%)   0.01 (  1%)
  240k (  1%)
 lexical analysis   :   0.02 (  3%)   0.01 ( 20%)   0.00 (  0%)
0  (  0%)
 parser (global):   0.01 (  1%)   0.00 (  0%)   0.04 (  5%)
 1239k (  5%)
 parser struct body :   0.01 (  1%)   0.00 (  0%)   0.01 (  1%)
  359k (  1%)
 parser function body   :   0.00 (  0%)   0.00 (  0%)   0.02 (  3%)
  201k (  1%)
 tree gimplify  :   0.00 (  0%)   0.01 ( 20%)   0.00 (  0%)
  297k (  1%)
 tree copy propagation  :   0.01 (  1%)   0.00 (  

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2020-09-29 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #33 from rguenther at suse dot de  ---
On Tue, 29 Sep 2020, lucier at math dot purdue.edu wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928
> 
> --- Comment #32 from lucier at math dot purdue.edu ---
> I don't know precisely what you're saying, but it compiles fine without the
> instrumentation.

Yes - the instrumentation does complicate the IL but the instrumentation
should be already better than linear in the blocks.

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2020-09-29 Thread lucier at math dot purdue.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #32 from lucier at math dot purdue.edu ---
I don't know precisely what you're saying, but it compiles fine without the
instrumentation.

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2020-09-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #31 from Richard Biener  ---
(In reply to lucier from comment #30)
> I'm coming back to this project.
> 
> I naively thought "Well, I don't need arc profiling, I'll just set
> -ftest-coverage without -fprofile-arcs" but it appears that I can't do that,
> the gcda files are generated by -fprofile-arcs.
> 
> It seems to me that test coverage could be implemented simply by
> instrumenting each basic block in an algorithm that's linear in the number
> of basic blocks.  Is it possible to do this?
> 
> Brad

I don't think the instrumentation itself is the problem - it's already
doing better than one counter per block.  It's simply that the large
source runs into multiple non-linearities in core pieces of the compiler
that cannot be turned off ...

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2020-09-28 Thread lucier at math dot purdue.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #30 from lucier at math dot purdue.edu ---
I'm coming back to this project.

I naively thought "Well, I don't need arc profiling, I'll just set
-ftest-coverage without -fprofile-arcs" but it appears that I can't do that,
the gcda files are generated by -fprofile-arcs.

It seems to me that test coverage could be implemented simply by instrumenting
each basic block in an algorithm that's linear in the number of basic blocks. 
Is it possible to do this?

Brad