[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #38 from Richard Biener --- Created attachment 50354 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50354=edit SVG of the CFG at LIM This is a SVG of the CFG as created by dot at the point of the first LIM pass. The CFG isn't too special and I guess a switch instead of the computed goto would present us with the same issues. I suppose putting a hard limit on the number of stores to move and then ordering candidates based on their importance (execution frequency) is the way to go.
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #37 from lucier at math dot purdue.edu --- Created attachment 50352 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50352=edit Smaller parameterized test file This file is generated from a single copy of the fibonacci function, and is simplified a bit otherwise. I believe it has two computed gotos.
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #36 from Richard Biener --- So the issue is still the same - one thing I noticed is that store-motion also adds a flag for each counter update to avoid introducing store-data-races. -fallow-store-data-races mitigates that part and speeds up the compilation quite a bit. In case there are threads involved you'd want -fprofile-update=atomic which then causes store-motion to give up and the compile-time is great overall. The original trigger of the regression is likely the marking of the profile counters as to not be aliased - we might want to introduce another flag to tell that store-data-races for the particular decl are not a consideration (maybe even have some user-visible attribute for this). Otherwise re-confirmed (I stripped options down to -O -fPIC -fprofile-arcs -ftest-coverage): rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-2.o1-fib-2.i 1.84user 0.05system 0:01.90elapsed 99%CPU (0avgtext+0avgdata 160764maxresident)k 0inputs+0outputs (0major+58129minor)pagefaults 0swaps rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-3.o1-fib-3.i 10.15user 0.17system 0:10.32elapsed 99%CPU (0avgtext+0avgdata 726688maxresident)k 0inputs+0outputs (0major+265008minor)pagefaults 0swaps rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-4.o1-fib-4.i 43.60user 1.06system 0:44.68elapsed 99%CPU (0avgtext+0avgdata 6107260maxresident)k 0inputs+0outputs (0major+1765217minor)pagefaults 0swaps rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i gcc: fatal error: Killed signal terminated program cc1 compilation terminated. Command exited with non-zero status 1 143.09user 3.93system 2:28.29elapsed 99%CPU (0avgtext+0avgdata 24636148maxresident)k 37504inputs+0outputs (31major+6133278minor)pagefaults 0swaps on the last which runs OOM adding -fallow-store-data-races does rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fallow-store-data-races 123.06user 0.45system 2:03.59elapsed 99%CPU (0avgtext+0avgdata 100maxresident)k 57304inputs+0outputs (68major+535127minor)pagefaults 0swaps and -fprofile-update=atomic rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fprofile-update=atomic 0.61user 0.02system 0:00.63elapsed 100%CPU (0avgtext+0avgdata 73236maxresident)k 72inputs+0outputs (0major+18284minor)pagefaults 0swaps and -fno-tree-loop-im rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O -fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fno-tree-loop-im 1.06user 0.01system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 90672maxresident)k 0inputs+0outputs (0major+24331minor)pagefaults 0swaps I still wonder if you can produce an even smaller testcase where visualizing the CFG is possible. Unfortunately the source is mechanically generated and following it is hard. Like a testcase that retains the basic structure but ends up with just a few (2, less than 10) computed gotos?
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #35 from lucier at math dot purdue.edu --- Created attachment 50345 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50345=edit Parametrized input files for test coverage testing. These are the .i files that go with my previous comment.
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #34 from lucier at math dot purdue.edu --- I decided to approach this a bit more methodically by generating a series of synthetic programs, each twice as long as the previous, and to measure the compilation time. I'll attach the associated .i files here. Each .i file was generated from a Scheme file with 2^k copies, k=1,..,5, of a simple recursive definition of the fibonacci function, suitably renamed. So these are not large files by my standards. The short summary is that CPU time seems to grow quadraticly with the length of the code. The required memory grows very quickly, too---I killed the compilation with k=5 (so 32 copies of fibonacci function) because the computation filled 32GB of RAM and 32GB of swap. Perhaps this parameterized input files might be of help. Brad I downloaded the git sources for gcc: heine:~/programs/gcc/gcc-mainline> git log commit 7eef9a66018e23677058fec421229e3fa435a1a3 (HEAD -> master, origin/master, origin/HEAD) Author: Joel Brobecker Date: Mon Mar 8 23:59:37 2021 -0300 I configured and built gcc with heine:~/programs/gcc/gcc-mainline> /pkgs/gcc-mainline/bin/gcc -v Using built-in specs. COLLECT_GCC=/pkgs/gcc-mainline/bin/gcc COLLECT_LTO_WRAPPER=/pkgs/gcc-mainline/libexec/gcc/x86_64-pc-linux-gnu/11.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../../gcc-mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-checking=release Thread model: posix Supported LTO compression algorithms: zlib gcc version 11.0.1 20210309 (experimental) (GCC) The program names are fib-1.c to fib-5.c, fib-k.c contains 2^k copies of fibonacci. /pkgs/gcc-mainline/bin/gcc -march=native -D___CAN_IMPORT_CLIB_DYNAMICALLY -O1 -Wno-unused -Wno-write-strings -Wdisabled-optimization -fwrapv -fno-strict-aliasing -fno-trapping-math -fno-math-errno -fschedule-insns2 -fomit-frame-pointer -fPIC -fno-common -mpc64 -rdynamic -shared -D___SINGLE_HOST -D___DYNAMIC -I"/home/lucier/programs/gambit/gambit-profiled/include" -o 'fib-1.o1' -Q -fprofile-arcs -ftest-coverage -save-temps 'fib-1.c' Time variable usr sys wall GGC phase setup: 0.02 (100%) 0.00 ( 0%) 0.03 (100%) 5039k (100%) TOTAL : 0.02 0.00 0.03 5049k btowc wctob mbrlen ___H_fib_2d_1 ___setup_mod ___init_mod ___LNK_fib_2d_1_2e_o1 Analyzing compilation unit Performing interprocedural optimizations <*free_lang_data> {heap 1240k} {heap 1240k} {heap 1240k} {heap 1240k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k}Streaming LTO {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k} {heap 2468k}Assembling functions: {heap 2468k} ___setup_mod ___init_mod ___H_fib_2d_1 ___LNK_fib_2d_1_2e_o1 _sub_I_00100_0 _sub_D_00100_1 Time variable usr sys wall GGC phase setup: 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 1%) 1519k ( 6%) phase parsing : 0.06 ( 8%) 0.01 ( 20%) 0.08 ( 10%) 2072k ( 8%) phase opt and generate : 0.67 ( 92%) 0.04 ( 80%) 0.70 ( 89%) 22M ( 86%) dump files : 0.01 ( 1%) 0.00 ( 0%) 0.00 ( 0%) 0 ( 0%) callgraph functions expansion : 0.66 ( 90%) 0.03 ( 60%) 0.69 ( 87%) 21M ( 82%) callgraph ipa passes : 0.01 ( 1%) 0.00 ( 0%) 0.01 ( 1%) 570k ( 2%) cfg cleanup: 0.00 ( 0%) 0.00 ( 0%) 0.04 ( 5%) 64 ( 0%) trivially dead code: 0.00 ( 0%) 0.01 ( 20%) 0.00 ( 0%) 0 ( 0%) df live regs : 0.01 ( 1%) 0.00 ( 0%) 0.02 ( 3%) 0 ( 0%) df live regs : 0.02 ( 3%) 0.00 ( 0%) 0.02 ( 3%) 0 ( 0%) df reg dead/unused notes : 0.02 ( 3%) 0.00 ( 0%) 0.01 ( 1%) 305k ( 1%) alias analysis : 0.01 ( 1%) 0.00 ( 0%) 0.01 ( 1%) 1482k ( 6%) alias stmt walking : 0.02 ( 3%) 0.01 ( 20%) 0.02 ( 3%) 7280 ( 0%) rebuild jump labels: 0.01 ( 1%) 0.00 ( 0%) 0.00 ( 0%) 0 ( 0%) preprocessing : 0.02 ( 3%) 0.00 ( 0%) 0.01 ( 1%) 240k ( 1%) lexical analysis : 0.02 ( 3%) 0.01 ( 20%) 0.00 ( 0%) 0 ( 0%) parser (global): 0.01 ( 1%) 0.00 ( 0%) 0.04 ( 5%) 1239k ( 5%) parser struct body : 0.01 ( 1%) 0.00 ( 0%) 0.01 ( 1%) 359k ( 1%) parser function body : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 3%) 201k ( 1%) tree gimplify : 0.00 ( 0%) 0.01 ( 20%) 0.00 ( 0%) 297k ( 1%) tree copy propagation : 0.01 ( 1%) 0.00 (
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #33 from rguenther at suse dot de --- On Tue, 29 Sep 2020, lucier at math dot purdue.edu wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 > > --- Comment #32 from lucier at math dot purdue.edu --- > I don't know precisely what you're saying, but it compiles fine without the > instrumentation. Yes - the instrumentation does complicate the IL but the instrumentation should be already better than linear in the blocks.
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #32 from lucier at math dot purdue.edu --- I don't know precisely what you're saying, but it compiles fine without the instrumentation.
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #31 from Richard Biener --- (In reply to lucier from comment #30) > I'm coming back to this project. > > I naively thought "Well, I don't need arc profiling, I'll just set > -ftest-coverage without -fprofile-arcs" but it appears that I can't do that, > the gcda files are generated by -fprofile-arcs. > > It seems to me that test coverage could be implemented simply by > instrumenting each basic block in an algorithm that's linear in the number > of basic blocks. Is it possible to do this? > > Brad I don't think the instrumentation itself is the problem - it's already doing better than one counter per block. It's simply that the large source runs into multiple non-linearities in core pieces of the compiler that cannot be turned off ...
[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928 --- Comment #30 from lucier at math dot purdue.edu --- I'm coming back to this project. I naively thought "Well, I don't need arc profiling, I'll just set -ftest-coverage without -fprofile-arcs" but it appears that I can't do that, the gcda files are generated by -fprofile-arcs. It seems to me that test coverage could be implemented simply by instrumenting each basic block in an algorithm that's linear in the number of basic blocks. Is it possible to do this? Brad