On Wed, Oct 2, 2024 at 9:54 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 9:13 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > >
> > > > On 01.10.2024 at 17:11, Matthias Kretz via Gcc <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > the <experimental/simd> unit tests are my long-standing pain point of
> > > > excessive compiler memory usage and compile times. I've always worked around
> > > > the memory-usage problem by splitting the test matrix into multiple
> > > > translations (with different -D flags) of the same source file, i.e. paying
> > > > with a huge number of compiler invocations to be able to compile at all.
> > > > OOM kills / thrashing isn't fun.
> > > >
> > > > Recently, the GNU Radio 4 implementation hit a similar issue of excessive
> > > > compiler memory usage and compile times. The worst case I have tested (a
> > > > single TU on a Xeon @ 4.50 GHz, 64 GB RAM, no swapping while compiling):
> > > >
> > > > GCC 15:   13m03s, 30.413 GB (checking enabled)
> > > > GCC 14:   12m03s, 15.248 GB
> > > > GCC 13:   11m40s, 14.862 GB
> > > > Clang 18:  8m10s, 10.811 GB
> > > >
> > > > That's supposed to be a unit test, but it's obviously nothing one can use
> > > > for test-driven development. So how do mere mortals optimize code for
> > > > better compile times? -ftime-report is interesting but not really helpful.
> > > > -Q has interesting information, but the output format is unusable for C++
> > > > and really hard to post-process.
> > > >
> > > > When compiler memory usage goes through the roof, it's fairly obvious that
> > > > compile times have to suffer. So I was wondering whether there is any
> > > > low-hanging fruit to pick.
> > > > I've managed to come up with a small torture test that
> > > > shows interesting behavior. I put it at
> > > > https://github.com/mattkretz/template-torture-test. Simply do
> > > >
> > > >   git clone https://github.com/mattkretz/template-torture-test
> > > >   cd template-torture-test
> > > >   make STRESS=7
> > > >   make TORTURE=1 STRESS=5
> > > >
> > > > These numbers can already "kill" smaller machines. Be prepared to kill
> > > > cc1plus before things get out of hand.
> > > >
> > > > The bit I find interesting in this test is switched with the -D GO_FAST
> > > > macro (the 'all' target always compiles with and without GO_FAST). With the
> > > > macro, template arguments to 'Operand<typename...>' are tree-like and the
> > > > resulting type name is *longer*, yet GGC usage is only at 442M. Without
> > > > GO_FAST, template arguments to 'Operand<typename...>' are a flat list, and
> > > > GGC usage is at 22890M. The latter variant needs 24x longer to compile.
> > > >
> > > > Are long flat template argument/parameter lists a special problem? Why do
> > > > they make overload resolution *so much more* expensive?
> > > >
> > > > Beyond that torture test (should I turn it into a PR?), what can I do to help?
> > >
> > > Analyze where the compile time is spent and where memory is spent.
> > > Identify unfitting data structures and algorithms causing the issue.
> > > Replace them with better ones. That's what I do for these kinds of issues
> > > in the middle end.
> >
> > So seeing
> >
> >   overload resolution    : 42.89 ( 67%)  1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
> >   template instantiation : 47.25 ( 73%)  1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
> >
> > it seems obvious that you are using an excessive number of template
> > instantiations, and compilers are not prepared to make those "lean".
> > perf shows (GCC 14.2 release build):
> >
> >   Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
> >   Overhead  Samples  Command  Shared Object  Symbol
> >     26.96%    69216  cc1plus  cc1plus        [.] iterative_hash
> >      7.66%    19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
> >      5.34%    13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
> >      5.11%    13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
> >      4.60%    11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
> >      4.14%    10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
> >
> > where the excessive use of iterative_hash_object makes it slower than
> > necessary. I can only guess, but replacing
> >
> >   val = iterative_hash_object (code, val);
> >
> > with iterative_hash_hashval_t or iterative_hash_host_wide_int
> > might help a lot. Likewise:
> >
> >     case IDENTIFIER_NODE:
> >       return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
> >
> > with iterative_hash_hashval_t. Using inchash for the whole API might
> > help as well.

Fixing the above results in the following; I'll test & submit a patch.

  Samples: 283K of event 'cycles:Pu', Event count (approx.): 318742588396
  Overhead  Samples  Command  Shared Object  Symbol
    13.92%    39577  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
    10.73%    29883  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
    10.11%    28811  cc1plus  cc1plus        [.] iterative_hash
     5.33%    15254  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node

-fmem-report shows

  cp/pt.cc:4392 (expand_template_argument_pack)  0 : 0.0%  10101M: 43.0%  0 : 0.0%  2122M: 35.5%  90k

there's both a 20% overhead due to GC allocation granularity and the
issue that we do not collect those vectors.
> There are at least some places (uses of the function) that look like they
> could use ggc_free on the vector, as it doesn't escape (like
> placeholder_type_constraint_dependent_p), but the API would need to
> indicate whether it performed any expansion.
>
> The other big offenders are
>
>   cp/pt.cc:9001 (coerce_template_parameter_pack)  2161M: 41.1%  7940M: 33.8%  0 : 0.0%  2122M: 35.5%  90k
>   cp/pt.cc:24388 (unify_pack_expansion)           2551M: 48.5%  2800M: 11.9%  0 : 0.0%  1130M: 18.9%  58k
>
> also make_tree_vec cases related to pack expansion. I'm not sure we
> need to put those vectors into GC memory throughout all this; using
> heap vectors with appropriate lifetime (smart pointers?) would possibly
> be a better solution. I'll leave that to the frontend folks, but you
> can also try looking. For example, the second one above is
>
>   TREE_TYPE (packs) = make_tree_vec (len - start);
>
> maybe there is already a vec in TREE_TYPE we could ggc_free. Maybe
> it's possible to "share" packs?
It looks like, for example, the "outermost" template pack expansion that
unify_bound_ttp_args does might be done for each parm/arg pair we match?
The expansion vector also looks dead there.

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 04f0a1d5fff..0ad15675c92 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -9442,6 +9460,9 @@ coerce_template_parms (tree parms,
     SET_NON_DEFAULT_TEMPLATE_ARGS_COUNT (new_inner_args,
                                          TREE_VEC_LENGTH (new_inner_args));

+  if ((return_full_args ? new_args != inner_args : new_inner_args != inner_args)
+      && inner_args != orig_inner_args)
+    ggc_free (inner_args);
   return return_full_args ? new_args : new_inner_args;
 }

This makes an 800MB difference in peak memory use (but is otherwise
untested). I've tried plugging the holes in a similar way elsewhere, but
that didn't make a difference for peak memory use. Even callgrind isn't of
much help in discovering the ultimate callers when
expand_template_argument_pack allocates memory. User-level tracing tools
might help, but I'm not familiar with those.

Richard.

> Richard.
>
> > This won't improve memory use of course; making "leaner" template
> > instantiations likely would help
> > (maybe somehow allow on-demand copying of sub-structures?).
> >
> > Richard.
> >
> > >
> > > Richard
> > >
> > > > Thanks,
> > > > Matthias
> > > >
> > > > --
> > > > ──────────────────────────────────────────────────────────────────────────
> > > >  Dr. Matthias Kretz                           https://mattkretz.github.io
> > > >  GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > > >  std::simd
> > > > ──────────────────────────────────────────────────────────────────────────
> > > > <signature.asc>