On Wed, Oct 2, 2024 at 9:13 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > > Am 01.10.2024 um 17:11 schrieb Matthias Kretz via Gcc <gcc@gcc.gnu.org>:
> > >
> > > Hi,
> > >
> > > the <experimental/simd> unit tests are my long-standing pain point of
> > > excessive compiler memory usage and compile times. I've always worked
> > > around the memory-usage problem by splitting the test matrix into
> > > multiple translation units (with different -D flags) of the same
> > > source file, i.e. I pay with a huge number of compiler invocations to
> > > be able to compile at all. OOM kills / thrashing isn't fun.
> > >
> > > Recently, the GNU Radio 4 implementation hit a similar issue of
> > > excessive compiler memory usage and compile times. The worst-case
> > > example I have tested (a single TU on a Xeon @ 4.50 GHz, 64 GB RAM,
> > > no swapping while compiling):
> > >
> > >   GCC 15:   13m03s, 30.413 GB (checking enabled)
> > >   GCC 14:   12m03s, 15.248 GB
> > >   GCC 13:   11m40s, 14.862 GB
> > >   Clang 18:  8m10s, 10.811 GB
> > >
> > > That's supposed to be a unit test, but it's obviously nothing one can
> > > use for test-driven development. So how do mere mortals optimize code
> > > for better compile times? -ftime-report is interesting but not really
> > > helpful. -Q has interesting information, but the output format is
> > > unusable for C++ and it's really hard to post-process.
> > >
> > > When compiler memory usage goes through the roof it's fairly obvious
> > > that compile times have to suffer. So I was wondering whether there is
> > > any low-hanging fruit to pick. I've managed to come up with a small
> > > torture test that shows interesting behavior. I put it at
> > > https://github.com/mattkretz/template-torture-test.
> > > Simply do
> > >
> > >   git clone https://github.com/mattkretz/template-torture-test
> > >   cd template-torture-test
> > >   make STRESS=7
> > >   make TORTURE=1 STRESS=5
> > >
> > > These settings can already "kill" smaller machines. Be prepared to
> > > kill cc1plus before things get out of hand.
> > >
> > > The bit I find interesting in this test is switched with the -D
> > > GO_FAST macro (the 'all' target always compiles with and without
> > > GO_FAST). With the macro, template arguments to
> > > 'Operand<typename...>' are tree-like and the resulting type name is
> > > *longer*, yet GGC usage is only at 442M. Without GO_FAST, template
> > > arguments to 'Operand<typename...>' are a flat list, and GGC usage is
> > > at 22890M. The latter variant needs 24x longer to compile.
> > >
> > > Are long flat template argument/parameter lists a special problem?
> > > Why do they make overload resolution *so much more* expensive?
> > >
> > > Beyond that torture test (should I turn it into a PR?), what can I do
> > > to help?
> >
> > Analyze where the compile time is spent and where memory is spent.
> > Identify unfitting data structures and algorithms causing the issue.
> > Replace them with better ones. That's what I do for these kinds of
> > issues in the middle end.
>
> So seeing
>
>   overload resolution    : 42.89 ( 67%)  1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
>   template instantiation : 47.25 ( 73%)  1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
>
> it seems obvious that you are using an excessive number of template
> instantiations and compilers are not prepared to make those "lean".
> perf shows (GCC 14.2 release build):
>
>   Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
>   Overhead  Samples  Command  Shared Object  Symbol
>     26.96%    69216  cc1plus  cc1plus        [.] iterative_hash
>      7.66%    19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
>      5.34%    13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
>      5.11%    13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
>      4.60%    11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
>      4.14%    10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
>
> where the excessive use of iterative_hash_object makes it slower than
> necessary. I can only guess, but replacing
>
>   val = iterative_hash_object (code, val);
>
> with iterative_hash_hashval_t or iterative_hash_host_wide_int might
> help a lot. Likewise replacing
>
>   case IDENTIFIER_NODE:
>     return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
>
> with iterative_hash_hashval_t. Using inchash for the whole API might
> help as well.
Fixing the above results in the following; I'll test & submit a patch.

  Samples: 283K of event 'cycles:Pu', Event count (approx.): 318742588396
  Overhead  Samples  Command  Shared Object  Symbol
    13.92%    39577  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
    10.73%    29883  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
    10.11%    28811  cc1plus  cc1plus        [.] iterative_hash
     5.33%    15254  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node

-fmem-report shows

  cp/pt.cc:4392 (expand_template_argument_pack)      0 :  0.0%  10101M: 43.0%      0 :  0.0%   2122M: 35.5%    90k

There's both a 20% overhead due to GC allocation granularity and the issue
that we do not collect those vectors. There are at least some places (uses
of the function) that look like they could ggc_free the vector, as it
doesn't escape (like placeholder_type_constraint_dependent_p), but the API
would need to indicate whether it performed any expansion.

The other big offenders are

  cp/pt.cc:9001 (coerce_template_parameter_pack)  2161M: 41.1%   7940M: 33.8%      0 :  0.0%   2122M: 35.5%    90k
  cp/pt.cc:24388 (unify_pack_expansion)           2551M: 48.5%   2800M: 11.9%      0 :  0.0%   1130M: 18.9%    58k

also make_tree_vec cases related to pack expansion. I'm not sure we need
to put those vectors into GC memory throughout all this; using heap
vectors with appropriate lifetimes (smart pointers?) would possibly be a
better solution. I'll leave that to front-end folks, but you can also try
looking. For example, the second one above is

  TREE_TYPE (packs) = make_tree_vec (len - start);

maybe there is already a vec in TREE_TYPE we could ggc_free. Maybe it's
possible to "share" packs?

Richard.

> This won't improve memory use, of course; making "leaner" template
> instantiations likely would help (maybe somehow allow on-demand copying
> of sub-structures?).
>
> Richard.
>
> > Richard
> >
> > > Thanks,
> > >   Matthias
> > >
> > > --
> > > ──────────────────────────────────────────────────────────────────────────
> > >  Dr. Matthias Kretz                           https://mattkretz.github.io
> > >  GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > >  std::simd
> > > ──────────────────────────────────────────────────────────────────────────
> > > <signature.asc>