On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
<[email protected]> wrote:
>
>
>
> > On 01.10.2024 at 17:11, Matthias Kretz via Gcc <[email protected]> wrote:
> >
> > Hi,
> >
> > the <experimental/simd> unit tests are my long-standing pain point of
> > excessive compiler memory usage and compile times. I've always worked around
> > the memory usage problem by splitting the test matrix into multiple
> > translations (with different -D flags) of the same source file, i.e. paying
> > with a huge number of compiler invocations to be able to compile at all.
> > OOM kills / thrashing aren't fun.
> >
> > Recently, the GNU Radio 4 implementation hit a similar issue of excessive
> > compiler memory usage and compile times. The worst case I have tested (a
> > single TU on a Xeon @ 4.50 GHz, 64 GB RAM, no swapping while compiling):
> >
> > GCC 15:   13m03s, 30.413 GB (checking enabled)
> > GCC 14:   12m03s, 15.248 GB
> > GCC 13:   11m40s, 14.862 GB
> > Clang 18:  8m10s, 10.811 GB
> >
> > That's supposed to be a unit test, but it's obviously nothing one can use
> > for test-driven development. So how do mere mortals optimize code for
> > better compile times? -ftime-report is interesting but not really helpful.
> > -Q has interesting information, but its output format is unusable for C++
> > and really hard to post-process.
> >
> > When compiler memory usage goes through the roof, it's fairly obvious that
> > compile times have to suffer. So I was wondering whether there is any
> > low-hanging fruit to pick. I've managed to come up with a small torture
> > test that shows interesting behavior. I put it at
> > https://github.com/mattkretz/template-torture-test. Simply do
> >
> > git clone https://github.com/mattkretz/template-torture-test
> > cd template-torture-test
> > make STRESS=7
> > make TORTURE=1 STRESS=5
> >
> > These numbers can already "kill" smaller machines. Be prepared to kill
> > cc1plus before things get out of hand.
> >
> > The bit I find interesting in this test is switched with the -D GO_FAST
> > macro (the 'all' target always compiles with and without GO_FAST). With
> > the macro, template arguments to 'Operand<typename...>' are tree-like and
> > the resulting type name is *longer*, yet GGC usage is only at 442M.
> > Without GO_FAST, template arguments to 'Operand<typename...>' are a flat
> > list, but GGC usage is at 22890M, and that variant needs 24x longer to
> > compile.
> >
> > Are long flat template argument/parameter lists a special problem? Why do
> > they make overload resolution *so much more* expensive?
> >
> > Beyond that torture test (should I turn it into a PR?), what can I do to
> > help?
>
> Analyze where the compile time is spent and where memory is spent. Identify
> unfitting data structures and algorithms causing the issue. Replace them with
> better ones. That's what I do for this kind of issue in the middle end.
So seeing

  overload resolution     : 42.89 ( 67%)  1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
  template instantiation  : 47.25 ( 73%)  1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
it seems obvious that you are using an excessive number of template
instantiations and compilers are not prepared to make those "lean".
perf shows (GCC 14.2 release build):
  Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
  Overhead   Samples  Command  Shared Object  Symbol
    26.96%     69216  cc1plus  cc1plus        [.] iterative_hash
     7.66%     19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
     5.34%     13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
     5.11%     13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
     4.60%     11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
     4.14%     10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
where the excessive use of iterative_hash_object makes it slower than
necessary. I can only guess, but replacing

  val = iterative_hash_object (code, val);

with iterative_hash_hashval_t or iterative_hash_host_wide_int might help
a lot. Likewise replacing

  case IDENTIFIER_NODE:
    return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);

with iterative_hash_hashval_t. Using inchash for the whole API might help
as well.
This won't improve memory use, of course; making "leaner" template
instantiations likely would help (maybe somehow allow on-demand copying of
sub-structures?).
Richard.
>
> Richard
>
> > Thanks,
> > Matthias
> >
> > --
> > ──────────────────────────────────────────────────────────────────────────
> > Dr. Matthias Kretz https://mattkretz.github.io
> > GSI Helmholtz Center for Heavy Ion Research https://gsi.de
> > std::simd
> > ──────────────────────────────────────────────────────────────────────────