Hi,

the <experimental/simd> unit tests have long been a pain point for me because
of excessive compiler memory usage and compile times. I've always worked
around the memory usage problem by splitting the test matrix into multiple
translations (with different -D flags) of the same source file. I.e. I pay
with a huge number of compiler invocations to be able to compile at all. OOM
kills and thrashing aren't fun.
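
As a rough illustration of that workaround (a sketch only; TEST_TYPE,
sum_of_first and run_slice are made-up names and not taken from the actual
simd tests), one source file can select its slice of the test matrix from a
macro set on the command line:

```cpp
// Hypothetical sketch: one source file, compiled once per element type.
// TEST_TYPE is an assumed macro name, set per compiler invocation, e.g.
//   g++ -DTEST_TYPE=float  -c tests.cpp -o tests_float.o
//   g++ -DTEST_TYPE=double -c tests.cpp -o tests_double.o
#include <cstddef>
#include <type_traits>

#ifndef TEST_TYPE
#define TEST_TYPE float  // default slice when no -D flag is given
#endif

// Stand-in for an expensive, heavily templated test body.
template <typename T, std::size_t N>
constexpr T sum_of_first()
{
  T sum = T();
  for (std::size_t i = 1; i <= N; ++i)
    sum += static_cast<T>(i);
  return sum;
}

// Only the selected slice of the matrix is instantiated in this TU.
template <typename T>
constexpr bool run_slice()
{
  return sum_of_first<T, 4>() == T(10) && sum_of_first<T, 8>() == T(36);
}

static_assert(std::is_arithmetic<TEST_TYPE>::value,
              "TEST_TYPE must be an arithmetic type");
```

Each invocation then only instantiates the templates for one type, which
caps peak memory per cc1plus process at the cost of parsing the same file
many times.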

Recently, the GNU Radio 4 implementation hit a similar issue of excessive
compiler memory usage and compile times. The worst case I have tested (a
single TU on a Xeon @ 4.50 GHz with 64 GB RAM, no swapping while compiling):

GCC 15: 13m03s, 30.413 GB (checking enabled)
GCC 14: 12m03s, 15.248 GB
GCC 13: 11m40s, 14.862 GB
Clang 18: 8m10s, 10.811 GB

That's supposed to be a unit test, but it's obviously nothing one can use for
test-driven development. How do mere mortals optimize code for better compile
times? -ftime-report is interesting but not really helpful. -Q has interesting
information, but the output format is unusable for C++ and it's really hard
to post-process.

When compiler memory usage goes through the roof, it's fairly obvious that
compile times have to suffer. So I was wondering whether there is any
low-hanging fruit to pick. I've managed to come up with a small torture test
that shows interesting behavior. I put it at
https://github.com/mattkretz/template-torture-test. Simply run

git clone https://github.com/mattkretz/template-torture-test
cd template-torture-test
make STRESS=7
make TORTURE=1 STRESS=5

These numbers can already "kill" smaller machines. Be prepared to kill cc1plus 
before things get out of hand.

The bit I find interesting in this test is toggled by the -D GO_FAST macro
(the 'all' target always compiles both with and without GO_FAST). With the
macro, template arguments to 'Operand<typename...>' are tree-like and the
resulting type name is *longer*. But GGC usage is only at 442M. Without
GO_FAST, template arguments to 'Operand<typename...>' are a flat list. But GGC
usage is at 22890M. The latter variant takes 24x longer to compile.
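
To make the two shapes concrete, here is a minimal sketch. It is not the code
from the repository: the leaf types A, B, C, D, the operator+ overloads and
the alias Sum are made up for illustration, and only the name Operand and the
GO_FAST macro are taken from the text above.

```cpp
#include <type_traits>

// Hypothetical stand-in for the class template named in the text.
template <typename... Ts>
struct Operand {};

struct A {}; struct B {}; struct C {}; struct D {};

#ifdef GO_FAST
// Tree-like: combining two operands nests them, so every result has
// exactly two template arguments, but the spelled-out type name grows.
template <typename... Ls, typename... Rs>
constexpr Operand<Operand<Ls...>, Operand<Rs...>>
operator+(Operand<Ls...>, Operand<Rs...>) { return {}; }
#else
// Flat: combining two operands concatenates their argument lists, so
// the list gets longer with every operation.
template <typename... Ls, typename... Rs>
constexpr Operand<Ls..., Rs...>
operator+(Operand<Ls...>, Operand<Rs...>) { return {}; }
#endif

using Sum = decltype((Operand<A>() + Operand<B>())
                     + (Operand<C>() + Operand<D>()));

#ifdef GO_FAST
static_assert(std::is_same<Sum,
                Operand<Operand<Operand<A>, Operand<B>>,
                        Operand<Operand<C>, Operand<D>>>>::value,
              "tree-like arguments");
#else
static_assert(std::is_same<Sum, Operand<A, B, C, D>>::value,
              "flat argument list");
#endif
```

In the flat variant every further '+' has to deduce and substitute ever
longer parameter packs, whereas the tree-like variant keeps each individual
argument list short even though the overall type name is longer.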

Are long flat template argument/parameter lists a special problem? Why do they
make overload resolution *so much more* expensive?

Beyond that torture test (should I turn it into a PR?), what can I do to help?
 
Thanks,
  Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
