On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
>
>
> > On 01.10.2024 at 17:11, Matthias Kretz via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > the <experimental/simd> unit tests are my long-standing pain point of
> > excessive compiler memory usage and compile times. I've always worked
> > around the memory usage problem by splitting the test matrix into
> > multiple translation units, i.e. compiling the same source file many
> > times with different -D flags. That means paying with a huge number of
> > compiler invocations to be able to compile at all. OOM kills /
> > thrashing isn't fun.
> >
> > Recently, the GNU Radio 4 implementation hit a similar issue of
> > excessive compiler memory usage and compile times. The worst-case
> > example I have tested (a single TU on a Xeon @ 4.50 GHz, 64 GB RAM, no
> > swapping while compiling):
> >
> > GCC 15: 13m03s, 30.413 GB (checking enabled)
> > GCC 14: 12m03s, 15.248 GB
> > GCC 13: 11m40s, 14.862 GB
> > Clang 18: 8m10s, 10.811 GB
> >
> > That's supposed to be a unit test, but it's nothing one can use for
> > test-driven development, obviously. So how do mere mortals optimize
> > code for better compile times? -ftime-report is interesting but not
> > really helpful. -Q has interesting information, but its output format
> > is unusable for C++ and really hard to post-process.
> >
> > When compiler memory usage goes through the roof, it's fairly obvious
> > that compile times have to suffer. So I was wondering whether there
> > are any low-hanging fruit to pick. I've managed to come up with a
> > small torture test that shows interesting behavior. I put it at
> > https://github.com/mattkretz/template-torture-test. Simply do
> >
> > git clone https://github.com/mattkretz/template-torture-test
> > cd template-torture-test
> > make STRESS=7
> > make TORTURE=1 STRESS=5
> >
> > These numbers can already "kill" smaller machines. Be prepared to
> > kill cc1plus before things get out of hand.
> >
> > The bit I find interesting in this test is switched with the -D
> > GO_FAST macro (the 'all' target always compiles with and without
> > GO_FAST). With the macro, template arguments to 'Operand<typename...>'
> > are tree-like and the resulting type name is *longer*, yet GGC usage
> > is only at 442M. Without GO_FAST, template arguments to
> > 'Operand<typename...>' are a flat list, and GGC usage is at 22890M.
> > The latter variant also needs 24x longer to compile.
> >
> > Are long flat template argument/parameter lists a special problem? Why
> > do they make overload resolution *so much more* expensive?
> >
> > Beyond that torture test (should I turn it into a PR?), what can I do
> > to help?
>
> Analyze where the compile time is spent and where memory is spent.
> Identify ill-fitting data structures and algorithms causing the issue.
> Replace them with better ones. That's what I do for this kind of issue
> in the middle end.

So seeing

 overload resolution                :  42.89 ( 67%)   1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
 template instantiation             :  47.25 ( 73%)   1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)

in the -ftime-report output, it seems obvious that you are using an
excessive number of template instantiations, and compilers are not
prepared to make those "lean".  perf shows (GCC 14.2 release build):

Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
Overhead       Samples  Command  Shared Object  Symbol
  26.96%         69216  cc1plus  cc1plus        [.] iterative_hash
   7.66%         19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
   5.34%         13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
   5.11%         13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
   4.60%         11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
   4.14%         10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib

where the excessive use of iterative_hash_object makes it slower than
necessary.  I can only guess, but replacing

  val = iterative_hash_object (code, val);

with iterative_hash_hashval_t or iterative_hash_host_wide_int might
help a lot.  Likewise, replacing

    case IDENTIFIER_NODE:
      return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);

with iterative_hash_hashval_t.  Using inchash for the whole API might
help as well.
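
To make that concrete, here is a minimal, untested sketch of the two
replacements (assuming both snippets come from
iterative_hash_template_arg in gcc/cp/pt.cc, which is where the profile
above points):

-  val = iterative_hash_object (code, val);
+  /* CODE is a small enum; no need to hash its bytes.  */
+  val = iterative_hash_hashval_t (code, val);

     case IDENTIFIER_NODE:
-      return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
+      /* The identifier already carries a precomputed hash value,
+	 so mix it in directly instead of hashing its bytes.  */
+      return iterative_hash_hashval_t (IDENTIFIER_HASH_VALUE (arg), val);

iterative_hash_hashval_t and iterative_hash_host_wide_int are inline
functions in gcc/inchash.h, so this avoids the generic memory-hashing
loop in iterative_hash that dominates the profile.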

This won't improve memory use, of course; making template
instantiations "leaner" likely would help there (maybe somehow allow
on-demand copying of sub-structures?).

Richard.



>
> Richard
>
> > Thanks,
> >  Matthias
> >
> > --
> > ──────────────────────────────────────────────────────────────────────────
> > Dr. Matthias Kretz                           https://mattkretz.github.io
> > GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > std::simd
> > ──────────────────────────────────────────────────────────────────────────
> > <signature.asc>
