On Wed, Oct 2, 2024 at 9:13 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> >
> >
> > > Am 01.10.2024 um 17:11 schrieb Matthias Kretz via Gcc <gcc@gcc.gnu.org>:
> > >
> > > Hi,
> > >
> > > the <experimental/simd> unit tests are my long-standing pain point of
> > > excessive compiler memory usage and compile times. I've always worked
> > > around the memory usage problem by splitting the test matrix into
> > > multiple translation units (compiling the same source file with
> > > different -D flags), i.e. paying with a huge number of compiler
> > > invocations to be able to compile at all. OOM kills / thrashing isn't
> > > fun.
> > >
> > > Recently, the GNU Radio 4 implementation hit a similar issue of
> > > excessive compiler memory usage and compile times. The worst-case
> > > example I have tested (a single TU on a Xeon @ 4.50 GHz, 64 GB RAM,
> > > no swapping while compiling):
> > >
> > > GCC 15: 13m03s, 30.413 GB (checking enabled)
> > > GCC 14: 12m03s, 15.248 GB
> > > GCC 13: 11m40s, 14.862 GB
> > > Clang 18: 8m10s, 10.811 GB
> > >
> > > That's supposed to be a unit test, but it's obviously nothing one can
> > > use for test-driven development. So how do mere mortals optimize code
> > > for better compile times? -ftime-report is interesting but not really
> > > helpful. -Q has interesting information, but the output format is
> > > unusable for C++ and it's really hard to post-process.
> > >
> > > When compiler memory usage goes through the roof, it's fairly obvious
> > > that compile times have to suffer. So I was wondering whether there is
> > > any low-hanging fruit to pick. I've managed to come up with a small
> > > torture test that shows interesting behavior. I put it at
> > > https://github.com/mattkretz/template-torture-test. Simply do
> > >
> > > git clone https://github.com/mattkretz/template-torture-test
> > > cd template-torture-test
> > > make STRESS=7
> > > make TORTURE=1 STRESS=5
> > >
> > > These numbers can already "kill" smaller machines. Be prepared to
> > > kill cc1plus before things get out of hand.
> > >
> > > The bit I find interesting in this test is switched with the -D GO_FAST
> > > macro (the 'all' target always compiles with and without GO_FAST). With
> > > the macro, template arguments to 'Operand<typename...>' are tree-like
> > > and the resulting type name is *longer*, yet GGC usage is only at 442M.
> > > Without GO_FAST, template arguments to 'Operand<typename...>' are a
> > > flat list, and GGC usage is at 22890M. The latter variant also needs
> > > 24x longer to compile.
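> > >
> > > To illustrate the difference (a made-up sketch, not the actual test
> > > code; 'A', 'B', ... are hypothetical placeholder types):
> > >
> > >   // GO_FAST: arguments nest, so each argument list stays short
> > >   using Tree = Operand<Operand<A, B>, Operand<C, D>>;
> > >   // default: one flat argument list that keeps growing
> > >   using Flat = Operand<A, B, C, D, E, F, G, H>;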
> > >
> > > Are long flat template argument/parameter lists a special problem? Why
> > > does it make overload resolution *so much more* expensive?
> > >
> > > Beyond that torture test (should I turn it into a PR?), what can I do
> > > to help?
> >
> > Analyze where the compile time is spent and where memory is spent.
> > Identify unfitting data structures and algorithms causing the issue.
> > Replace them with better ones. That's what I do for these kinds of
> > issues in the middle end.
>
> So seeing these -ftime-report lines (usr, sys, wall time and GGC memory)
>
>  overload resolution    :  42.89 ( 67%)   1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
>  template instantiation :  47.25 ( 73%)   1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
>
> it seems obvious that you are using an excessive number of template
> instantiations and compilers are not prepared to make those "lean".
> perf shows (GCC 14.2 release build):
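>
> A profile like this can be collected with something along the lines of
> (hypothetical invocation; the actual command line wasn't shown):
>
>   perf record -e cycles:Pu -- ./cc1plus <args...>
>   perf report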
>
> Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
> Overhead       Samples  Command  Shared Object  Symbol
>   26.96%         69216  cc1plus  cc1plus        [.] iterative_hash
>    7.66%         19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
>    5.34%         13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
>    5.11%         13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
>    4.60%         11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
>    4.14%         10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
>
> where the excessive use of iterative_hash_object makes it slower than
> necessary. I can only guess, but replacing
>
>   val = iterative_hash_object (code, val);
>
> with iterative_hash_hashval_t or iterative_hash_host_wide_int might
> help a lot. Likewise:
>
>     case IDENTIFIER_NODE:
>       return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
>
> with iterative_hash_hashval_t. Using inchash for the whole API might
> help as well.
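>
> A minimal sketch of that change in iterative_hash_template_arg in
> cp/pt.cc (assuming the surrounding code stays as-is;
> iterative_hash_object hashes the object's bytes through the generic
> iterative_hash, while iterative_hash_hashval_t mixes the value
> directly):
>
>   /* Before: walks sizeof (code) bytes via iterative_hash.  */
>   val = iterative_hash_object (code, val);
>   /* After: a tree code fits in a hashval_t, mix it directly.  */
>   val = iterative_hash_hashval_t (code, val);
>
>   /* Likewise, IDENTIFIER_HASH_VALUE already is a hash value:  */
>   case IDENTIFIER_NODE:
>     return iterative_hash_hashval_t (IDENTIFIER_HASH_VALUE (arg), val);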

Fixing the above results in the following; I'll test & submit a patch.

Samples: 283K of event 'cycles:Pu', Event count (approx.): 318742588396
Overhead       Samples  Command  Shared Object  Symbol
  13.92%         39577  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
  10.73%         29883  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
  10.11%         28811  cc1plus  cc1plus        [.] iterative_hash
   5.33%         15254  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node

-fmem-report shows

cp/pt.cc:4392 (expand_template_argument_pack)       0 :  0.0%  10101M: 43.0%     0 :  0.0%  2122M: 35.5%   90k

there's both a 20% overhead due to GC allocation granularity and the
issue that we never collect those vectors. At least some places (uses
of the function) look like they could ggc_free the vector because it
doesn't escape (like placeholder_type_constraint_dependent_p), but the
API would need to indicate whether it performed any expansion.
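A minimal sketch of that idea (hypothetical: it assumes
expand_template_argument_pack grows an out-parameter that reports
whether a fresh TREE_VEC was allocated):

  /* Expand ARGS; *EXPANDED_P says whether a new TREE_VEC was built.  */
  bool expanded_p = false;
  tree expanded = expand_template_argument_pack (args, &expanded_p);
  /* ... use EXPANDED purely locally, never storing it anywhere ... */
  if (expanded_p)
    ggc_free (expanded);  /* return the vector to the GC right away */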

The other big offenders are

cp/pt.cc:9001 (coerce_template_parameter_pack)   2161M: 41.1%   7940M: 33.8%     0 :  0.0%  2122M: 35.5%   90k
cp/pt.cc:24388 (unify_pack_expansion)            2551M: 48.5%   2800M: 11.9%     0 :  0.0%  1130M: 18.9%   58k

also make_tree_vec cases related to pack expansion. I'm not sure we
need to put those vectors into GC memory throughout all this; using
heap vectors with appropriate lifetime (smart pointers?) would possibly
be a better solution. I'll leave that to frontend folks, but you can
also try looking. For example, the 2nd one above is

      TREE_TYPE (packs) = make_tree_vec (len - start);

maybe there is already a vec in TREE_TYPE we could ggc_free. Maybe
it's possible to "share" packs?
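A sketch of the ggc_free variant (hypothetical, and only valid if the
old vector isn't referenced from anywhere else):

  /* If this pack already carries an expansion vector, release it
     before allocating the replacement.  */
  if (TREE_TYPE (packs) && TREE_CODE (TREE_TYPE (packs)) == TREE_VEC)
    ggc_free (TREE_TYPE (packs));
  TREE_TYPE (packs) = make_tree_vec (len - start);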

Richard.

> This won't improve memory use, of course; making "leaner" template
> instantiations likely would help (maybe somehow allow on-demand
> copying of sub-structures?).
>
> Richard.
>
>
>
> >
> > Richard
> >
> > > Thanks,
> > >  Matthias
> > >
> > > --
> > > ──────────────────────────────────────────────────────────────────────────
> > > Dr. Matthias Kretz                           https://mattkretz.github.io
> > > GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > > std::simd
> > > ──────────────────────────────────────────────────────────────────────────
