On Wed, Oct 2, 2024 at 9:54 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 9:13 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > >
> > >
> > > > On 01.10.2024 at 17:11, Matthias Kretz via Gcc <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > the <experimental/simd> unit tests are my long-standing pain point of
> > > > excessive compiler memory usage and compile times. I've always worked
> > > > around the memory usage problem by splitting the test matrix into
> > > > multiple translations (with different -D flags) of the same source
> > > > file, i.e. I pay with a huge number of compiler invocations to be able
> > > > to compile at all. OOM kills / thrashing isn't fun.
> > > >
> > > > Recently, the GNU Radio 4 implementation hit a similar issue of
> > > > excessive compiler memory usage and compile times. The worst case
> > > > example I have tested (a single TU on a Xeon @ 4.50 GHz, 64 GB RAM, no
> > > > swapping while compiling):
> > > >
> > > > GCC 15:   13m03s, 30.413 GB (checking enabled)
> > > > GCC 14:   12m03s, 15.248 GB
> > > > GCC 13:   11m40s, 14.862 GB
> > > > Clang 18:  8m10s, 10.811 GB
> > > >
> > > > That's supposed to be a unit test. But it's nothing one can use for
> > > > test-driven development, obviously. So how do mere mortals optimize
> > > > code for better compile times? -ftime-report is interesting but not
> > > > really helpful. -Q has interesting information, but the output format
> > > > is unusable for C++ and it's really hard to post-process.
> > > >
> > > > When compiler memory usage goes through the roof, it's fairly obvious
> > > > that compile times have to suffer. So I was wondering whether there is
> > > > any low-hanging fruit to pick. I've managed to come up with a small
> > > > torture test that shows interesting behavior. I put it at
> > > > https://github.com/mattkretz/template-torture-test. Simply do
> > > >
> > > > git clone https://github.com/mattkretz/template-torture-test
> > > > cd template-torture-test
> > > > make STRESS=7
> > > > make TORTURE=1 STRESS=5
> > > >
> > > > These numbers can already "kill" smaller machines. Be prepared to
> > > > kill cc1plus before things get out of hand.
> > > >
> > > > The bit I find interesting in this test is switched with the -D
> > > > GO_FAST macro (the 'all' target always compiles with and without
> > > > GO_FAST). With the macro, template arguments to 'Operand<typename...>'
> > > > are tree-like and the resulting type name is *longer*, yet GGC usage
> > > > is only at 442M. Without GO_FAST, template arguments to
> > > > 'Operand<typename...>' are a flat list, and GGC usage is at 22890M.
> > > > The latter variant needs 24x longer to compile.
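> > > >
> > > > To illustrate the shape difference (a hypothetical example, not the
> > > > actual test code):
> > > >
> > > >   template <typename...> struct Operand {};
> > > >
> > > >   // tree-like (GO_FAST): nesting depth grows, each level stays
> > > >   // short; the printed type name is longer
> > > >   using Tree = Operand<Operand<Operand<int, int>, Operand<int, int>>,
> > > >                        Operand<Operand<int, int>, Operand<int, int>>>;
> > > >
> > > >   // flat (no GO_FAST): a single, ever-growing argument list
> > > >   using Flat = Operand<int, int, int, int, int, int, int, int>;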
> > > >
> > > > Are long flat template argument/parameter lists a special problem? Why
> > > > does it make overload resolution *so much more* expensive?
> > > >
> > > > Beyond that torture test (should I turn it into a PR?), what can I do
> > > > to help?
> > >
> > > Analyze where the compile time is spent and where memory is spent.
> > > Identify unfitting data structures and algorithms causing the issue.
> > > Replace them with better ones. That's what I do for this kind of issue
> > > in the middle end.
> >
> > So seeing
> >
> >  overload resolution    :  42.89 ( 67%)   1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
> >  template instantiation :  47.25 ( 73%)   1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
> >
> > it seems obvious that you are using an excessive number of template
> > instantiations, and compilers are not prepared to make those "lean".
> > perf shows (GCC 14.2 release build):
> >
> > Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
> > Overhead       Samples  Command  Shared Object  Symbol
> >   26.96%         69216  cc1plus  cc1plus        [.] iterative_hash
> >    7.66%         19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
> >    5.34%         13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
> >    5.11%         13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
> >    4.60%         11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
> >    4.14%         10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
> >
> > where the excessive use of iterative_hash_object makes it slower than
> > necessary.  I can only guess, but replacing
> >
> >   val = iterative_hash_object (code, val);
> >
> > with iterative_hash_hashval_t or iterative_hash_host_wide_int might help
> > a lot.  Likewise:
> >
> >     case IDENTIFIER_NODE:
> >       return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
> >
> > with iterative_hash_hashval_t.  Using inchash for the whole API might
> > help as well.
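> >
> > A minimal sketch of what I mean (assuming the inline
> > iterative_hash_hashval_t from gcc/inchash.h; surrounding code of
> > iterative_hash_template_arg elided):
> >
> >   /* Before: iterative_hash_object hashes sizeof (code) raw bytes
> >      through the generic iterative_hash, which is comparatively slow
> >      for a small scalar key.  */
> >   val = iterative_hash_object (code, val);
> >
> >   /* After: mix the tree code in directly as a single hashval_t.  */
> >   val = iterative_hash_hashval_t (code, val);
> >
> >   /* And likewise for the IDENTIFIER_NODE case:  */
> >   case IDENTIFIER_NODE:
> >     return iterative_hash_hashval_t (IDENTIFIER_HASH_VALUE (arg), val);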
>
> Fixing the above results in the following; I'll test & submit a patch.
>
> Samples: 283K of event 'cycles:Pu', Event count (approx.): 318742588396
> Overhead       Samples  Command  Shared Object  Symbol
>   13.92%         39577  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
>   10.73%         29883  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
>   10.11%         28811  cc1plus  cc1plus        [.] iterative_hash
>    5.33%         15254  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
>
> -fmem-report shows
>
> cp/pt.cc:4392 (expand_template_argument_pack)  0 :  0.0%  10101M: 43.0%  0 :  0.0%  2122M: 35.5%  90k
>
> there's both a 20% overhead due to GC allocation granularity and the
> issue that we do not collect those vectors.  There are at least some
> places (uses of the function) where the vector doesn't escape and could
> be ggc_freed (like in placeholder_type_constraint_dependent_p), but the
> API would need to indicate whether it performed any expansion.
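>
> A hypothetical sketch of such an API change (the out-parameter and the
> caller shape are assumptions, not existing code):
>
>   /* Let expand_template_argument_pack report whether it allocated a
>      fresh TREE_VEC, so callers whose vector does not escape can free
>      it eagerly.  */
>   static tree
>   expand_template_argument_pack (tree args, bool *expanded_p)
>   {
>     tree result = args;
>     /* ... existing expansion logic producing RESULT ... */
>     *expanded_p = (result != args);
>     return result;
>   }
>
>   /* A caller like placeholder_type_constraint_dependent_p could then: */
>   bool expanded_p = false;
>   tree targs = expand_template_argument_pack (args, &expanded_p);
>   /* ... inspect targs locally; it never escapes ... */
>   if (expanded_p)
>     ggc_free (targs);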
>
> The other big offenders are
>
> cp/pt.cc:9001 (coerce_template_parameter_pack)  2161M: 41.1%  7940M: 33.8%  0 :  0.0%  2122M: 35.5%  90k
> cp/pt.cc:24388 (unify_pack_expansion)           2551M: 48.5%  2800M: 11.9%  0 :  0.0%  1130M: 18.9%  58k
>
> also make_tree_vec cases related to pack expansion.  I'm not sure we
> need to put those vectors into GC memory throughout all this; using
> heap vectors with appropriate lifetime (smart pointers?) would possibly
> be a better solution.  I'll leave that to frontend folks, but you can
> also try looking.  For example, the 2nd above is
>
>       TREE_TYPE (packs) = make_tree_vec (len - start);
>
> maybe there is already a vec in TREE_TYPE we could ggc_free.  Maybe
> it's possible to "share" packs?
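>
> A minimal sketch of that ggc_free idea (assuming the previous TREE_TYPE
> vector, if any, is not referenced from anywhere else; that would need
> verifying):
>
>   /* Return the previous expansion vector to the GC pool before
>      overwriting it.  Only safe if nothing else still points at it.  */
>   if (TREE_TYPE (packs) && TREE_CODE (TREE_TYPE (packs)) == TREE_VEC)
>     ggc_free (TREE_TYPE (packs));
>   TREE_TYPE (packs) = make_tree_vec (len - start);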

It looks like, for example, the "outermost" template pack expansion that
unify_bound_ttp_args does might be performed for each parm/arg pair we
match?  The expansion vector also looks dead there.
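
A hypothetical sketch of freeing that dead vector (names and the call
shape are assumptions, not the actual unify_bound_ttp_args code):

  /* If the expanded vector is only consulted during unification and
     never stored, it can be returned to the GC pool right away.  */
  tree expanded = expand_template_argument_pack (argvec);
  int result = unify (tparms, targs, parmvec, expanded, strict, explain_p);
  if (expanded != argvec)
    ggc_free (expanded);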

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 04f0a1d5fff..0ad15675c92 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -9442,6 +9460,9 @@ coerce_template_parms (tree parms,
     SET_NON_DEFAULT_TEMPLATE_ARGS_COUNT (new_inner_args,
                                         TREE_VEC_LENGTH (new_inner_args));

+  if ((return_full_args ? new_args != inner_args : new_inner_args != inner_args)
+      && inner_args != orig_inner_args)
+    ggc_free (inner_args);
   return return_full_args ? new_args : new_inner_args;
 }

makes an 800MB difference in peak memory use (but is otherwise untested).  I've
tried plugging the holes in a similar way elsewhere, but that didn't make a
difference for peak memory use.  Even callgrind isn't of much help in
discovering the ultimate callers when expand_template_argument_pack allocates
memory.  User-level tracing tools might help, but I'm not familiar with those.

Richard.

> Richard.
>
> > This won't improve memory use, of course; making "leaner" template
> > instantiations likely would help
> > (maybe somehow allow on-demand copying of sub-structures?).
> >
> > Richard.
> >
> >
> >
> > >
> > > Richard
> > >
> > > > Thanks,
> > > >  Matthias
> > > >
> > > > --
> > > > ──────────────────────────────────────────────────────────────────────────
> > > > Dr. Matthias Kretz                           https://mattkretz.github.io
> > > > GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > > > std::simd
> > > > ──────────────────────────────────────────────────────────────────────────
