https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264
--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 22 May 2020, freddie at witherden dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264
>
> --- Comment #6 from Freddie Witherden <freddie at witherden dot org> ---
> (In reply to Richard Biener from comment #3)
> > So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> > compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> > -O3 is the same as -O2.
> >
> > Maybe instead of [[gnu::flatten]] you want to bump --param
> > inline-unit-growth or --param large-function-growth more moderately in
> > case you can measure an effect on runtime.
> >
> > Note multiple [[gnu::flatten]] can really exponentially grow program size
> > since it is not apparent which functions might be used from another
> > translation unit until you can use -fwhole-program (single CU program)
> > or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> > growth - sth we might want to fix).  Placing things not used from outside
> > in anonymous namespaces might help.
>
> The [[gnu::flatten]] was added to get GCC's performance in the case of T =
> double on a par with Clang's.  (We don't care about performance with T =
> bfloat as it is just used as a final polishing pass.)  I can understand why
> GCC does not want to inline it in the case of T = bfloat which is a complex
> type, but for T = double the function is basically just a sequence of mov's
> to populate an array.
>
> As the function is of the form
>
>     for (int i = 0; i < N; i++) // N = template arg
>         for (int j = 0; j < p[N]; j++) // runtime trip count
>             foo(i, ...); // static polymorphism
>
> with foo being a large switch-case on its first argument the expectation was
> for the compiler to inline foo, unroll the outer loop, and then prune the dead
> cases such that we have something similar to
>
>     for (int j = 0; j < p[0]; j++)
>         foo(0, ...); // inline i = 0 case
>     for (int j = 0; j < p[1]; j++)
>         foo(1, ...); // inline i = 1 case
>     // ...

Ah, interesting.  This kind of static polymorphism should be handled
by IPA-CP already but it's of course possible we're confused about
a detail in this very testcase.  Honza?

Instead of [[gnu::flatten]] you could use the __attribute__((always_inline))
attribute on the foo function definition if you didn't simplify the outline
above too much to make that infeasible.

IIRC we do not have sth like

    [[gnu::inline]] foo(i, ...);

to force inlining of a specific call, nor

    [[gnu::noinline]] foo(i, ...);

both of which seem useful.  Not sure if the C++ syntax would support
such placement of an attribute of course.
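
For illustration only, a minimal sketch of the always_inline suggestion applied to a loop nest of the shape outlined in the quoted comment. The body of foo, the driver name run, and the contents of p are hypothetical stand-ins, not taken from the testcase. The inner trip count is written as p[i], which is an assumption based on the p[0] / p[1] expansion in the quote. The attribute only asks for foo to be inlined at each call; whether GCC then unrolls the outer loop and prunes the dead switch cases still depends on the optimization level and the usual heuristics.

```cpp
// Hypothetical stand-in for the reporter's foo: a switch on its first
// argument.  [[gnu::always_inline]] is the C++11 spelling of
// __attribute__((always_inline)); the function is also declared inline.
template <typename T>
[[gnu::always_inline]] inline T foo(int i, T x)
{
    switch (i) {
    case 0:  return x + T(1);
    case 1:  return x * T(2);
    case 2:  return x - T(3);
    default: return x;
    }
}

// Hypothetical driver with the loop structure from the quoted outline.
template <int N, typename T>
T run(const int *p, T x)
{
    for (int i = 0; i < N; i++)          // N = template arg
        for (int j = 0; j < p[i]; j++)   // runtime trip count (p[i] assumed)
            x = foo(i, x);               // foo is inlined at this call
    return x;
}
```

With p = {2, 1, 1}, run<3>(p, 1.0) applies case 0 twice, then case 1 once, then case 2 once. Inspecting the output of -fdump-tree-optimized is one way to check whether the per-i specialisation actually happened.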