https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #2 from Chris Elrod <elrodc at gmail dot com> ---
https://godbolt.org/z/3648aMTz8

Perhaps a simpler diff is that you can reproduce by uncommenting the pragma, but codegen becomes good with it.

template<typename T, ptrdiff_t N>
constexpr auto operator*(OuterDualUA2<T,N> a, OuterDualUA2<T,N> b)->OuterDualUA2<T,N>{
  //return {a.value*b.value,a.value*b.p[0]+b.value*a.p[0],a.value*b.p[1]+b.value*a.p[1]};
  OuterDualUA2<T,N> c;
  c.value = a.value*b.value;
#pragma GCC unroll 16
  for (ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value*b.p[i] + b.value*a.p[i];
  //c.p[0] = a.value*b.p[0] + b.value*a.p[0];
  //c.p[1] = a.value*b.p[1] + b.value*a.p[1];
  return c;
}

It's not great to have to add pragmas everywhere to my actual codebase. I thought I hit the important cases, but my non-minimal example still gets unnecessary register splits and stack spills, so maybe I missed places, or perhaps there's another issue.

Given that GCC unrolls the above code even without the pragma, it seems like a definite bug that the pragma is needed for the resulting code generation to actually be good.

Not knowing the compiler pipeline, my naive guess is that the pragma causes earlier unrolling than whatever optimization pass does it sans pragma, and that some important analysis/optimization gets run between those two times.
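
For anyone who wants to compare the three formulations side by side (plain loop, loop with the pragma, manually unrolled), here is a minimal sketch. The struct below is a hypothetical scalar stand-in, not the OuterDualUA2 from the Godbolt link, and the names mul_loop, mul_pragma, mul_manual and variants_agree are made up for illustration. It only shows that the variants are semantically identical, so any difference in the emitted assembly would come from how the compiler optimizes them.

```cpp
#include <cstddef>

// Hypothetical stand-in for the type in the Godbolt example: plain scalar
// members, with N carried along but unused. The real type is assumed to differ.
template <typename T, std::ptrdiff_t N>
struct OuterDualUA2 {
  T value;
  T p[2];
};

// Variant 1: plain loop, leaving any unrolling to the optimizer.
template <typename T, std::ptrdiff_t N>
OuterDualUA2<T, N> mul_loop(OuterDualUA2<T, N> a, OuterDualUA2<T, N> b) {
  OuterDualUA2<T, N> c;
  c.value = a.value * b.value;
  for (std::ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value * b.p[i] + b.value * a.p[i];
  return c;
}

// Variant 2: same loop with an explicit unroll request.
template <typename T, std::ptrdiff_t N>
OuterDualUA2<T, N> mul_pragma(OuterDualUA2<T, N> a, OuterDualUA2<T, N> b) {
  OuterDualUA2<T, N> c;
  c.value = a.value * b.value;
#pragma GCC unroll 16
  for (std::ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value * b.p[i] + b.value * a.p[i];
  return c;
}

// Variant 3: unrolled by hand.
template <typename T, std::ptrdiff_t N>
OuterDualUA2<T, N> mul_manual(OuterDualUA2<T, N> a, OuterDualUA2<T, N> b) {
  OuterDualUA2<T, N> c;
  c.value = a.value * b.value;
  c.p[0] = a.value * b.p[0] + b.value * a.p[0];
  c.p[1] = a.value * b.p[1] + b.value * a.p[1];
  return c;
}

// Returns true when all three variants produce the same result
// for one pair of sample inputs.
inline bool variants_agree() {
  OuterDualUA2<double, 2> a{2.0, {3.0, 5.0}};
  OuterDualUA2<double, 2> b{7.0, {11.0, 13.0}};
  auto x = mul_loop(a, b);
  auto y = mul_pragma(a, b);
  auto z = mul_manual(a, b);
  return x.value == y.value && y.value == z.value &&
         x.p[0] == y.p[0] && y.p[0] == z.p[0] &&
         x.p[1] == y.p[1] && y.p[1] == z.p[1];
}
```

Compiling each variant separately with the same flags and diffing the assembly should isolate the effect of the pragma. With scalar members this stand-in may well not show the register splits and stack spills; the actual reproducer is the Godbolt link.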