https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > The issue isn't unrolling but invariant motion.  We unroll the
> > > innermost loop, vectorize the middle loop and then unroll that as
> > > well.  That leaves us with 64 invariant loads from b[] in the outer
> > > loop which I think RTL opts never "schedule back", even with
> > > -fsched-pressure.
> >
> > Aside from the loads, by fully unrolling the inner loop we need 16
> > unique registers live for the destination every iteration.  That's
> > already half the SIMD register file on AArch64 gone, not counting the
> > invariant loads.
>
> Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Oh, I was basing that on the output of the existing code using a lower loop
count, e.g. template void f<16, 16, 4>.

But yes, those options avoid the spills, though of course without the passes
you leave all the loads inside the loop iteration.  I was hoping we could get
closer to https://godbolt.org/z/7c5YfxE5j, which is a lot better code, i.e.
with the invariants moved inside the outer loop.  But yes, I do understand
this may be hard to do automatically.

> > The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE
> > unrolling and does it at RTL instead.
>
> ... because on GIMPLE we can only fully unroll or not.

But is this an intrinsic limitation or just because at the moment we only
unroll for SLP?

> > At the moment a way for the user to locally control the unroll amount
> > would already be a good step.  I know there's the param, but that's
> > global and typically the unroll factor would depend on the GEMM kernel.
>
> As said it should already work to the extent that on GIMPLE we do not
> perform classical loop unrolling.

Right, but the RTL unroller produces horrible code, e.g. the addressing
modes are pretty bad.
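For readers following along, here is a minimal sketch of the kind of GEMM
loop nest being discussed.  The exact PR testcase is not reproduced in this
comment; the f<16, 16, 4> shape and the pragma placement come from the
thread above, while the loop body itself is an assumption about the typical
kernel structure (fully-unrollable inner k-loop, vectorizable middle j-loop,
invariant b[] loads hoisted out of the outer i-loop):

```cpp
// Hypothetical GEMM-style kernel, not the verbatim PR109587 testcase.
template <int M, int N, int K>
void f(float *__restrict c, const float *a, const float *b)
{
    for (int i = 0; i < M; ++i)          // outer loop: b[] loads are invariant here
        for (int j = 0; j < N; ++j)      // middle loop: vectorized
        {
            float sum = 0.0f;
            // As noted above, this pragma stops GIMPLE unrolling and
            // defers the unroll to RTL instead.
            #pragma GCC unroll 8
            for (int k = 0; k < K; ++k)  // inner loop: fully unrolled on GIMPLE
                sum += a[i * K + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

// Instantiation matching the lower loop count mentioned in the comment.
template void f<16, 16, 4>(float *, const float *, const float *);
```

Compiling this at -O3 for AArch64 with and without -fno-tree-pre
-fno-tree-loop-im -fno-predictive-commoning is one way to observe the
spill behavior described above.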