https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > The issue isn't unrolling but invariant motion.  We unroll the
> > > innermost loop, vectorize the middle loop and then unroll that as
> > > well.  That leaves us with 64 invariant loads from b[] in the outer
> > > loop which I think RTL opts never "schedule back", even with
> > > -fsched-pressure.
> >
> > Aside from the loads, by fully unrolling the inner loop we need 16
> > unique registers live for the destination every iteration.  That's
> > already half the SIMD register file on AArch64 gone, not counting the
> > invariant loads.
>
> Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Oh, I was basing that on the output of the existing code using a lower loop
count, e.g. template void f<16, 16, 4>.

But yes, those options avoid the spills, though of course without the passes
you leave all the loads inside the loop iteration.  I was hoping we could get
closer to https://godbolt.org/z/7c5YfxE5j, which is a lot better code, i.e.
with the invariants moved inside the outer loop.  But yes, I do understand
this may be hard to do automatically.

> > The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE
> > unrolling and does it at RTL instead.
>
> ... because on GIMPLE we can only fully unroll or not.

But is this an intrinsic limitation or just because at the moment we only
unroll for SLP?

> > At the moment a way for the user to locally control the unroll amount
> > would already be a good step.  I know there's the param, but that's
> > global and typically the unroll factor would depend on the GEMM kernel.
>
> As said it should already work to the extent that on GIMPLE we do not
> perform classical loop unrolling.

Right, but the RTL unroller produces horrible code, e.g. the addressing
modes are pretty bad.
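For readers following along, here is a minimal sketch of the kind of GEMM
loop nest being discussed.  The exact PR testcase is not reproduced in this
comment; the f<16, 16, 4> shape and the pragma placement come from the
thread above, while the loop body itself is an assumption about the typical
kernel structure (fully-unrollable inner k-loop, vectorizable middle j-loop,
invariant b[] loads hoisted out of the outer i-loop):

```cpp
// Hypothetical GEMM-style kernel, not the verbatim PR109587 testcase.
template <int M, int N, int K>
void f(float *__restrict c, const float *a, const float *b)
{
    for (int i = 0; i < M; ++i)          // outer loop: b[] loads are invariant here
        for (int j = 0; j < N; ++j)      // middle loop: vectorized
        {
            float sum = 0.0f;
            // As noted above, this pragma stops GIMPLE unrolling and
            // defers the unroll to RTL instead.
            #pragma GCC unroll 8
            for (int k = 0; k < K; ++k)  // inner loop: fully unrolled on GIMPLE
                sum += a[i * K + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

// Instantiation matching the lower loop count mentioned in the comment.
template void f<16, 16, 4>(float *, const float *, const float *);
```

Compiling this at -O3 for AArch64 with and without -fno-tree-pre
-fno-tree-loop-im -fno-predictive-commoning is one way to observe the
spill behavior described above.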