https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587
Bug ID: 109587
Summary: Deeply nested loop unrolling overwhelms register allocator
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*

On matrix multiplication routines such as

#include <arm_neon.h>

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b,
       float32_t *c)
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            for (int k = 0; k < K; ++k) {
                c[i*M + j] += a[i*K + k] * b[k*M + j];
            }
        }
    }
}

template void f<16, 16, 16>(const float32_t *__restrict a,
                            const float32_t *__restrict b, float32_t *c);

the loop unroller fully unrolls the inner loop because the iteration count (16) is below the threshold. But, especially because each iteration is a read-modify-write on c[i*M + j], we don't have enough registers to hold all the live values and we spill profusely. The loop can be split in two to relieve the pressure, but this requires manual work on each GEMM kernel. Perhaps the loop unroller can use a better heuristic here?

It also looks like adding a pragma:

template<int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b,
       float32_t *c)
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            #pragma GCC unroll 8
            for (int k = 0; k < K; ++k) {
                c[i*M + j] += a[i*K + k] * b[k*M + j];
            }
        }
    }
}

helps, but because the pragma blocks cunrolli and the unrolling instead happens in RTL, we lose scheduling and end up with less efficient code. So can we do better here with early unrolling?