This is a missed-optimization bug. The following reduced test case illustrates the problem. It doesn't do anything useful, but just compile it with
mipsisa32r2-elfoabi-gcc -S -mtune=24kc -G4096 -O2 example4.c #define N 511 #define M 9 long A[N]; long B[N]; long AA[N]; long BB[N]; long tA; long tB; void foo (unsigned iterations) { unsigned loop_cnt; static long *aLow; static long *bLow; static long *aHi; static long *bHi; static long n1; static long n2; static long l; static long i; static long j; static long k; for (loop_cnt = 0; loop_cnt < iterations; loop_cnt ++) { /* This is the loop we're interested in. */ for (i = 0; i < N; i ++) { AA[i] = A[i]; BB[i] = B[i]; } /* The rest of this stuff is just here to add some context to the outer loop. */ for (k = 1; k <= M; k++) { n1 = 1 << k; n2 = n1 >> 1; for (j = 0; j < n2; j++) { for (i = j; i < N; i += n1) { l = i + n2; aLow = &A[l]; bLow = &B[l]; aHi = &A[i]; bHi = &B[i]; A[l] = *aHi - tA; B[l] = *bHi - tB; A[i] += tA; B[i] += tB; } } } } } The -G option forces the global variables to use GP-relative addressing, which involves an extra addition. Thus the first nested loop should be optimized as if it were written: { long *t1 = AA; long *t2 = A; long *t3 = BB; long *t4 = B; for (i = 0; i < N; i++) { *t1 = *t2; *t3 = *t4; t1++; t2++; t3++; t4++; } } In 4.3.1, though, it is producing code with GP-relative addressing inside the loop, so that the loop body has 9 adds instead of 5. Mainline head does a better job and at least pulls out the references to A and B (which also appear in the second nested loop). PRE is working fine, and pulling the invariant GP-relative addressing of all four variables all the way out of the outer loop. However, this means the lifetimes of the corresponding pseudo-registers span the entire outer loop, and the register allocator is (correctly) giving priority to the more localized pseudos in the more deeply nested loops that follow. Having failed to allocate a hardware register to span the entire lifetime of the pseudos, reload stupidly re-inserts the previously hoisted GP-relative address computation at the point of reference, inside the first nested loop. I think what is needed is more smarts to make it understand that it should try allocating a register just around the inner loop if it can't get one for the entire outer loop, before giving up. Any thoughts on where the best place for this to happen would be? Can this be done entirely within the register allocator or do we need another pass to identify places where we can potentially shorten the lifetimes of pseudos? While this example is specific to MIPS with the GP-relative addressing, I can see that the underlying PRE/register allocation conflict is a more general problem that probably crops up in lots of other code with similar structure of outer-loop-containing-multiple-inner-loops. -Sandra -- Summary: bad interaction between PRE/register allocation/reload Product: gcc Version: 4.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: sandra at codesourcery dot com GCC build triplet: i686-pc-linux-gnu GCC host triplet: i686-pc-linux-gnu GCC target triplet: mipsisa32r2-elfoabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36223