This is a missed-optimization bug.  The following reduced test case illustrates
the problem.  It doesn't do anything useful, but just compile it with

mipsisa32r2-elfoabi-gcc -S -mtune=24kc -G4096 -O2 example4.c


#define N 511
#define M 9

long A[N];
long B[N];

long AA[N];
long BB[N];

long tA;
long tB;

void foo (unsigned iterations)
{
  unsigned loop_cnt;
  static long *aLow;
  static long *bLow;
  static long *aHi;
  static long *bHi;
  static long n1;
  static long n2;
  static long l;
  static long i;
  static long j;
  static long k;

  for (loop_cnt = 0; loop_cnt < iterations; loop_cnt ++) {

    /* This is the loop we're interested in.  */
    for (i = 0; i < N; i ++) {
      AA[i] = A[i];
      BB[i] = B[i];
    }

    /* The rest of this stuff is just here to add some context to the
       outer loop.  */
    for (k = 1; k <= M; k++) {
      n1 = 1 << k;
      n2 = n1 >> 1;
      for (j = 0; j < n2; j++) {
        for (i = j; i < N; i += n1) {
          l = i + n2;
          aLow = &A[l];
          bLow = &B[l];
          aHi = &A[i];
          bHi = &B[i];

          A[l] = *aHi - tA;
          B[l] = *bHi - tB;
          A[i] += tA;
          B[i] += tB;
        }
      }
    }

  }
}


The -G option forces the global variables to use GP-relative addressing, which
involves an extra addition.  Thus the first nested loop should be optimized as
if it were written:

{
    long *t1 = AA;
    long *t2 = A;
    long *t3 = BB;
    long *t4 = B;

    for (i = 0; i < N; i++)  {
      *t1 = *t2;
      *t3 = *t4;
      t1++; t2++; t3++; t4++;
    }
  }

In 4.3.1, though, it is producing code with GP-relative addressing inside the
loop, so that the loop body has 9 adds instead of 5.  Mainline head does a
better job and at least pulls out the references to A and B (which also appear
in the second nested loop).

PRE is working fine, and pulling the invariant GP-relative addressing of all
four variables all the way out of the outer loop.  However, this means the
lifetimes of the corresponding pseudo-registers span the entire outer loop, and
the register allocator is (correctly) giving priority to the more localized
pseudos in the more deeply nested loops that follow.  Having failed to allocate
a hardware register to span the entire lifetime of the pseudos, reload stupidly
re-inserts the previously hoisted GP-relative address computation at the point
of reference, inside the first nested loop.

I think what is needed is more smarts to make it understand that it should try
allocating a register just around the inner loop if it can't get one for the
entire outer loop, before giving up.  Any thoughts on where the best place for
this to happen would be?  Can this be done entirely within the register
allocator or do we need another pass to identify places where we can
potentially shorten the lifetimes of pseudos?  

While this example is specific to MIPS with the GP-relative addressing, I can
see that the underlying PRE/register allocation conflict is a more general
problem that probably crops up in lots of other code with similar structure of
outer-loop-containing-multiple-inner-loops.

-Sandra


-- 
           Summary: bad interaction between PRE/register allocation/reload
           Product: gcc
           Version: 4.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: sandra at codesourcery dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: mipsisa32r2-elfoabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36223

Reply via email to