https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767

            Bug ID: 88767
           Summary: 'unroll and jam' not optimizing some loops
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: helijia at gcc dot gnu.org
  Target Milestone: ---

The test source is as follows:
__attribute__((noinline)) void calculate(const double* __restrict__ A,
    const double* __restrict__ B, double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n++ ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; } // loop 2

    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main()
{
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 1000000; r++) {
    calculate(&A[0][0],&B[0][0], &C[0][0]);
  }

  return 0;
}

First, I compiled the test case with the following command:

g++ unroll_jam_bug.cpp -O3 -funroll-loops -floop-unroll-and-jam \
    -o unroll_jam_bug -fdump-tree-unrolljam-details

In the generated dump file, unroll_jam_bug.cpp.143t.unrolljam, I found that
no unroll-and-jam optimization was applied to the loops in the calculate
function.

Second, I added -fdump-tree-all to the command line. The dumps show that the
inner loops (loops 3 and 4) are completely unrolled, because the
pass_data_complete_unrolli pass considers the innermost loop small. Once the
inner loops are fully unrolled, the enclosing loop becomes large. When the
pass_loop_jam pass later considers unrolling it, it checks whether
unroll_factor * (number of loop instructions) > 200. If so, the optimization
is abandoned; otherwise, it proceeds.
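
The size check described above can be sketched as follows. This only
illustrates the arithmetic and is not GCC's code: the function name and the
instruction counts in the note below are hypothetical, and only the limit of
200 comes from the analysis above.

```cpp
// Hypothetical helper mirroring the check described above: unroll and jam is
// abandoned when unroll_factor * loop_insns exceeds the limit (200 here).
bool jam_abandoned(unsigned unroll_factor, unsigned loop_insns,
                   unsigned limit = 200) {
  return unroll_factor * loop_insns > limit;
}
```

For example, with an unroll factor of 2, an assumed loop body of 12
instructions passes the check (24 is not greater than 200), while an assumed
120 instructions, as might result once the inner loop has been completely
unrolled, does not (240 > 200).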

Based on this analysis, I tried disabling the complete-unrolling pass
(cunrolli) with the following command:

g++ unroll_jam_bug.cpp -O3 -mcpu=power8 -fdisable-tree-cunrolli \
    -floop-unroll-and-jam -o unroll_jam_bug -fdump-tree-unrolljam-details

With this command, the unroll-and-jam optimization is performed, but there
still seems to be room for improvement.

Original code:
for ( l_n = 0; l_n < 9; l_n++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }

    for ( l_k = 0; l_k < 17; l_k++ ) {
      for ( l_m = 0; l_m < 10; l_m++ ) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
After unroll and jam pass:
for ( l_n = 0; l_n < 9; l_n++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }

    for ( l_k = 0; l_k < 17; l_k += 2 ) {
      for ( l_m = 0; l_m < 10; l_m++ ) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m] += A[(l_k*20 + 20)+l_m] * B[(l_n*20)+l_k + 1];
      }
    }
  }
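
To check what a factor-2 unroll and jam of the l_k loop has to preserve, here
is a small self-contained sketch that compares the original loop nest with a
hand-written transformed one. Since the l_k loop runs 17 times, an odd count,
the sketch handles the leftover l_k = 16 iteration in a separate epilogue
loop. That epilogue, the function names, and the test data are assumptions
made for illustration and are not taken from the GCC dump.

```cpp
#include <vector>

// Reference kernel: same loop nest as calculate() in the test case.
void calculate_ref(const double* A, const double* B, double* C) {
  for (unsigned n = 0; n < 9; n++) {
    for (unsigned m = 0; m < 10; m++) C[n*10 + m] = 0.0;
    for (unsigned k = 0; k < 17; k++)
      for (unsigned m = 0; m < 10; m++)
        C[n*10 + m] += A[k*20 + m] * B[n*20 + k];
  }
}

// Hand-written unroll and jam of the k loop by a factor of 2.
void calculate_jammed(const double* A, const double* B, double* C) {
  for (unsigned n = 0; n < 9; n++) {
    for (unsigned m = 0; m < 10; m++) C[n*10 + m] = 0.0;
    unsigned k = 0;
    // Main loop: two k iterations fused into one pass over m.
    for (; k + 1 < 17; k += 2)
      for (unsigned m = 0; m < 10; m++) {
        C[n*10 + m] += A[k*20 + m] * B[n*20 + k];
        C[n*10 + m] += A[(k + 1)*20 + m] * B[n*20 + k + 1];
      }
    // Epilogue (assumed): the remaining k = 16 iteration.
    for (; k < 17; k++)
      for (unsigned m = 0; m < 10; m++)
        C[n*10 + m] += A[k*20 + m] * B[n*20 + k];
  }
}

// Runs one kernel on small integer-valued inputs, so all arithmetic is exact.
std::vector<double> run_kernel(bool jammed) {
  const unsigned N = 36 * 36;
  std::vector<double> A(N), B(N), C(N, 3.0);
  for (unsigned i = 0; i < N; i++) { A[i] = i % 7; B[i] = i % 5; }
  if (jammed) calculate_jammed(A.data(), B.data(), C.data());
  else calculate_ref(A.data(), B.data(), C.data());
  return C;
}
```

Both kernels accumulate into each element of C in the same l_k order, so
run_kernel(false) == run_kernel(true) is expected to hold exactly.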
