[Bug tree-optimization/88767] 'unroll and jam' not optimizing some loops

helijia at gcc dot gnu.org Wed, 09 Jan 2019 19:14:34 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767


Li Jia He <helijia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #9 from Li Jia He <helijia at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> What's the room for improvement?  Why's unrolling the innermost loop not
> profitable?

Hi Richard, I want to achieve the effect of the following code:
__attribute__((noinline)) void calculate(const double* __restrict__ A,
                                         const double* __restrict__ B,
                                         double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n += 3 ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) { // loop 2
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }

    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

#define SIZE 36
double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main()
{
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 1000000; r++) {
    calculate(&A[0][0],&B[0][0], &C[0][0]);
  }

  return 0;
}

In the original code, the cunrolli pass completely unrolls loop 2 and loop 4,
so unroll-and-jam never gets a chance to apply. In my tests, performance ranks
as follows: the expected code above > original with cunrolli enabled > original
with cunrolli disabled.
Sorry for the late reply.
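
For reference, the relationship between the two forms can be sketched as below. The rolled loop nest in calculate_rolled is an assumption: it is reconstructed by undoing the unroll-by-3 of loop 1 in the code posted above, and is not quoted from the bug report. calculate_jammed repeats the posted code. The helper jam_matches_rolled and the array lengths are hypothetical names chosen for this sketch; it only checks that both forms compute the same C, it does not measure performance.

```c
#include <string.h>

/* Lengths chosen to cover the largest index each loop nest touches. */
enum { A_LEN = 17 * 20, B_LEN = 9 * 20, C_LEN = 9 * 10 };

/* ASSUMPTION: rolled form, reconstructed by undoing the unroll-by-3
   of loop 1 in the posted code; not taken from the bug report. */
static void calculate_rolled(const double* __restrict__ A,
                             const double* __restrict__ B,
                             double* __restrict__ C) {
  unsigned int l_m, l_n, l_k;

  for ( l_n = 0; l_n < 9; l_n++ ) {           /* loop 1 */
    for ( l_m = 0; l_m < 10; l_m++ ) {        /* loop 2 */
      C[(l_n*10)+l_m] = 0.0;
    }
    for ( l_k = 0; l_k < 17; l_k++ ) {        /* loop 3 */
      for ( l_m = 0; l_m < 10; l_m++ ) {      /* loop 4 */
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
}

/* Loop 1 unrolled by 3 and the copies jammed into loops 2 and 4,
   as in the posted code. */
static void calculate_jammed(const double* __restrict__ A,
                             const double* __restrict__ B,
                             double* __restrict__ C) {
  unsigned int l_m, l_n, l_k;

  for ( l_n = 0; l_n < 9; l_n += 3 ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }
    for ( l_k = 0; l_k < 17; l_k++ ) {
      for ( l_m = 0; l_m < 10; l_m++ ) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

/* Hypothetical helper: returns 1 if both forms produce identical output.
   Inputs are 1.0 and 2.0 (as in main above), so every sum is exact and a
   bytewise comparison is safe. */
int jam_matches_rolled(void) {
  static double A[A_LEN], B[B_LEN], C1[C_LEN], C2[C_LEN];
  int i;

  for (i = 0; i < A_LEN; i++) A[i] = 1.0;
  for (i = 0; i < B_LEN; i++) B[i] = 2.0;
  for (i = 0; i < C_LEN; i++) { C1[i] = 3.0; C2[i] = 3.0; }

  calculate_rolled(A, B, C1);
  calculate_jammed(A, B, C2);

  return memcmp(C1, C2, sizeof C1) == 0;
}
```

Calling jam_matches_rolled() from a small main is enough to exercise it. One way to turn the pass off in GCC is -fdisable-tree-cunrolli; whether that is the switch used for the measurements above is not stated.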
