https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767
Li Jia He <helijia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #9 from Li Jia He <helijia at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> What's the room for improvement?  Why's unrolling the innermost loop not
> profitable?

Hi Richard,

I want to achieve the effect of the following code:

__attribute__((noinline))
void calculate(const double* __restrict__ A, const double* __restrict__ B,
               double* __restrict__ C)
{
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n += 3 ) {          // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) {          // loop 2
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }
    for ( l_k = 0; l_k < 17; l_k++ ) {          // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) {        // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

#define SIZE 36

double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main()
{
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 1000000; r++) {
    calculate(&A[0][0],&B[0][0], &C[0][0]);
  }

  return 0;
}

In the original code, the cunrolli pass completely unrolls loop 2 and
loop 4, so unroll-and-jam never gets a chance to apply.  In my tests, the
performance ordering is: the expected code above > cunrolli enabled >
cunrolli disabled.

Sorry for not responding in time.
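
For readers who want to check what the jammed form computes, here is a minimal sketch that compares it against a rolled version. The rolled kernel is an assumption: it was reconstructed by undoing the three-way jam over l_n, and is not taken from the original test case in this report. The names calculate_rolled, calculate_jammed, max_abs_diff, and LEN are hypothetical, and the alignment hints and noinline attribute from the posted code are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define LEN (36 * 36) /* same element count as SIZE*SIZE in the posted test */

/* Assumed rolled form: each l_n iteration handles one column of C. */
static void calculate_rolled(const double* __restrict__ A,
                             const double* __restrict__ B,
                             double* __restrict__ C)
{
  for (unsigned int l_n = 0; l_n < 9; l_n++) {
    for (unsigned int l_m = 0; l_m < 10; l_m++)
      C[(l_n*10)+l_m] = 0.0;
    for (unsigned int l_k = 0; l_k < 17; l_k++)
      for (unsigned int l_m = 0; l_m < 10; l_m++)
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
  }
}

/* Jammed form from the comment: l_n advances by 3 and the bodies of three
   consecutive l_n iterations are fused into loop 2 and loop 4. */
static void calculate_jammed(const double* __restrict__ A,
                             const double* __restrict__ B,
                             double* __restrict__ C)
{
  for (unsigned int l_n = 0; l_n < 9; l_n += 3) {
    for (unsigned int l_m = 0; l_m < 10; l_m++) {
      C[(l_n*10)+l_m] = 0.0;
      C[(l_n*10)+l_m+10] = 0.0;
      C[(l_n*10)+l_m+20] = 0.0;
    }
    for (unsigned int l_k = 0; l_k < 17; l_k++) {
      for (unsigned int l_m = 0; l_m < 10; l_m++) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
        C[(l_n*10)+l_m+10] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+20];
        C[(l_n*10)+l_m+20] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k+40];
      }
    }
  }
}

/* Runs both kernels on identical inputs and returns the largest absolute
   difference between their outputs. */
double max_abs_diff(void)
{
  static double A[LEN], B[LEN], C_rolled[LEN], C_jammed[LEN];

  /* Largest index either kernel touches is A[(16*20)+9]. */
  assert(LEN > 16*20 + 9);

  /* Small integer-valued inputs keep every product and sum exact. */
  for (size_t i = 0; i < LEN; i++) {
    A[i] = (double)(i % 7) + 1.0;
    B[i] = (double)(i % 5) - 2.0;
    C_rolled[i] = 3.0;
    C_jammed[i] = 3.0;
  }

  calculate_rolled(A, B, C_rolled);
  calculate_jammed(A, B, C_jammed);

  double worst = 0.0;
  for (size_t i = 0; i < LEN; i++) {
    double d = C_rolled[i] - C_jammed[i];
    if (d < 0.0)
      d = -d;
    if (d > worst)
      worst = d;
  }
  return worst;
}
```

Both kernels accumulate each element of C over l_k in the same order, so the jam only changes how the work is grouped, not the arithmetic. That is why the two results are expected to match.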