https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292
Bug ID: 90292
Summary: GCC fails to hoist loop invariant in nested loops
Product: gcc
Version: tree-ssa
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: giuliano.belinassi at usp dot br
Target Milestone: ---

Both GCC 8.3.0 and 9.0.1 fail to hoist the index calculation for the 'i' and 'k'
variables of the following function (matrix multiplication with blocking):

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
                    double *restrict _B)
{
    #define BLOCK_I 128
    #define BLOCK_K 8

    #define A(i, j) _A[n*(i) + (j)]
    #define B(i, j) _B[n*(i) + (j)]
    #define C(i, j) _C[n*(i) + (j)]

    unsigned ii, kk, i, j, k;

    for (ii = 0; ii < n; ii += BLOCK_I)
        for (kk = 0; kk < n; kk += BLOCK_K)
            for (i = ii; i < ii + BLOCK_I; ++i)
                for (k = kk; k < kk + BLOCK_K; ++k)
                    for (j = 0; j < n; ++j)
                        C(i, j) += A(i, k) * B(k, j);
}

which then generates the following GIMPLE code in the deepest nested block (-O2):

<bb 4> [local count: 955630225]:
# ivtmp.3_88 = PHI <_1(6), ivtmp.3_87(4)>
_3 = (long unsigned int) ivtmp.3_88;
_4 = _3 * 8;
_5 = _C_34(D) + _4;
_6 = *_5;
_13 = ivtmp.14_81 + ivtmp.3_88;
_14 = (long unsigned int) _13;
_15 = _14 * 8;
_16 = _B_36(D) + _15;
_17 = *_16;
_18 = _11 * _17;
_19 = _6 + _18;
*_5 = _19;
ivtmp.3_87 = ivtmp.3_88 + 1;
if (ivtmp.25_71 != ivtmp.3_87)
  goto <bb 4>; [89.00%]
else
  goto <bb 5>; [11.00%]

If I modify the function code to do the following:

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
                    double *restrict _B)
{
    #define BLOCK_I 128
    #define BLOCK_K 8

    #define A(i, j) _A[n*(i) + (j)]
    #define B(i, j) _B[n*(i) + (j)]
    #define C(i, j) _C[n*(i) + (j)]

    unsigned ii, kk, i, j, k;

    for (ii = 0; ii < n; ii += BLOCK_I)
        for (kk = 0; kk < n; kk += BLOCK_K)
            for (i = ii; i < ii + BLOCK_I; ++i)
                for (k = kk; k < kk + BLOCK_K; ++k)
                {
                    double a = A(i, k);
                    double* b = &B(k, 0);
                    double* c = &C(i, 0);

                    for (j = 0; j < n; ++j)
                    {
                        *c = a * (*b);
                        c++;
                        b++;
                    }
                }
}

then GCC generates the following deepest block (-O2):

<bb 6> [local count: 955630224]:
# ivtmp.1_47 = PHI <0(5), ivtmp.1_46(6)>
_11 = MEM[base: b_32, index: ivtmp.1_47, step: 8, offset: 0B];
_12 = _11 * a_30;
MEM[base: c_34, index: ivtmp.1_47, step: 8, offset: 0B] = _12;
ivtmp.1_46 = ivtmp.1_47 + 1;
if (_44 != ivtmp.1_47)
  goto <bb 6>; [89.00%]
else
  goto <bb 7>; [11.00%]

and therefore it generates faster final code. With a 2048x2048 matrix at -O2,
this modification gives only about 0.3s of speedup; however, at -O3
-march=native on a Core(TM) i5-8250U it provides a 2x speedup, arguably due to
vectorization.