https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403
            Bug ID: 83403
           Summary: Missed register promotion opportunities in loop
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wschmidt at gcc dot gnu.org
  Target Milestone: ---

Created attachment 42857
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42857&action=edit
Source file

One of our performance folks ran across a large performance opportunity
inside libxsmm.  The attached test shows a loop that would run about 25%
faster if load/store motion could promote some memory expressions to
registers within a loop.

The test contains this loop nest:

  for ( l_n = 0; l_n < 9; l_n++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] = 0.0;
    }
    for ( l_k = 0; l_k < 17; l_k++ ) {
      #pragma simd
      for ( l_m = 0; l_m < 10; l_m++ ) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }

The cunrolli phase unrolls the innermost loop fully, so that we have ten
separate accumulators.  Inside the l_k loop, each of these accumulators
could be loaded once prior to the loop, remain in a register during the
loop, and be stored once upon exit.  GCC is not able to do this.  (A sketch
of the hand-promoted form of this loop follows the GIMPLE dump below.)
Note that A, B, and C are restrict pointers, so there shouldn't be an
aliasing issue.  The ten accumulators are array elements of C (common base
register, different offsets).

It looks like load/store motion is done as part of pass_lim.  I don't know
whether it is intended to be able to handle array elements.  For simpler
cases, it looks like this gets optimized in lim2.  Prior to lim2, the l_k
loop is as follows:

  <bb 4> [local count: 97603132]:
  # l_k_71 = PHI <0(3), l_k_32(4)>
  _51 = *_290;
  _52 = l_k_71 * 20;
  _54 = (long unsigned int) _52;
  _55 = _54 * 8;
  _56 = &A[0][0] + _55;
  _57 = *_56;
  _58 = l_n_70 * 20;
  _59 = _58 + l_k_71;
  _60 = (long unsigned int) _59;
  _61 = _60 * 8;
  _62 = &B[0][0] + _61;
  _63 = *_62;
  _64 = _57 * _63;
  _65 = _51 + _64;
  *_290 = _65;
  _75 = *_299;
  _77 = _52 + 1;
  _78 = (long unsigned int) _77;
  _79 = _78 * 8;
  _80 = &A[0][0] + _79;
  _81 = *_80;
  _88 = _63 * _81;
  _89 = _75 + _88;
  *_299 = _89;
  _99 = *_308;
  _101 = _52 + 2;
  _102 = (long unsigned int) _101;
  _103 = _102 * 8;
  _104 = &A[0][0] + _103;
  _105 = *_104;
  _112 = _63 * _105;
  _113 = _99 + _112;
  *_308 = _113;
  _123 = *_317;
  _125 = _52 + 3;
  _126 = (long unsigned int) _125;
  _127 = _126 * 8;
  _128 = &A[0][0] + _127;
  _129 = *_128;
  _136 = _63 * _129;
  _137 = _123 + _136;
  *_317 = _137;
  _147 = *_326;
  _149 = _52 + 4;
  _150 = (long unsigned int) _149;
  _151 = _150 * 8;
  _152 = &A[0][0] + _151;
  _153 = *_152;
  _160 = _63 * _153;
  _161 = _147 + _160;
  *_326 = _161;
  _171 = *_335;
  _173 = _52 + 5;
  _174 = (long unsigned int) _173;
  _175 = _174 * 8;
  _176 = &A[0][0] + _175;
  _177 = *_176;
  _184 = _63 * _177;
  _185 = _171 + _184;
  *_335 = _185;
  _195 = *_344;
  _197 = _52 + 6;
  _198 = (long unsigned int) _197;
  _199 = _198 * 8;
  _200 = &A[0][0] + _199;
  _201 = *_200;
  _208 = _63 * _201;
  _209 = _195 + _208;
  *_344 = _209;
  _219 = *_353;
  _221 = _52 + 7;
  _222 = (long unsigned int) _221;
  _223 = _222 * 8;
  _224 = &A[0][0] + _223;
  _225 = *_224;
  _232 = _63 * _225;
  _233 = _219 + _232;
  *_353 = _233;
  _243 = *_362;
  _245 = _52 + 8;
  _246 = (long unsigned int) _245;
  _247 = _246 * 8;
  _248 = &A[0][0] + _247;
  _249 = *_248;
  _256 = _63 * _249;
  _257 = _243 + _256;
  *_362 = _257;
  _267 = *_371;
  _269 = _52 + 9;
  _270 = (long unsigned int) _269;
  _271 = _270 * 8;
  _272 = &A[0][0] + _271;
  _273 = *_272;
  _280 = _63 * _273;
  _281 = _267 + _280;
  *_371 = _281;
  l_k_32 = l_k_71 + 1;
  if (l_k_32 == 17)
    goto <bb 5>; [5.89%]
  else
    goto <bb 4>; [94.11%]
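For reference, here is a minimal hand-written sketch of the transformation
being asked for, covering one iteration of the l_n loop.  The names acc and
b and the surrounding declarations are assumed for illustration; they are
not in the attached test.  After cunrolli unrolls the small l_m loops, the
ten acc elements become ten scalars that can live in registers:

  /* Sketch only: assumes double *restrict A, B, C and int l_n, l_m, l_k
     as in the attached test; acc and b are invented for this example.  */
  double acc[10];
  for ( l_m = 0; l_m < 10; l_m++ )
    acc[l_m] = C[(l_n*10)+l_m];          /* load each accumulator once   */
  for ( l_k = 0; l_k < 17; l_k++ ) {
    double b = B[(l_n*20)+l_k];          /* invariant across l_m         */
    for ( l_m = 0; l_m < 10; l_m++ )
      acc[l_m] += A[(l_k*20)+l_m] * b;   /* no loads/stores of C in loop */
  }
  for ( l_m = 0; l_m < 10; l_m++ )
    C[(l_n*10)+l_m] = acc[l_m];          /* store each accumulator once  */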
Each of *_290, *_299, ..., *_371 should be independently optimizable.

At -O3, if -fgcse-sm is specified, the RTL store motion pass is able to
optimize the code.  I'm told that this pass is not enabled by default at
-O3, at least in part for efficiency reasons.  But its success here would
tend to indicate that we don't have an aliasing problem, so this looks like
just a missed opportunity in the GIMPLE optimizations.
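For comparison, something like the following shows the difference (the file
name is a placeholder for the attached source; the flags are the ones
discussed above):

  gcc -O3 -S test.c             # stores to C stay inside the l_k loop
  gcc -O3 -fgcse-sm -S test.c   # RTL store motion sinks them out of the loop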