https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

            Bug ID: 83403
           Summary: Missed register promotion opportunities in loop
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wschmidt at gcc dot gnu.org
  Target Milestone: ---

Created attachment 42857
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42857&action=edit
Source file

One of our performance folks ran across a large performance opportunity inside
libxsmm.  The attached test shows a loop nest that would run about 25% faster
if load/store motion could promote some memory expressions to registers within
the loop.

The test contains this loop nest:

  for ( l_n = 0; l_n < 9; l_n++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) { C[(l_n*10)+l_m] = 0.0; }

    for ( l_k = 0; l_k < 17; l_k++ ) {
      #pragma simd
      for ( l_m = 0; l_m < 10; l_m++ ) {
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }

The cunrolli phase unrolls the innermost loop fully, so that we have ten
separate accumulators.  Inside the l_k loop, each of these accumulators could
be loaded once prior to the loop, remain in a register during the loop, and be
stored once upon exit.  GCC is not able to do this.  Note that A, B, and C are
restrict pointers, so there shouldn't be an aliasing issue.  The ten
accumulators are array elements of C (common base register, different offsets).
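For illustration, the transformation being asked for amounts to the following
hand-written rewrite of the l_k loop (just a sketch, not taken from the
attached test; the function name and signature are made up).  Once cunrolli
fully unrolls the l_m loops, the ten acc[] elements would simply be ten
registers:

  static void compute_column (double * restrict C, const double * restrict A,
                              const double * restrict B, int l_n)
  {
    double acc[10];
    int l_k, l_m;

    /* Load each accumulator once before the l_k loop.  */
    for (l_m = 0; l_m < 10; l_m++)
      acc[l_m] = C[(l_n*10)+l_m];

    for (l_k = 0; l_k < 17; l_k++) {
      double b = B[(l_n*20)+l_k];
      /* No loads or stores of C inside the loop.  */
      for (l_m = 0; l_m < 10; l_m++)
        acc[l_m] += A[(l_k*20)+l_m] * b;
    }

    /* Store each accumulator once after the l_k loop.  */
    for (l_m = 0; l_m < 10; l_m++)
      C[(l_n*10)+l_m] = acc[l_m];
  }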

It looks like load/store motion is done as part of pass_lim.  I don't know
whether it is intended to be able to handle array elements.  For simpler cases,
it looks like this gets optimized in lim2.
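By a "simpler case" I mean something like the following (again a made-up
sketch, not from the attached test), where the promoted memory expression is a
single element addressed through a restrict pointer; there it looks like store
motion in lim2 keeps c[0] in a register across the loop:

  void simple (double * restrict c, const double * restrict a, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      c[0] += a[i];   /* c[0] is loaded once before the loop, accumulated in
                         a register, and stored once after the loop.  */
  }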

Prior to lim2, the l_k loop is as follows:

  <bb 4> [local count: 97603132]:
  # l_k_71 = PHI <0(3), l_k_32(4)>
  _51 = *_290;
  _52 = l_k_71 * 20;
  _54 = (long unsigned int) _52;
  _55 = _54 * 8;
  _56 = &A[0][0] + _55;
  _57 = *_56;
  _58 = l_n_70 * 20;
  _59 = _58 + l_k_71;
  _60 = (long unsigned int) _59;
  _61 = _60 * 8;
  _62 = &B[0][0] + _61;
  _63 = *_62;
  _64 = _57 * _63;
  _65 = _51 + _64;
  *_290 = _65;
  _75 = *_299;
  _77 = _52 + 1;
  _78 = (long unsigned int) _77;
  _79 = _78 * 8;
  _80 = &A[0][0] + _79;
  _81 = *_80;
  _88 = _63 * _81;
  _89 = _75 + _88;
  *_299 = _89;
  _99 = *_308;
  _101 = _52 + 2;
  _102 = (long unsigned int) _101;
  _103 = _102 * 8;
  _104 = &A[0][0] + _103;
  _105 = *_104;
  _112 = _63 * _105;
  _113 = _99 + _112;
  *_308 = _113;
  _123 = *_317;
  _125 = _52 + 3;
  _126 = (long unsigned int) _125;
  _127 = _126 * 8;
  _128 = &A[0][0] + _127;
  _129 = *_128;
  _136 = _63 * _129;
  _137 = _123 + _136;
  *_317 = _137;
  _147 = *_326;
  _149 = _52 + 4;
  _150 = (long unsigned int) _149;
  _151 = _150 * 8;
  _152 = &A[0][0] + _151;
  _153 = *_152;
  _160 = _63 * _153;
  _161 = _147 + _160;
  *_326 = _161;
  _171 = *_335;
  _173 = _52 + 5;
  _174 = (long unsigned int) _173;
  _175 = _174 * 8;
  _176 = &A[0][0] + _175;
  _177 = *_176;
  _184 = _63 * _177;
  _185 = _171 + _184;
  *_335 = _185;
  _195 = *_344;
  _197 = _52 + 6;
  _198 = (long unsigned int) _197;
  _199 = _198 * 8;
  _200 = &A[0][0] + _199;
  _201 = *_200;
  _208 = _63 * _201;
  _209 = _195 + _208;
  *_344 = _209;
  _219 = *_353;
  _221 = _52 + 7;
  _222 = (long unsigned int) _221;
  _223 = _222 * 8;
  _224 = &A[0][0] + _223;
  _225 = *_224;
  _232 = _63 * _225;
  _233 = _219 + _232;
  *_353 = _233;
  _243 = *_362;
  _245 = _52 + 8;
  _246 = (long unsigned int) _245;
  _247 = _246 * 8;
  _248 = &A[0][0] + _247;
  _249 = *_248;
  _256 = _63 * _249;
  _257 = _243 + _256;
  *_362 = _257;
  _267 = *_371;
  _269 = _52 + 9;
  _270 = (long unsigned int) _269;
  _271 = _270 * 8;
  _272 = &A[0][0] + _271;
  _273 = *_272;
  _280 = _63 * _273;
  _281 = _267 + _280;
  *_371 = _281;
  l_k_32 = l_k_71 + 1;
  if (l_k_32 == 17)
    goto <bb 5>; [5.89%]
  else
    goto <bb 4>; [94.11%]

Each of *_290, *_299, ..., *_371 should be independently optimizable.

At -O3, if -fgcse-sm is specified, the RTL store motion pass is able to
optimize the code.  I'm told that pass is not enabled by default at -O3, at
least in part because the pass itself is inefficient.  But the fact that it
succeeds tends to indicate that there is no aliasing problem here, so this
looks like simply a missed opportunity in the GIMPLE optimizations.
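For reference, the comparison above amounts to building the attached test with
and without the extra flag, along the lines of (the file name here is just a
placeholder for the attached source):

  gcc -O3 matmul.c               # loads/stores of C stay inside the l_k loop
  gcc -O3 -fgcse-sm matmul.c     # RTL store motion promotes the accumulators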
