https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

            Bug ID: 98598
           Summary: Missed opportunity to optimize dependent loads in
                    loops
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

As we know, dependent loads are not friendly to cache. Especially when in
nested loops, dependent loads such as pa->pb->pc->val may be repeated many
times. For example:

typedef struct C { int val; } C;
typedef struct B { C *pc; } B;
typedef struct A { B *pb; } A;

int foo (int n, int m, A *pa) {
  int sum;

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      sum += pa[j].pb->pc->val;  // each value is repeatedly loaded "n" times
      // ...
    }
    // ...
  }

  return sum;
}

Such access pattern can be found in real applications and benchmarks, and this
can be critical to performance.

Can we cache the loaded value and avoid repeated dependent loads? E.g.
transform above case into following (suppose there is no alias issue or other
clobber, and "n" is big enough):

int foo (int n, int m, A *pa) {
  int *cache = (int *) malloc(m * sizeof(int));
  for (int j = 0; j < m; j++) {
    cache[j] = pa[j].pb->pc->val;
  }

  int sum;

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      sum += cache[j];   // pa[j].pb->pc->val;
      // ...
    }
    // ...
  }

  free(cache);
  return sum;
}

This should improve performance a lot.

Reply via email to