https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, so the main issue is that we are traversing a multi-dimensional
array and the above is the innermost loop.  The next outer iteration
then loads from the 512bit masked stored value, offset by 256bits (the
loops accumulate one array into another).  On Zen4 at least, the
masking does not enable store forwarding - the 256bit data path might
suggest it could - but possibly the memory model for "masked" elements
prohibits this.

So the performance hit is mainly caused by store-to-load forwarding
failing.
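
To make the offsets concrete, here is a rough model of the address
ranges involved.  It is only a sketch under assumptions: rows of 4
doubles (as in the testcase), 8-lane 512bit vectors with the low 4 lanes
active, and hypothetical helper names.  It is not what GCC emits and it
does not model how Zen4 actually tracks masked stores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed model: one row is 4 doubles (256 bits), one vector is
   8 doubles (512 bits) with only the low 4 lanes enabled by the mask.  */
enum { ROW_BYTES = 32, VEC_BYTES = 64 };

struct span { size_t begin, end; };	/* half-open byte range */

/* Full 512bit address range covered by the masked store of outer
   iteration J, including the masked-off upper half.  */
static inline struct span
store_footprint (size_t j)
{
  struct span s;
  s.begin = j * ROW_BYTES;
  s.end = s.begin + VEC_BYTES;
  return s;
}

/* Bytes actually read and written by the active lanes in iteration J.  */
static inline struct span
active_bytes (size_t j)
{
  struct span s;
  s.begin = j * ROW_BYTES;
  s.end = s.begin + ROW_BYTES;
  return s;
}

static inline bool
overlaps (struct span a, struct span b)
{
  return a.begin < b.end && b.begin < a.end;
}
```

In this model the load of iteration j + 1 starts 32 bytes (256 bits)
into the store footprint of iteration j, so the two overlap even though
the bytes the store really wrote do not overlap the bytes the load
needs.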

A testcase would be

#include <stdlib.h>  /* atoi */

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n] + b[i + j*n];
}

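/* 4 extra elements: main passes a + 4 / b + 4 and test touches 1024
   doubles from there.  */
double a[1024 + 4];
double b[1024 + 4];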

int main(int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < m; ++i)
    test (a + 4, b + 4, 4, 1024/4);
  return 0;
}

and I've already reported this as PR110456 it seems ...

*** This bug has been marked as a duplicate of bug 110456 ***
