[Bug tree-optimization/115845] 25% runtime regression of 527.cam4_r when enabling --param vect-partial-vector-usage={1,2} ontop of -Ofast --march=znver4

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 09 Jul 2024 23:14:49 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, so the main issue is that we are traversing a multi-dimensional
array and the above is the innermost loop.  The next outer iteration
loads from the 512-bit masked stored value at an offset of 256 bits
(the loops accumulate one array into another).  On Zen4 at least, the
masking does not enable store forwarding.  The 256-bit data path might
suggest it could, but possibly the memory model for "masked" elements
prohibits this.

So the performance hit is caused mainly by store-to-load forwarding
failures.
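
The address arithmetic behind this can be sketched in plain C.  This is
only a rough model, not what GCC emits: it assumes each row of n doubles
is handled by a single 512-bit masked access whose full-width footprint
is what matters for forwarding, and the names VEC_BYTES, row_offset and
overlap_bytes are made up for illustration.

```c
#include <assert.h>

enum { VEC_BYTES = 64, ELT_BYTES = 8 };  /* 512-bit vector, 8-byte double */

/* Byte offset, relative to the array base, of the first element of row j
   when each row holds n doubles.  */
static long
row_offset (int j, int n)
{
  return (long) j * n * ELT_BYTES;
}

/* Number of bytes shared between the full 64-byte footprint of the masked
   store for row j and the active lanes of the masked load for row j + 1.
   Assumption: one partial vector per row, i.e. n doubles fit in 512 bits.  */
static long
overlap_bytes (int j, int n)
{
  assert (n > 0 && n * ELT_BYTES <= VEC_BYTES);

  long store_lo = row_offset (j, n);
  long store_hi = store_lo + VEC_BYTES;             /* masked-off lanes included */
  long load_lo = row_offset (j + 1, n);
  long load_hi = load_lo + (long) n * ELT_BYTES;    /* active lanes only */

  long lo = store_lo > load_lo ? store_lo : load_lo;
  long hi = store_hi < load_hi ? store_hi : load_hi;
  return hi > lo ? hi - lo : 0;
}
```

For n = 4, consecutive rows start 32 bytes (256 bits) apart, and
overlap_bytes (0, 4) is 32: the active lanes of the next row's load lie
entirely inside the previous 512-bit store's footprint, in lanes that
store masked off.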

A testcase would be

#include <stdlib.h>

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n] + b[i + j*n];
}

double a[1024];
double b[1024];

int main(int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < m; ++i)
    test (a + 4, b + 4, 4, 1024/4 - 1);  /* keep a + 4 and b + 4 in bounds */
  return 0;
}

and I've already reported this as PR110456 it seems ...

*** This bug has been marked as a duplicate of bug 110456 ***
