https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, so the main issue is that we are traversing a multi-dimensional array
and the above is the innermost loop. The next outer iteration will load from
the 512-bit masked stored value, offset by 256 bits (the loops accumulate one
array into another). On Zen4 at least, the masking does not enable store
forwarding. The 256-bit data path might suggest it could, but possibly the
memory model for "masked" elements prohibits this. So the performance hit is
caused mainly by store-to-load forwarding failing.

A testcase would be

#include <stdlib.h>

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n] + b[i + j*n];
}

double a[1024];
double b[1024];

int
main (int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < m; ++i)
    test (a + 4, b + 4, 4, 1024/4);
  return 0;
}

and I've already reported this as PR110456 it seems ...

*** This bug has been marked as a duplicate of bug 110456 ***
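
For illustration only, the overlap described in the comment can be sketched as plain byte-range arithmetic. This is a reading of the comment, not a model of the hardware: it assumes 8-byte doubles, that each row of n doubles is written with a single 512-bit (64-byte) masked store starting at the row's first element, and that only the n active doubles of the next row's load matter. The names `byte_range`, `masked_store_footprint`, `row_load` and `load_hits_previous_store` are hypothetical helpers, not part of GCC or the testcase.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Half-open byte range [begin, end), relative to the start of array a.  */
struct byte_range { size_t begin, end; };

/* Assumed footprint of the 512-bit masked store issued for row j: it
   starts at a[j*n] and spans 64 bytes as an instruction, even though
   only n doubles are actually written.  */
struct byte_range
masked_store_footprint (size_t j, size_t n)
{
  /* The row must fit in one 512-bit vector for this sketch to apply.  */
  assert (n * sizeof (double) <= 64);
  size_t begin = j * n * sizeof (double);
  return (struct byte_range) { begin, begin + 64 };
}

/* Bytes read by the active lanes of row j's load (n doubles).  */
struct byte_range
row_load (size_t j, size_t n)
{
  size_t begin = j * n * sizeof (double);
  return (struct byte_range) { begin, begin + n * sizeof (double) };
}

/* True if the load of row j+1 overlaps the footprint of the store
   issued for row j.  */
bool
load_hits_previous_store (size_t j, size_t n)
{
  struct byte_range st = masked_store_footprint (j, n);
  struct byte_range ld = row_load (j + 1, n);
  return ld.begin < st.end && ld.end > st.begin;
}
```

With n = 4 as in the testcase, the store for row 0 covers bytes 0..63 and the load for row 1 starts at byte 32, i.e. 256 bits into the previous store's footprint, so `load_hits_previous_store (0, 4)` is true. With n = 8 the rows are a full vector apart and the ranges no longer overlap.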