https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-15
             Status|UNCONFIRMED                 |NEW

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that even when avoiding the STLF hit, the vectorized version is slower.
You can use -mtune-ctl=^sse_unaligned_load_optimal to force loading the
lower/upper half of vectors separately.

The reason is that without -ffast-math we are using an in-order reduction,
which doesn't save us much but instead just combines dependence chains here.
We do have a related bug for this somewhere.

With -ffast-math the version with/without
-mtune-ctl=^sse_unaligned_load_optimal is about the same speed, so STLF is a
red herring here (on Zen2). Still, not vectorizing is a lot faster.

Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?
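
To illustrate the in-order reduction point, here is a minimal sketch in scalar C. It is not the testcase from this PR, and the function names sum_in_order and sum_reassociated are made up for illustration. The first loop has to add in source order, so every add waits on the previous one. The second shows the kind of reassociation into independent partial sums that is only permitted once -ffast-math (via -fassociative-math) is in effect.

```c
#include <stddef.h>

/* In-order reduction: each addition depends on the result of the
   previous one, so the loop forms a single serial dependence chain.
   Without -ffast-math the additions must happen in exactly this order. */
double sum_in_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reassociated reduction, written out by hand: four independent
   partial sums that are combined only after the main loop.  This
   changes the order of the floating-point additions, so the result
   may differ in rounding from sum_in_order.  */
double sum_reassociated(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent dependence chains.  */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the leftover elements.  */
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions agree on inputs where every intermediate sum is exactly representable. For general inputs they may differ in the last bits, which is why the second form needs -ffast-math before the compiler may produce it on its own.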