https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #5)
> Note even when avoiding the STLF hit the vectorized version is slower.
> You can use -mtune-ctl=^sse_unaligned_load_optimal to force loading
> the lower/upper half of vectors separately.
>
This leads to extra instructions (2 extra loads), and if the vectorizer knew that, it would find that the cost of the vectorized version is larger than that of the scalar version.

> The reason is that without -ffast-math we are using an in-order reduction
> which doesn't save us much but instead just combines dependence chains
> here. We do have a related bug for this somewhere.
>
> With -ffast-math the version with/without
> -mtune-ctl=^sse_unaligned_load_optimal
> is about the same speed, so STLF is a red herring here (on Zen2).
>
> Still not vectorizing is a lot faster.
>
Yes. Vectorization does not improve performance here (compare -O2 -funroll-loops vs -O2 -ftree-vectorize -funroll-loops), so I'm wondering if we can adjust the heuristic or cost model so that the loop is not vectorized.

> Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?

It doesn't help.
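
For readers unfamiliar with the "in-order reduction" mentioned in the quoted comment, here is a minimal sketch in C. It is not the test case from this PR; the function names and the two-accumulator split are assumptions chosen purely for illustration. The first function keeps the additions in source order, which is what the compiler has to preserve without -ffast-math. The second shows the kind of reassociated sum (independent partial sums) that becomes permissible with -ffast-math.

```c
#include <stddef.h>

/* In-order reduction: every addition depends on the previous one,
   so the loop forms a single serial dependence chain. */
double sum_in_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reassociated reduction (illustrative): two independent partial sums
   that are combined once at the end. This changes the order of the
   floating-point additions, so the result may differ in rounding. */
double sum_reassociated(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0: even elements */
        s1 += a[i + 1];  /* chain 1: odd elements */
    }
    if (i < n)
        s0 += a[i];      /* leftover element when n is odd */
    return s0 + s1;
}
```

Both functions return the same value when every intermediate sum is exactly representable (for example, small integers stored as doubles); in general the two results can differ in the last bits, which is why the reassociated form is not allowed by default.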