https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #5)
> Note even when avoiding the STLF hit the vectorized version is slower.
> You can use -mtune-ctl=^sse_unaligned_load_optimal to force loading
> the lower/upper half of vectors separately.
> 
This leads to extra instructions (2 extra loads), and if the vectorizer knew
that, it would find that the vectorized cost is higher than the scalar cost.

> The reason is that without -ffast-math we are using an in-order reduction
> which doesn't save us much but instead just combines dependence chains
> here.  We do have a related bug for this somewhere.
> 
> With -ffast-math the version with/without
> -mtune-ctl=^sse_unaligned_load_optimal
> is about the same speed, so STLF is a red herring here (on Zen2).
> 
> Still not vectorizing is a lot faster.
> 

Yes. Vectorization does not improve performance here (compare
-O2 -funroll-loops vs. -O2 -ftree-vectorize -funroll-loops), so I'm wondering
if we can adjust the heuristic or cost model so that the loop is not
vectorized.
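
For illustration only, the difference between the in-order reduction quoted
above and the reassociated form that -ffast-math permits can be sketched in C
roughly as follows. This is not the loop from the benchmark; the function
names and the two-accumulator split are hypothetical, chosen just to show why
the in-order form stays a single dependence chain.

```c
#include <stddef.h>

/* In-order reduction: every add depends on the previous one, so the
   loop is one long dependence chain. Without -ffast-math the compiler
   must preserve this order of floating-point additions. */
double sum_in_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reassociated reduction (hypothetical hand-written equivalent of what
   -ffast-math allows): two independent partial sums, combined at the
   end. The result may differ from sum_in_order in the last bits. */
double sum_reassoc(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0 */
        s1 += a[i + 1];  /* chain 1, independent of chain 0 */
    }
    if (i < n)           /* odd tail element */
        s0 += a[i];
    return s0 + s1;
}
```

Building a file like this once with -O2 -funroll-loops and once with
-O2 -ftree-vectorize -funroll-loops, and comparing the generated assembly for
sum_in_order, is one way to see what the vectorizer does with an in-order
reduction.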

> Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?

No, it doesn't help on CLX.
