https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-15
             Status|UNCONFIRMED                 |NEW

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that even when avoiding the STLF (store-to-load forwarding) hit,
the vectorized version is slower.  You can use
-mtune-ctl=^sse_unaligned_load_optimal to force loading the lower/upper
halves of vectors separately.

The reason is that without -ffast-math we are using an in-order reduction,
which doesn't save us much but instead just combines dependence chains
here.  We do have a related bug for this somewhere.
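
To illustrate the difference, here is a minimal sketch.  It is not the
testcase from this PR; the function names and the plain sum loop are
made up for illustration.  The first function is what an in-order
reduction has to preserve: every add depends on the previous one.  The
second shows, in scalar form, the kind of reassociation that only
becomes legal with -ffast-math (or -fassociative-math), where
independent partial sums break the single dependence chain.

```c
#include <stddef.h>

/* Hypothetical example: in-order reduction.  Each addition depends on
   the result of the previous one, so there is one serial chain of
   n floating-point adds no matter how the loads are done.  */
double
sum_in_order (const double *a, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Hypothetical example: reassociated reduction with two partial sums.
   The two chains are independent and can overlap in the pipeline, but
   the additions happen in a different order, so the result may differ
   in rounding from sum_in_order.  */
double
sum_reassociated (const double *a, size_t n)
{
  double s0 = 0.0, s1 = 0.0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s0 += a[i];
      s1 += a[i + 1];
    }
  if (i < n)
    s0 += a[i];          /* odd element count: fold in the last one */
  return s0 + s1;
}
```

Both functions agree whenever all intermediate sums are exactly
representable; in general they can differ in the last bits, which is
why the compiler may not rewrite the first form into the second
without the fast-math flags.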

With -ffast-math the versions with and without
-mtune-ctl=^sse_unaligned_load_optimal
run at about the same speed, so STLF is a red herring here (on Zen2).

Still, not vectorizing is a lot faster.

Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?
