https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #8 from mjr19 at cam dot ac.uk ---
If it is tricky to teach gfortran that it can trivially flip the signs of
alternate elements in a vector with an xor, would a possible step towards an
improvement be to teach it that vpermpd (as opposed to vpermilpd) is costly
on most Intel processors (3 cycle latency, one cycle throughput, just one
functional unit)? The "optimisation" of using several vperms to save the odd
vadd or vmul is then a step backwards, not forwards.
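
For illustration only, here is a minimal scalar sketch in C of the xor trick
mentioned above. It is not the code from this report and not what gfortran
emits; the helper name flip_alternate_signs is hypothetical, and the sketch
assumes IEEE 754 doubles, where bit 63 is the sign bit. A vectorised version
would apply the same xor to a whole register against a constant mask, with no
permute needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate every odd-indexed element of x by
 * XORing its sign bit. Assumes IEEE 754 binary64 doubles. */
static void flip_alternate_signs(double *x, size_t n)
{
    const uint64_t sign_bit = UINT64_C(1) << 63;

    assert(x != NULL || n == 0);

    for (size_t i = 1; i < n; i += 2) {
        uint64_t bits;
        /* memcpy avoids undefined behaviour from pointer type punning. */
        memcpy(&bits, &x[i], sizeof bits);
        bits ^= sign_bit;               /* flip the sign, leave the rest */
        memcpy(&x[i], &bits, sizeof bits);
    }
}
```

Even-indexed elements are left untouched, which corresponds to xoring them
with an all-zero mask lane in a vector implementation.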

The cost model seems to be wrong, in that there are several cases where
-ffast-math makes things slower on all Intel CPUs to which I have access,
including when I set -march=native. In this particularly bad case, -ffast-math
adds about 65% to the runtime on a Kaby Lake.
