https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #27 from Paul Caprioli <paul at hpkfft dot com> ---
The motivation for this bug report was accuracy (one might even say
correctness), not so much performance.  Using FMA in a complex product gives
lower maximum relative normwise error.  An explanation is given in section 1.1
of https://inria.hal.science/hal-04714173 (with references to the papers
proving the underlying theory).

The experimental results in sections 3 and 4 of that paper show that, for the
code tested, GCC is more accurate than clang for complex multiplication.  GCC
(unlike clang) uses FMA for that code, which is great.

The "always" in the title of this bug expresses the desire to have FMA used
regardless of whether a function is inlined, whether constant propagation
allows compile-time computation of the product, whether the code is vectorized,
and regardless of cost model or other optimization decisions.  For scientific
work, it's nice to have this robustness.

As an aside, in the code for "fast complex" at the bottom of comment 26,
there is a sequence I don't understand:

        vmovshdup       %xmm0, %xmm4
        vmovss  %xmm0, -8(%rsp)
        vmovss  %xmm4, -4(%rsp)
        vmovq   -8(%rsp), %xmm0

It seems %xmm0 is split into two scalars, which are each stored, and then
%xmm0 is reloaded with the same value it already holds.  (If %xmm0 needs to
be stored on the stack at all, a single 8-byte store would suffice, instead
of the shuffle (vmovshdup) and the two 4-byte stores.)
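
For reference, a sketch of what I mean (assuming the spill is needed at all):
the low 64 bits of %xmm0 already contain both floats, so something like

        vmovq   %xmm0, -8(%rsp)

would store the pair in one instruction, and the reload of %xmm0 could be
dropped entirely since it does not change the register.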
