https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsm28 at gcc dot gnu.org
--- Comment #28 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Paul Caprioli from comment #27)
> The motivation for this bug report was accuracy (one might even say
> correctness), not so much performance. Using FMA in a complex product gives
> lower maximum relative normwise error. An explanation is given in section
> 1.1 of https://inria.hal.science/hal-04714173 (and references are given to
> papers proving the theory).
>
> The experimental results in that paper in sections 3 and 4 show that GCC is
> more accurate than clang for complex multiplication for the code that was
> tested. GCC (unlike clang) is using FMA for that code, which is great.
I see. ISTR that in some other bug Joseph M. said that using FMA directly in
the lowering of complex multiplication makes some corner cases incorrect, but
I get that this is what you'd rather see - FMA always used, not only as a
side effect of optimization.
This bug derailed into the optimization part of the compiler, ending up with
a focus on vectorization cost modeling. I suppose I should split it up ...
> The "always" in the title of this bug expresses the desire to have FMA used
> regardless of whether a function is inlined, whether constant propagation
> allows compile-time computation of the product, whether the code is
> vectorized, and regardless of cost model or other optimization decisions.
> For scientific work, it's nice to have this robustness.
>
> As an aside comment, in the code for "fast complex" at the bottom of comment
> 26, I'm not sure I understand:
>
> vmovshdup %xmm0, %xmm4
> vmovss %xmm0, -8(%rsp)
> vmovss %xmm4, -4(%rsp)
> vmovq -8(%rsp), %xmm0
>
> It seems %xmm0 is split into two scalars, which are each stored, and then
> %xmm0 is loaded to the same value it already has. (If %xmm0 needs to be
> stored on the stack, then one 8-byte store could be used instead of the
> shuffle (vmovshdup) and the two 4-byte stores.)
Yes, that part is definitely very bad as it also breaks store-to-load
forwarding (the 8-byte load spans two separate 4-byte stores). It's one of
those argument-return issues that plague GCC.