https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsm28 at gcc dot gnu.org

--- Comment #28 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Paul Caprioli from comment #27)
> The motivation for this bug report was accuracy (one might even say
> correctness), not so much performance.  Using FMA in a complex product gives
> lower maximum relative normwise error.  An explanation is given in section
> 1.1 of https://inria.hal.science/hal-04714173 (and references are given to
> papers proving the theory).
> 
> The experimental results in that paper in sections 3 and 4 show that GCC is
> more accurate than clang for complex multiplication for the code that was
> tested.  GCC (unlike clang) is using FMA for that code, which is great.

I see.  ISTR that in some other bug Joseph M. said that using FMA directly in
the lowering of complex multiplication makes some corner cases incorrect, but I
get that this is what you'd rather see - FMA used always, not only as a
side effect of optimization.

This bug derailed into the optimization part of the compiler, with a focus
on vectorization cost modeling in the end.  I suppose I should split it up ...

> The "always" in the title of this bug expresses the desire to have FMA used
> regardless of whether a function is inlined, whether constant propagation
> allows compile-time computation of the product, whether the code is
> vectorized, and regardless of cost model or other optimization decisions. 
> For scientific work, it's nice to have this robustness.
> 
> As an aside comment, in the code for "fast complex" at the bottom of comment
> 26, I'm not sure I understand:
> 
>         vmovshdup       %xmm0, %xmm4
>         vmovss  %xmm0, -8(%rsp)
>         vmovss  %xmm4, -4(%rsp)
>         vmovq   -8(%rsp), %xmm0
> 
> It seems %xmm0 is split into two scalars, which are each stored, and then
> %xmm0 is loaded to the same value it already has.  (If %xmm0 needs to be
> stored on the stack, then one 8-byte store could be used instead of the
> shuffle (vmovshdup) and the two 4-byte stores.)

Yes, that part is definitely very bad, as it also breaks store-to-load
forwarding.  It's one of those argument-return issues that plague GCC.
