https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116979

--- Comment #32 from Paul Caprioli <paul at hpkfft dot com> ---
> Yes, that part is definitely very bad as it also breaks store-to-load
> forwarding.  It's one of those argument-return issues that plagues GCC.

Yes, since x86 provides total store order, all the older stores in the store
buffer have to drain as well, so that's the biggest performance problem.  (I
just now noticed this was mentioned in comment 18.  Sorry for repeating.)
For the sake of conversation, let's assume the function is inlined so that
this problem goes away.

I was thinking about your comment about using AVX512 embedded broadcast.  I
think that's a nice idea for a complex multiply.  Without the embedded
broadcast, the code below is 2 loads, 3 shuffles, and 2 floating-point
instructions.  If hardware has only 1 shuffle port (e.g., Intel Sapphire
Rapids), that's the resource constraint.

        vmovq   (%rdi), %xmm0
        vmovq   (%rsi), %xmm2
        vmovsldup       %xmm0, %xmm1
        vmovshdup       %xmm0, %xmm0
        vshufps $0xe1, %xmm2, %xmm2, %xmm3
        vmulps  %xmm3, %xmm0, %xmm0
        vfmaddsub231ps  %xmm2, %xmm1, %xmm0
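
To check the lane bookkeeping in the sequence above, here is a minimal scalar
sketch in C that models each instruction on the low two float lanes.  The type
vec2 and all helper names are hypothetical, made up for this illustration; it
is not what GCC emits and uses no intrinsics.

```c
/* Scalar model of the 7-instruction sequence above.  Each helper mimics one
   instruction on the low two float lanes only.  All names are hypothetical. */
typedef struct { float lane[2]; } vec2;

/* vmovsldup: duplicate the even (real) lane. */
static vec2 movsldup_model(vec2 v) { return (vec2){{ v.lane[0], v.lane[0] }}; }

/* vmovshdup: duplicate the odd (imaginary) lane. */
static vec2 movshdup_model(vec2 v) { return (vec2){{ v.lane[1], v.lane[1] }}; }

/* vshufps $0xe1 with both sources equal: swaps lanes 0 and 1. */
static vec2 swap_model(vec2 v) { return (vec2){{ v.lane[1], v.lane[0] }}; }

/* vmulps: lane-wise multiply. */
static vec2 mul_model(vec2 a, vec2 b)
{
    return (vec2){{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
}

/* vfmaddsub231ps: even lane computes a*b - c, odd lane computes a*b + c.
   The real instruction rounds once; this model may round twice. */
static vec2 fmaddsub_model(vec2 a, vec2 b, vec2 c)
{
    return (vec2){{ a.lane[0] * b.lane[0] - c.lane[0],
                    a.lane[1] * b.lane[1] + c.lane[1] }};
}

/* x = {re, im} as loaded by the first vmovq, y likewise by the second. */
static vec2 complex_mul_model(vec2 x, vec2 y)
{
    vec2 re  = movsldup_model(x);        /* {xr, xr} */
    vec2 im  = movshdup_model(x);        /* {xi, xi} */
    vec2 ysw = swap_model(y);            /* {yi, yr} */
    vec2 t   = mul_model(im, ysw);       /* {xi*yi, xi*yr} */
    return fmaddsub_model(re, y, t);     /* {xr*yr - xi*yi, xr*yi + xi*yr} */
}
```

With x = {1, 2} and y = {3, 4}, the model yields {-5, 10}, which matches
(1+2i)(3+4i) = -5+10i.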

For double, using the EVEX.128 encoding, I think your idea just works!  The two
dups are eliminated and only one shuffle remains.
For float, I think it's good to avoid setting the invalid bit in MXCSR in case
an infinity is broadcast and multiplied by zero.  Instead of using a mask
register, maybe the vshufps can use $0x11 so that bits [127:64] are written to
be the same as [63:0].  The idea is to perform the same math in the top 64 bits
as in the bottom 64.
For _Float16, the complex multiply instruction VFMULCSH does all the shuffles
for free in the floating-point unit, so that's best.
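
To make the float case concrete, here is a small C sketch of how the vshufps
immediate selects lanes, comparing $0xe1 with $0x11.  The names vec4,
shufps_model, and any_nan_product are hypothetical.  The model only detects a
NaN result (which is what inf * 0 produces); it does not read MXCSR.

```c
#include <math.h>   /* INFINITY */

typedef struct { float lane[4]; } vec4;   /* hypothetical 128-bit register */

/* Model of the 128-bit vshufps: result lanes 0-1 are picked from src1 and
   lanes 2-3 from src2, each by a 2-bit field of imm8. */
static vec4 shufps_model(vec4 src1, vec4 src2, unsigned imm8)
{
    vec4 r;
    r.lane[0] = src1.lane[(imm8 >> 0) & 3];
    r.lane[1] = src1.lane[(imm8 >> 2) & 3];
    r.lane[2] = src2.lane[(imm8 >> 4) & 3];
    r.lane[3] = src2.lane[(imm8 >> 6) & 3];
    return r;
}

/* Multiply every lane by a broadcast scalar and report whether any product
   is NaN.  inf * 0 gives NaN, the case that would also set the invalid bit. */
static int any_nan_product(float broadcast, vec4 v)
{
    for (int k = 0; k < 4; k++) {
        float p = broadcast * v.lane[k];
        if (p != p)
            return 1;
    }
    return 0;
}
```

For y = {3, 4, 0, 0} (upper lanes zeroed by vmovq), $0xe1 gives {4, 3, 0, 0},
so a broadcast infinity meets a zero in the upper lanes.  $0x11 gives
{4, 3, 4, 3}, so the upper lanes repeat the lower computation.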

For multiplying arrays of complex numbers (not just one complex multiply), the
embedded broadcast doesn't seem useful.  You'd want to use the full SIMD
register to do more than a single complex multiply at a time.
For single:

    vmovsldup (%rax), %ymm0
    vmovshdup (%rax), %ymm1
    vmovups   (%rdx), %ymm4
    vshufps $0xB1, %ymm4, %ymm4, %ymm5
    vmulps         %ymm1, %ymm5, %ymm5
    vfmaddsub231ps %ymm0, %ymm4, %ymm5

Above is 3 loads, 1 shuffle, 2 floating-point instructions.  The duplication is
done for free in the load unit by using the memory-operand form of the dup
instructions.
For double, sadly there's no such thing as vmovdhdup.  It's probably good to
use vmovddup from memory to do the one duplication for free in the load unit. 
The other duplication is a register-to-register shuffle.  So, 3 loads, 2
shuffles, 2 floating-point instructions.
For _Float16, VFMULCPH does all shuffles for free in the floating-point unit.
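
For the double array case, here is a rough scalar sketch in C of one way the
"3 loads, 2 shuffles, 2 floating-point instructions" count could map onto a
256-bit step.  The function name cmul_pd_model is hypothetical, and the choice
of register-to-register shuffles (something like vpermilpd or vunpckhpd) is my
assumption.  It assumes an even number of complex elements and that out does
not alias x or y.

```c
#include <stddef.h>

enum { LANES = 4 };   /* doubles per 256-bit register: two complex values */

/* x, y, out hold interleaved {re, im} pairs; n_complex is assumed even. */
static void cmul_pd_model(const double *x, const double *y,
                          double *out, size_t n_complex)
{
    for (size_t i = 0; i + 1 < n_complex; i += 2) {
        const double *xv = x + 2 * i;
        const double *yv = y + 2 * i;
        double re[LANES], im[LANES], ysw[LANES], t[LANES];

        for (int k = 0; k < LANES; k++) {
            re[k]  = xv[k & ~1];       /* load 1: vmovddup from memory */
            im[k]  = xv[k | 1];        /* load 2, shuffle 1: dup odd lanes */
            ysw[k] = yv[k ^ 1];        /* load 3, shuffle 2: swap in pairs */
            t[k]   = im[k] * ysw[k];   /* vmulpd */
        }
        for (int k = 0; k < LANES; k++) {
            /* vfmaddsub231pd: subtract in even lanes, add in odd lanes. */
            out[2 * i + k] = (k & 1) ? re[k] * yv[k] + t[k]
                                     : re[k] * yv[k] - t[k];
        }
    }
}
```

As a quick check, multiplying {1+2i, i} by {3+4i, i} element-wise should give
{-5+10i, -1+0i}.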
