https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #10 from Peter Cordes <peter at cordes dot ca> ---
Current trunk with -fopenmp is still not good:
https://godbolt.org/z/b3jjhcvTa

It's still doing two separate sign extensions and two narrow stores followed
by a wider reload (causing a store-forwarding stall):

-O3 -march=skylake -fopenmp

simde_vaddlv_s8:
        push    rbp
        vpmovsxbw       xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw       xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        ...

followed by asm using byte-shifts, including stuff like

        movdqa  xmm1, xmm0
        psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.
And ARM64 goes scalar.

----

Current trunk *without* -fopenmp produces decent asm:
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3):

simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc -O3 -march=skylake:

simde_vaddlv_s8:
        vpmovsxbw       xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw       xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret

That's pretty good, but VMOVD eax, xmm0 would be more efficient than VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar
movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.

        ...
        vmovd   eax, xmm0
        rorx    edx, eax, 16
        add     eax, edx