https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #10 from Peter Cordes <peter at cordes dot ca> ---
Current trunk with -fopenmp is still not good:
https://godbolt.org/z/b3jjhcvTa

It's still doing two separate sign extensions and two narrow stores followed
by a wider reload (causing a store-forwarding stall):

-O3 -march=skylake -fopenmp

simde_vaddlv_s8:
        push    rbp
        vpmovsxbw       xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw       xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        ...

followed by asm using byte-shifts, including stuff like

        movdqa  xmm1, xmm0
        psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.
And ARM64 goes scalar.

----

Current trunk *without* -fopenmp produces decent asm:
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3):

simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc -O3 -march=skylake:

simde_vaddlv_s8:
        vpmovsxbw       xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw       xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret

That's pretty good, but VMOVD eax, xmm0 would be more efficient than VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar
movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.

        ...
        vmovd   eax, xmm0
        rorx    edx, eax, 16
        add     eax, edx