[Bug tree-optimization/115833] SLP of signed short multiply goes wrong

liuhongt at gcc dot gnu.org via Gcc-bugs Tue, 09 Jul 2024 01:47:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833


Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.  Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

void v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}


void v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}

v4hi_contruct(short*, short, short):
        movzx   eax, dx
        movzx   esi, si
        mov     rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rsi
        mov     QWORD PTR [rdi], rdx
        ret
v4si_contruct(int*, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, esi
        vpinsrd xmm1, xmm2, edx, 1
        vpinsrd xmm0, xmm3, edx, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret

Both vmovd and vpinsrd are expensive, so v4hi_contruct is not necessarily worse
than v4si_contruct. However, v4hi_contruct could be optimized to be a little
more parallel via GPRs.
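
As a rough illustration of what "a little more parallel via GPRs" could look
like at the source level, the sketch below builds the low and high 32-bit
halves independently and merges them with a single shift and OR, instead of
one serial chain of three shift/OR pairs. The function name
v4hi_construct_parallel is hypothetical, the code assumes a little-endian
target such as x86-64, and it is not a claim about what GCC does or should
emit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the bug report: same stores as
   v4hi_contruct, but with two independent dependency chains.
   Assumes a little-endian target (e.g. x86-64).  */
void v4hi_construct_parallel(short *t, short tt, short tt1)
{
  /* Zero-extend both 16-bit inputs, as the movzx instructions do.  */
  uint64_t a = (uint16_t) tt;
  uint64_t b = (uint16_t) tt1;

  /* These two halves do not depend on each other.  */
  uint64_t lo = a | (b << 16);   /* elements 0 and 1 */
  uint64_t hi = b | (b << 16);   /* elements 2 and 3 */

  /* One final shift/OR joins the halves.  */
  uint64_t v = lo | (hi << 32);

  /* Single 64-bit store, like the mov QWORD PTR [rdi], rdx.  */
  memcpy(t, &v, sizeof v);
}
```

The longest dependency chain here is shift, OR, shift, OR, compared with
three back-to-back shift/OR pairs in the assembly above. Whether that is a
measurable win depends on the target.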
