https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2017-09-12 00:00:00         |2021-08-11

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm not sure if it's "better" now.  While we merge the stores, we now do not
vectorize the loads at all.  Basically it's a trade-off between giving up
during discovery on the whole store group and building sub-chains from
scalars.  We do have heuristics in place that try to anticipate whether
splitting the group would succeed - if you change 'float' to 'double' we
take the splitting path, ending up with

        movupd  (%rsi), %xmm1
        unpcklpd        %xmm0, %xmm0
        divpd   %xmm0, %xmm1
        movups  %xmm1, (%rdi)
        movupd  (%rdx), %xmm1
        divpd   %xmm0, %xmm1
        movups  %xmm1, 16(%rdi)

which is quite optimal.  But for float we have

        movss   4(%rdx), %xmm1
        movss   (%rdx), %xmm2
        shufps  $0, %xmm0, %xmm0
        movss   4(%rsi), %xmm3
        unpcklps        %xmm1, %xmm2
        movss   (%rsi), %xmm1
        unpcklps        %xmm3, %xmm1
        movlhps %xmm2, %xmm1
        divps   %xmm0, %xmm1
        movups  %xmm1, (%rdi)

which might not be too bad - it avoids store-to-load-forwarding (STLF) issues
compared to doing two 8-byte loads and combining them.
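
For reference, a sketch of the kind of testcase being discussed, reconstructed
from the assembly above - function and parameter names are made up here and it
need not match the exact testcase attached to the PR:

/* Four-element store group fed by two separate two-element load chains.
   Reconstructed sketch, not necessarily the PR's testcase.  */
void
foo (float * __restrict p, float *q, float *r, float s)
{
  p[0] = q[0] / s;
  p[1] = q[1] / s;
  p[2] = r[0] / s;
  p[3] = r[1] / s;
}

Changing 'float' to 'double' in this sketch is the switch mentioned above that
makes us split the store group into two sub-chains (compiled at -O3 or
similar).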

The take-away is that SLP could be better at handling the case where it runs
into different interleaving chains (or into non-grouped loads).
