https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82189
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What            |Removed                     |Added
----------------------------------------------------------------------------
 Last reconfirmed          |2017-09-12 00:00:00         |2021-8-11

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm not sure if it's "better" now.  While we merge the store, we now do not
vectorize the load at all.  Basically it's a trade-off between giving up when
doing discovery on the whole store group or building sub-chains from scalars.
We do have heuristics in place that try to anticipate whether splitting the
group would succeed - if you change 'float' to 'double' we'll go the splitting
way, ending up with

        movupd  (%rsi), %xmm1
        unpcklpd        %xmm0, %xmm0
        divpd   %xmm0, %xmm1
        movups  %xmm1, (%rdi)
        movupd  (%rdx), %xmm1
        divpd   %xmm0, %xmm1
        movups  %xmm1, 16(%rdi)

which is quite optimal.  But for float we have

        movss   4(%rdx), %xmm1
        movss   (%rdx), %xmm2
        shufps  $0, %xmm0, %xmm0
        movss   4(%rsi), %xmm3
        unpcklps        %xmm1, %xmm2
        movss   (%rsi), %xmm1
        unpcklps        %xmm3, %xmm1
        movlhps %xmm2, %xmm1
        divps   %xmm0, %xmm1
        movups  %xmm1, (%rdi)

which might not be too bad - it avoids STLF issues compared to doing two
8-byte loads and combining them.

The take-away is that SLP could be better at handling the case of running into
different interleaving chains (or also into non-grouped loads).
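
(The reduced testcase is not quoted in this comment; the sketch below is only
an assumed reconstruction consistent with the asm above - function name,
parameter names and argument order are my guesses, not necessarily the
testcase from comment #0.  It shows the pattern under discussion: one
4-element store group whose lanes are fed from two separate load chains.)

        /* Hypothetical reduction: dst in %rdi, a in %rsi, b in %rdx,
           the divisor x in %xmm0.  The four stores form one group, but
           the loads come from two different chains (a[] and b[]).  */
        void f (float *dst, const float *a, const float *b, float x)
        {
          dst[0] = a[0] / x;
          dst[1] = a[1] / x;
          dst[2] = b[0] / x;
          dst[3] = b[1] / x;
        }

With 'double' each half of the store group fills its own V2DF vector, so
splitting the group succeeds; with 'float' all four lanes land in a single
V4SF vector and the loads from a[] and b[] end up being gathered element-wise.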