https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-checking today, we reject AVX vectorization via the cost model but do SSE vectorization. With versioning for alias we could also SLP vectorize this, keeping the loop body smaller and avoiding an epilogue, especially since we end up without any vector load or store anyway.

Of course SLP analysis requires a grouped store, which we do not have since we do not identify XPQKL(MPQ,MKL) and XPQKL(MRS,MKL) as such (they ain't with MPQ == MRS, but the runtime alias check ensures that's not the case). That is, we miss "strided group" detection or, in general, SLP forming via a different mechanism.

That said, I have a hard time thinking of a heuristic aligning with reality (it's of course possible to come up with a hack). Generally we'd need to work towards doing the versioning / cost model checks on outer loops, but the better versioning condition thing would be a prerequisite for this.

I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting to bogus state).
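To make the "versioning for alias" point concrete, here is a minimal C sketch of the idea, not the actual testcase and not what GCC emits. It assumes column-major storage, so XPQKL(i,k) maps to x[i + k * ld]. The names update_scalar, update_versioned, ld, v, a and b are hypothetical, the arithmetic is made up, and v is assumed not to overlap x. Once the runtime check has excluded MPQ == MRS, the two stores of one iteration hit distinct elements and could be treated as a two-lane strided group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop: two strided read-modify-write
   streams into the same array, in original program order. */
static void update_scalar(double *x, size_t ld, size_t mpq, size_t mrs,
                          const double *v, size_t n, double a, double b)
{
    for (size_t k = 0; k < n; ++k) {
        x[mpq + k * ld] += a * v[k];
        x[mrs + k * ld] += b * v[k];
    }
}

/* Versioned variant: a runtime alias check selects between a copy in
   which the two stores are known to be independent and the original
   scalar loop.  Assumes v does not overlap x. */
static void update_versioned(double *x, size_t ld, size_t mpq, size_t mrs,
                             const double *v, size_t n, double a, double b)
{
    assert(ld > 0 && mpq < ld && mrs < ld);

    if (mpq != mrs) {
        /* No-alias version: both loads may be hoisted above both
           stores, so the pair behaves like one group of two lanes. */
        for (size_t k = 0; k < n; ++k) {
            double lane0 = x[mpq + k * ld] + a * v[k];
            double lane1 = x[mrs + k * ld] + b * v[k];
            x[mpq + k * ld] = lane0;
            x[mrs + k * ld] = lane1;
        }
    } else {
        /* The stores alias: keep the original ordering. */
        update_scalar(x, ld, mpq, mrs, v, n, a, b);
    }
}
```

In the no-alias copy the two lanes are independent, which is the property SLP would need. Whether treating them as a group pays off is exactly the cost model question discussed here.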
Scalar inner loop assembly:

.L8:
        vmulsd  (%rax,%rdi,8), %xmm3, %xmm0
        incl    %ecx
        vfmadd231sd     (%rax), %xmm4, %xmm0
        vfmadd213sd     (%rdx), %xmm6, %xmm0
        vmovsd  %xmm0, (%rdx)
        vmulsd  (%rax,%r8,8), %xmm1, %xmm0
        vfmadd231sd     (%rax,%r10,8), %xmm2, %xmm0
        addq    %r15, %rax
        vfmadd213sd     (%rdx,%rsi,8), %xmm5, %xmm0
        vmovsd  %xmm0, (%rdx,%rsi,8)
        addq    %rbp, %rdx
        cmpl    %r9d, %ecx
        jne     .L8

Vectorized inner loop assembly:

.L9:
        vmovsd  (%r10,%rcx), %xmm13
        vmovsd  (%rdx), %xmm0
        incl    %r14d
        vmovhpd (%r10,%rsi), %xmm13, %xmm13
        vmovhpd (%rdx,%r13), %xmm0, %xmm14
        vmovsd  (%rdi,%rcx), %xmm0
        vmulpd  %xmm9, %xmm13, %xmm13
        vmovhpd (%rdi,%rsi), %xmm0, %xmm0
        vfmadd132pd     %xmm10, %xmm13, %xmm0
        vfmadd132pd     %xmm12, %xmm14, %xmm0
        vmovlpd %xmm0, (%rdx)
        vmovhpd %xmm0, (%rdx,%r13)
        vmovsd  (%r8,%rcx), %xmm13
        vmovsd  (%rax), %xmm0
        addq    %r11, %rdx
        vmovhpd (%r8,%rsi), %xmm13, %xmm13
        vmovhpd (%rax,%r13), %xmm0, %xmm14
        vmovsd  (%r9,%rcx), %xmm0
        addq    %rbx, %rcx
        vmulpd  %xmm7, %xmm13, %xmm13
        vmovhpd (%r9,%rsi), %xmm0, %xmm0
        addq    %rbx, %rsi
        vfmadd132pd     %xmm8, %xmm13, %xmm0
        vfmadd132pd     %xmm11, %xmm14, %xmm0
        vmovlpd %xmm0, (%rax)
        vmovhpd %xmm0, (%rax,%r13)
        addq    %r11, %rax
        cmpl    %r14d, %r15d
        jne     .L9

Only outer loop context and knowledge of the low trip count make this bad. The cost modeling doesn't know the scalar loop can execute as if vectorized given the CPU's plentiful resources (speculating non-dependence), whereas the vector variant introduces more constraints on pipelining due to data dependences from using vectors. But even IACA doesn't tell us the differences are big.