https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
--param vect-max-peeling-for-alignment=0 disables peeling for alignment (but
also makes the runtime profitability check trigger at 6 loop iterations
already).

I suspect gather has quite a high latency and the loop just doesn't have
enough work to compensate for that (given we have two gathers in the loop as
well).

We're also using

  vect__15.18_120 = VEC_PERM_EXPR <vect__11.16_116, vect__11.16_116, { 4, 5, 6, 7, 4, 5, 6, 7 }>;

for the index vector of the upper half of the gather, but the upper half of
that vector is likely ignored, and thus a representation with half of the
vector size and using a BIT_FIELD_REF would be more appropriate here.

.L10:
        vmovdqa (%r15,%rax), %ymm2
        vmovapd %ymm5, %ymm6
        vmovapd %ymm5, %ymm7
        addl    $1, %edi
        vgatherdpd      %ymm6, (%r9,%xmm2,8), %ymm3
        vperm2i128      $17, %ymm2, %ymm2, %ymm2
        vmovdqa %xmm2, %xmm4
        vgatherdpd      %ymm7, (%r9,%xmm4,8), %ymm2
        vmulpd  32(%r11,%rax,2), %ymm2, %ymm2
        vfmadd231pd     (%r11,%rax,2), %ymm3, %ymm2
        addq    $32, %rax
        vaddpd  %ymm2, %ymm0, %ymm0
        cmpl    %edi, %r14d
        ja      .L10

Eventually the x86 cost hook needs to consider overall instruction count to
properly penalize gather use. I suspect two xmm loads from %r15/%rax feeding
the two gathers would be easier to pipeline. The fma is likely also
pessimizing pipelining.
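
For illustration only, here is a portable C sketch of the shape I have in mind for the two-xmm-load variant. The scalar loop is my reading of the assembly above (an indexed dot product), not the testcase from the report, and all names (dot_indexed, dot_indexed_split, gather4, val, x, idx) are made up. gather4 merely models a 4-lane vgatherdpd; the point is that each gather gets its own 4-wide index load instead of a 256-bit load followed by a lane permute.

```c
#include <stddef.h>

/* Assumed scalar form of the loop: sum += val[i] * x[idx[i]]. */
double dot_indexed(const double *val, const double *x,
                   const int *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += val[i] * x[idx[i]];
    return sum;
}

/* Scalar model of one 4-lane gather: dst[k] = base[index[k]]. */
static void gather4(double dst[4], const double *base, const int index[4])
{
    for (int k = 0; k < 4; k++)
        dst[k] = base[index[k]];
}

/* Eight elements per iteration. The two index halves are read
   independently (modelling two xmm loads), so no permute of a
   256-bit index vector is needed to feed the second gather. */
double dot_indexed_split(const double *val, const double *x,
                         const int *idx, size_t n)
{
    double acc[4] = { 0.0, 0.0, 0.0, 0.0 };
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        double lo[4], hi[4];
        gather4(lo, x, idx + i);        /* indices i .. i+3   */
        gather4(hi, x, idx + i + 4);    /* indices i+4 .. i+7 */
        for (int k = 0; k < 4; k++)
            acc[k] += val[i + k] * lo[k] + val[i + 4 + k] * hi[k];
    }

    /* Horizontal reduction of the four lanes, then the scalar tail. */
    double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; i++)
        sum += val[i] * x[idx[i]];
    return sum;
}
```

The blocked version reassociates the floating-point sum, so it can differ from the scalar loop in the last bits for general inputs. Whether this form is actually faster on a given core is exactly the open question here; the sketch only shows the data flow.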