https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
--param vect-max-peeling-for-alignment=0 disables peeling for alignment (but
also makes the runtime profitability trigger at 6 loop iterations already).

I suspect gather has a quite high latency and the loop just doesn't have enough
work to compensate for that (given we have two gathers in the loop as well).
We're also using

  vect__15.18_120 = VEC_PERM_EXPR <vect__11.16_116, vect__11.16_116, { 4, 5, 6, 7, 4, 5, 6, 7 }>;

for the index vector of the upper half of the gather but the upper half of the
vector is likely ignored and thus a representation with half of the vector
size and using a BIT_FIELD_REF would be more appropriate here.
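A minimal scalar sketch of why the half-width form would be enough, assuming
the gather only consumes the low four lanes of its index operand. The helper
names (vec_perm8, extract_upper_half, upper_half_forms_agree) are hypothetical
and model the GIMPLE semantics in plain C; they are not GCC internals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of VEC_PERM_EXPR <a, b, sel> on 8-lane vectors: result lane i
   is lane sel[i] (taken modulo 16) of the 16-lane concatenation of a and b. */
static void vec_perm8(const int32_t a[8], const int32_t b[8],
                      const int sel[8], int32_t out[8])
{
    for (int i = 0; i < 8; i++) {
        int s = sel[i] & 15;
        out[i] = s < 8 ? a[s] : b[s - 8];
    }
}

/* Model of a BIT_FIELD_REF that reads the upper 128 bits (lanes 4..7). */
static void extract_upper_half(const int32_t v[8], int32_t out[4])
{
    memcpy(out, v + 4, 4 * sizeof v[0]);
}

/* Returns 1 when the low four lanes of the { 4, 5, 6, 7, 4, 5, 6, 7 } permute
   equal the directly extracted upper half, i.e. the lanes the gather is
   assumed to read are identical in both representations. */
static int upper_half_forms_agree(const int32_t idx[8])
{
    static const int sel[8] = { 4, 5, 6, 7, 4, 5, 6, 7 };
    int32_t permuted[8];
    int32_t half[4];

    assert(idx != NULL);
    vec_perm8(idx, idx, sel, permuted);
    extract_upper_half(idx, half);
    return memcmp(permuted, half, sizeof half) == 0;
}
```

Under that assumption the upper four lanes of the permuted vector are dead, so
the full-width permute only adds work ahead of the second gather.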

.L10:
        vmovdqa (%r15,%rax), %ymm2
        vmovapd %ymm5, %ymm6
        vmovapd %ymm5, %ymm7
        addl    $1, %edi
        vgatherdpd      %ymm6, (%r9,%xmm2,8), %ymm3
        vperm2i128      $17, %ymm2, %ymm2, %ymm2
        vmovdqa %xmm2, %xmm4
        vgatherdpd      %ymm7, (%r9,%xmm4,8), %ymm2
        vmulpd  32(%r11,%rax,2), %ymm2, %ymm2
        vfmadd231pd     (%r11,%rax,2), %ymm3, %ymm2
        addq    $32, %rax
        vaddpd  %ymm2, %ymm0, %ymm0
        cmpl    %edi, %r14d
        ja      .L10

Eventually the x86 cost hook needs to consider overall instruction count to
properly penalize gather use.  I suspect two xmm loads from %r15/%rax
feeding the two gathers would be easier to pipeline.  The fma is likely
also pessimizing pipelining.
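
For reference, the .L10 loop can be read as the scalar C model below. This is
a sketch inferred from the assembly alone, not the testcase from the bug: the
names gather_dot, val, idx and x are hypothetical, and the final horizontal
sum is an assumption because the reduction epilogue is not part of the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one vectorized pass: each iteration consumes eight 32-bit
   indices and eight doubles, as the 32-byte index load and the two 32-byte
   value operands in the listing suggest. */
static double gather_dot(const double *val, const int32_t *idx,
                         const double *x, size_t n)
{
    double acc[4] = { 0.0, 0.0, 0.0, 0.0 };   /* models the %ymm0 accumulator */

    assert(n % 8 == 0);                       /* the model has no scalar tail */
    for (size_t i = 0; i < n; i += 8) {
        for (int j = 0; j < 4; j++) {
            double lo = x[idx[i + j]];        /* first gather, low indices */
            double hi = x[idx[i + 4 + j]];    /* second gather, high indices */
            /* vmulpd on the high half, vfmadd231pd on the low half, vaddpd */
            acc[j] += val[i + j] * lo + val[i + 4 + j] * hi;
        }
    }
    /* Assumed reduction of the four accumulator lanes. */
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Each pass performs only eight multiply-adds against two gathers, which is the
"not enough work" situation described above.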
