https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
IIRC we have a duplicate for this.  The issue is the SLP vectorizer doesn't
handle reductions (not implemented) and thus the vector results need
to be decomposed for the scalar reduction tail.  On x86 with -mavx2 we get:

        vmovdqu (%rdi), %xmm0
        vpshufb .LC0(%rip), %xmm0, %xmm0
        vpmovzxbw       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm2
        vpsrldq $8, %xmm1, %xmm1
        vpmovzxbw       %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm1
        vmovaps %xmm2, -72(%rsp)
        movl    -68(%rsp), %eax
        vmovaps %xmm1, -56(%rsp)
        vpmovzxwd       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        addl    -52(%rsp), %eax
        vpmovzxwd       %xmm0, %xmm0
        vmovaps %xmm1, -40(%rsp)
        movl    -56(%rsp), %edx
        addl    -36(%rsp), %eax
        vmovaps %xmm0, -24(%rsp)
        addl    -72(%rsp), %edx
        addl    -20(%rsp), %eax
        addl    -40(%rsp), %edx
        addl    -24(%rsp), %edx
        addl    %edx, %eax
        movl    -48(%rsp), %edx
        addl    -64(%rsp), %edx
        addl    -32(%rsp), %edx
        addl    -16(%rsp), %edx
        addl    %edx, %eax
        movl    -44(%rsp), %edx
        addl    -60(%rsp), %edx
        addl    -28(%rsp), %edx
        addl    -12(%rsp), %edx
        addl    %edx, %eax
        ret

The main issue, of course, is that we fail to elide the stack temporary.
Re-running FRE after loop opts might help here, but SLP vectorization
handling the reduction itself would be best (though the tail loop is
structured badly and does not match up with the head one).

Whether vectorizing this specific testcase's head loop is profitable
is questionable on its own, of course (but you can easily make it
profitable and still get similarly ugly code in the tail).
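
For illustration only, here is a minimal C sketch of the kind of source
that leads to this head/tail pattern.  It is an assumption, not the
testcase from the PR (which is not quoted in this comment): the names
sum16 and sum16_direct are hypothetical, and the sketch does not try to
reproduce the permutation behind the vpshufb.  The head loop widens 16
bytes into a 4x4 int temporary, which maps to the four vmovaps stores,
and the tail loop sums that temporary column by column, which maps to
the scalar movl/addl chain reading back from the stack.

```c
/* Hypothetical sketch, not the PR's testcase. */

/* Head loop fills a stack temporary; tail loop reduces it in a
   different order (by column), so the two loops do not line up. */
int sum16(const unsigned char *p)
{
    int tmp[4][4];
    int sum = 0;

    /* Head: widen 4 bytes per iteration into one row of tmp. */
    for (int i = 0; i < 4; i++) {
        tmp[i][0] = p[0];
        tmp[i][1] = p[1];
        tmp[i][2] = p[2];
        tmp[i][3] = p[3];
        p += 4;
    }

    /* Tail: scalar reduction, one column of tmp per iteration. */
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];

    return sum;
}

/* Same result without any temporary, for comparison. */
int sum16_direct(const unsigned char *p)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += p[i];
    return sum;
}
```

Both functions return the same value for any 16-byte input, so the tmp
array is purely an intermediate.  Something like
"gcc -O3 -mavx2 -S" on such a file should allow comparing the generated
code for the two, though the exact output depends on the GCC version.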
