https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81673

--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> What happens if you use -march=intel.

With -mtune=intel, the lower half of each vector is moved directly
from a general-purpose register (movq), whereas the upper half still
goes through the stack (a store followed by a movhps load):

        .cfi_startproc
        leaq    -56(%rsp), %rsp
        .cfi_def_cfa_offset 64
        movq    %rdx, %xmm0
        movq    %rcx, (%rsp)
        leaq    16(%rsp), %rdi
        movq    %r9, 8(%rsp)
        movhps  (%rsp), %xmm0
        movdqa  %xmm0, 32(%rsp)
        movq    %r8, %xmm0
        movhps  8(%rsp), %xmm0
        movdqa  %xmm0, 16(%rsp)
        call    bar
        leaq    56(%rsp), %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

...so I guess this would still incur some penalty on the benchmark,
but I am not sure.
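
For comparison, here is a minimal C sketch of the two ways of building
a 128-bit vector from two 64-bit integers. It assumes x86-64 with SSE2
and is not taken from the bug's test case; the helper names pack_direct,
pack_via_memory and halves_match are made up for illustration. The
first helper keeps both halves in registers, the second expresses the
upper half as a memory load the way movhps does in the listing. The
compiler remains free to pick other instructions for either one.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE ones */
#include <stdint.h>

/* Illustrative helper: build {lo, hi} entirely in registers.
   Each half is moved from a GPR into an XMM register, then the two
   low quadwords are interleaved (typically movq, movq, punpcklqdq). */
static __m128i pack_direct(int64_t lo, int64_t hi)
{
    __m128i vlo = _mm_cvtsi64_si128(lo);
    __m128i vhi = _mm_cvtsi64_si128(hi);
    return _mm_unpacklo_epi64(vlo, vhi);
}

/* Illustrative helper: lower half moved directly, upper half loaded
   from a memory slot, mirroring the movq + movhps pair above. */
static __m128i pack_via_memory(int64_t lo, int64_t hi)
{
    int64_t slot = hi;  /* stands in for the stack slot */
    __m128 v = _mm_castsi128_ps(_mm_cvtsi64_si128(lo));
    v = _mm_loadh_pi(v, (const __m64 *)&slot);
    return _mm_castps_si128(v);
}

/* Returns nonzero when v holds lo in its low quadword and hi in its
   high quadword. */
static int halves_match(__m128i v, int64_t lo, int64_t hi)
{
    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0] == lo && out[1] == hi;
}

/* Both construction paths must produce the same vector contents. */
static void self_check(void)
{
    assert(halves_match(pack_direct(1, 2), 1, 2));
    assert(halves_match(pack_via_memory(1, 2), 1, 2));
}
```

Comparing the code generated for the two helpers under different
-mtune settings should show whether the stack round trip is chosen by
the tuning or forced by something else.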
