https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81673
--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> What happens if you use -march=intel.

With -mtune=intel, the lower half of the vector is moved directly whereas the
upper one is still done through the stack:

        .cfi_startproc
        leaq    -56(%rsp), %rsp
        .cfi_def_cfa_offset 64
        movq    %rdx, %xmm0
        movq    %rcx, (%rsp)
        leaq    16(%rsp), %rdi
        movq    %r9, 8(%rsp)
        movhps  (%rsp), %xmm0
        movdqa  %xmm0, 32(%rsp)
        movq    %r8, %xmm0
        movhps  8(%rsp), %xmm0
        movdqa  %xmm0, 16(%rsp)
        call    bar
        leaq    56(%rsp), %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

...so I guess this would still incur some penalty on the benchmark, but I am
not sure.
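
To spell out the data movement in the listing above, here is a rough C model
of one movq/movhps pair. It is only an illustrative sketch, not the test case
from this report: vec128, movq_from_gpr, movhps_from_mem and pack_pair are
made-up names, and the struct merely stands in for an XMM register.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit XMM register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128;

/* Models "movq %reg, %xmm0": the low lane comes straight from a
   general-purpose register and the high lane is zeroed. */
static vec128 movq_from_gpr(uint64_t lo)
{
    vec128 v = { { lo, 0 } };
    return v;
}

/* Models "movhps mem, %xmm0": the high lane is loaded from memory,
   the low lane is left unchanged. */
static vec128 movhps_from_mem(vec128 v, const uint64_t *mem)
{
    v.lane[1] = *mem;
    return v;
}

/* Models one pair from the listing: the low half is moved directly,
   the high half takes a round trip through a stack slot. */
static vec128 pack_pair(uint64_t lo, uint64_t hi)
{
    uint64_t spill = hi;              /* like "movq %rcx, (%rsp)" */
    vec128 v = movq_from_gpr(lo);     /* like "movq %rdx, %xmm0"  */
    return movhps_from_mem(v, &spill);/* like "movhps (%rsp), %xmm0" */
}
```

The store to spill followed immediately by the reload in movhps_from_mem is
the through-the-stack step that the paragraph above suspects may still cost
something.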