Consider the following function, which adds 1 to its argument using Intel intrinsics:

    #include <emmintrin.h>

    unsigned add1(unsigned x)
    {
        __m128i a = _mm_cvtsi32_si128(x);
        __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));
        return _mm_cvtsi128_si32(b);
    }

GCC goes through memory no less than three times: once when converting x
to a vector, once when converting 1 to a vector, and once when converting
the result back to an integer:

    add1:
            pxor    %xmm0, %xmm0
            movq    %rdi, -16(%rsp)
            movq    -16(%rsp), %xmm1
            movss   %xmm1, %xmm0
            paddd   .LC0(%rip), %xmm0
            movd    %xmm0, -4(%rsp)
            movl    -4(%rsp), %eax
            ret

For comparison, here is the code generated by the Intel compiler:

    add1:
            movl    $1, %edx
            movd    %edi, %xmm1
            movd    %edx, %xmm0
            paddd   %xmm0, %xmm1
            movd    %xmm1, %eax
            ret


-- 
           Summary: Converting between int and vector using intrinsics goes
                    through memory
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jch at pps dot jussieu dot fr
 GCC build triplet: x86_64-linux-gnu
  GCC host triplet: x86_64-linux-gnu
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38015
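
For anyone wanting to reproduce this, a minimal sketch along the following lines may help. It repeats add1 from the report with the integer conversions written out, plus a small checker so that the result can be verified separately from the generated code. The helper name add1_matches_scalar is hypothetical and not part of the report, and the report does not say which compiler options were used, so any command line is an assumption.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* add1 as in the report, with the int/unsigned conversions made explicit.
   Converting an unsigned value above INT_MAX to int is implementation-defined;
   GCC reduces it modulo 2^32, which is what this sketch relies on. */
unsigned add1(unsigned x)
{
    __m128i a = _mm_cvtsi32_si128((int)x);                    /* x into lane 0, upper lanes zeroed */
    __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));  /* add 1 to lane 0 only */
    return (unsigned)_mm_cvtsi128_si32(b);                    /* lane 0 back to a scalar */
}

/* Hypothetical helper (not from the report): returns 1 if the vector path
   agrees with plain unsigned arithmetic for x, 0 otherwise. */
int add1_matches_scalar(unsigned x)
{
    return add1(x) == x + 1u;
}
```

Compiling this with something like gcc -O2 -S (options assumed, not taken from the report) and reading the assembly for add1 should show whether the stack round trips are present. Calling add1_matches_scalar(0xffffffffu) exercises the wrap-around case.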