Consider the following functions (compiled with "g++-4.1.2 -msse3 -O3"):

#include <emmintrin.h>

__m128i int2vector(int i)        { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i)        { return _mm_cvtsi128_si32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i i) { return _mm_cvtsi128_si64x(i); }
They become:

_Z10int2vectori:
        movd    %edi, %xmm0
        ret
_Z10vector2intU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret
_Z11long2vectorx:
        movd    %rdi, %mm0
        movq    %rdi, -8(%rsp)
        movq2dq %mm0, %xmm0
        ret
_Z11vector2longU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret

long2vector() should use a simple MOVQ instruction the way int2vector() uses MOVD. The stack access apparently stems from the original code using a reg64->mem->mm->xmm path, which the optimizer only partly eliminated; gcc-4.3-20070617 leaves the full path in place.

Also, do the vector2<X>() functions really need to access the stack?

Finally, I've noticed several places where instructions involving 64-bit values use the "d/l" suffix (e.g. "long i = 0" ==> "xorl %eax, %eax"), or where 32-bit operations use 64-bit registers (e.g. "movd %xmm0, %rax" above). Are those generally features, bugs, or a "who cares?"

--
           Summary: _mm_cvtsi64x_si128() and _mm_cvtsi128_si64x() inefficient
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32708