Consider the following functions (compiled with "g++-4.1.2 -msse3 -O3"):
#include <emmintrin.h>
__m128i int2vector(int i) { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i) { return _mm_cvtsi128_si32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i i) { return _mm_cvtsi128_si64x(i); }

They become:

_Z10int2vectori:
        movd    %edi, %xmm0
        ret
_Z10vector2intU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret
_Z11long2vectorx:
        movd    %rdi, %mm0
        movq    %rdi, -8(%rsp)
        movq2dq %mm0, %xmm0
        ret
_Z11vector2longU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret

long2vector() should use a single MOVQ instruction, the same way int2vector()
uses a single MOVD. The stack access appears to be left over from the
unoptimized code, which moved the value along a reg64->mem->mm->xmm path; the
optimizer only partially eliminated that path, and gcc-4.3-20070617 leaves the
full path in place.
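
For reference, this is what I would expect long2vector() to produce (a
hand-written sketch of the ideal output, not actual compiler output):

_Z11long2vectorx:
        movq    %rdi, %xmm0     # direct GPR->XMM move: no MMX, no stack traffic
        ret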

Also, do the vector2<X>() functions really need to access the stack?
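
Assuming the values can stay in %xmm0, I would expect something like the
following (again a sketch of ideal output, not compiler output):

_Z10vector2intU8__vectorx:
        movd    %xmm0, %eax     # low 32 bits of the vector
        ret
_Z11vector2longU8__vectorx:
        movq    %xmm0, %rax     # low 64 bits of the vector
        ret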

Finally, I've noticed several places where instructions involving 64-bit
values use the 32-bit "d"/"l" suffix (e.g. "long i = 0" ==> "xorl %eax,
%eax"), and 32-bit mnemonics that take 64-bit registers (e.g. the "movd
%xmm0, %rax" above). Are those generally features, bugs, or a "who cares"?
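
For what it's worth, here is my reading of the two cases; the comments are my
own assumptions about the ISA and assembler, not documented gcc behavior:

        xorl    %eax, %eax      # 32-bit writes zero-extend on x86-64, so this
                                # also clears the upper half of %rax
        movd    %xmm0, %rax     # with a 64-bit register, gas emits the REX.W
                                # form, i.e. this is effectively a MOVQ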


-- 
           Summary: _mm_cvtsi64x_si128() and _mm_cvtsi128_si64x()
                    inefficient
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32708
