Consider the following function, which adds 1 to its argument using Intel
intrinsics:

  #include <emmintrin.h>

  unsigned
  add1(unsigned x)
  {
      __m128i a = _mm_cvtsi32_si128(x);
      __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));
      return _mm_cvtsi128_si32(b);
  }

GCC goes through memory no fewer than three times: once when converting x to a
vector, once when loading the constant 1 as a vector, and once when converting
the result back to an integer:

  add1:
        pxor    %xmm0, %xmm0
        movq    %rdi, -16(%rsp)
        movq    -16(%rsp), %xmm1
        movss   %xmm1, %xmm0
        paddd   .LC0(%rip), %xmm0
        movd    %xmm0, -4(%rsp)
        movl    -4(%rsp), %eax
        ret

For comparison, here is the code generated by the Intel compiler:

  add1:
        movl      $1, %edx
        movd      %edi, %xmm1
        movd      %edx, %xmm0
        paddd     %xmm0, %xmm1
        movd      %xmm1, %eax
        ret
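
For anyone reproducing the comparison, here is a minimal sketch of a
self-checking version of the test case. The helper check_add1 and its sample
values are hypothetical and not part of the original report; they only confirm
that both compilers' output computes x + 1, so that the difference is purely
one of code quality. The report does not state which options were used, so
something like "gcc -O2 -S" is an assumption.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <limits.h>

/* The function from the report, unchanged. */
unsigned
add1(unsigned x)
{
    __m128i a = _mm_cvtsi32_si128(x);
    __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));
    return _mm_cvtsi128_si32(b);
}

/* Hypothetical helper (not in the report): returns the number of sample
   inputs for which add1(x) differs from the scalar result x + 1. */
int
check_add1(void)
{
    static const unsigned samples[] = {
        0u, 1u, 41u, (unsigned)INT_MAX, UINT_MAX - 1u, UINT_MAX
    };
    int mismatches = 0;
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        /* Unsigned addition wraps, as does the 32-bit lane of paddd. */
        if (add1(samples[i]) != samples[i] + 1u)
            mismatches++;
    }
    return mismatches;
}
```

Calling check_add1() from a small main and asserting that it returns 0 is
enough to verify the behaviour; the generated assembly for add1 can then be
compared against the two listings above.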


-- 
           Summary: Converting between int and vector using intrinsics goes
                    through memory
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jch at pps dot jussieu dot fr
 GCC build triplet: x86_64-linux-gnu
  GCC host triplet: x86_64-linux-gnu
GCC target triplet: x86_64-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38015
