https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511
--- Comment #7 from Peter Cordes <peter at cordes dot ca> ---
I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1.
IDK if the cause is the same. (Code from an improvement to the horizontal_add
functions in Agner Fog's vector class library.)

#include <immintrin.h>
int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo   = _mm_cvtepi16_epi32(a);       // sign-extended a0, a1, a2, a3
    __m128i hi   = _mm_unpackhi_epi64(a,a);     // gcc4.9 through 5.3 wastes a movdqa on this
    hi           = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo,hi);        // add sign-extended upper/lower halves
    //return horizontal_add(sum1);              // manually inlined.
    // Shortening the code below can avoid the movdqa
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf,sum1);    // 2 sums
    shuf         = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf,sum2);
    return _mm_cvtsi128_si32(sum4);             // 32-bit sum
}

gcc4.9 through gcc5.3 output (-O3 -mtune=generic -msse4.1):

    movdqa     %xmm0, %xmm1
    pmovsxwd   %xmm0, %xmm2
    punpckhqdq %xmm0, %xmm1
    pmovsxwd   %xmm1, %xmm0
    paddd      %xmm2, %xmm0
    ...

gcc6.1 output:

    pmovsxwd   %xmm0, %xmm1
    punpckhqdq %xmm0, %xmm0
    pmovsxwd   %xmm0, %xmm0
    paddd      %xmm0, %xmm1
    ...

In a more complicated case, depending on whether this code is inlined, there's
actually a difference between gcc 4.9 and 5.x: gcc5 has the extra movdqa in
more cases. See my attachment, copied from https://godbolt.org/g/e8iQsj