https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511
--- Comment #7 from Peter Cordes <peter at cordes dot ca> ---
I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1.
IDK if the cause is the same. (Code from an improvement to the horizontal_add
functions in Agner Fog's vector class library.)

#include <immintrin.h>
int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo   = _mm_cvtepi16_epi32(a);       // sign-extended a0, a1, a2, a3
    __m128i hi   = _mm_unpackhi_epi64(a,a);     // gcc4.9 through 5.3 wastes a movdqa on this
    hi           = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo,hi);        // add sign-extended upper/lower halves
    //return horizontal_add(sum1);              // manually inlined.
    // Shortening the code below can avoid the movdqa
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf,sum1);    // 2 sums
    shuf         = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf,sum2);
    return _mm_cvtsi128_si32(sum4);             // 32-bit sum
}

gcc4.9 through gcc5.3 output (-O3 -mtune=generic -msse4.1):

    movdqa     %xmm0, %xmm1
    pmovsxwd   %xmm0, %xmm2
    punpckhqdq %xmm0, %xmm1
    pmovsxwd   %xmm1, %xmm0
    paddd      %xmm2, %xmm0
    ...

gcc6.1 output:

    pmovsxwd   %xmm0, %xmm1
    punpckhqdq %xmm0, %xmm0
    pmovsxwd   %xmm0, %xmm0
    paddd      %xmm0, %xmm1
    ...

In a more complicated case, depending on whether this code is inlined, there's
actually a difference between gcc 4.9 and 5.x: gcc5 has the extra movdqa in
more cases. See my attachment, copied from https://godbolt.org/g/e8iQsj