https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79938
--- Comment #2 from postmaster at raasu dot org ---
(In reply to Richard Biener from comment #1)
> The situation is slightly better with GCC 7, only two spill/loads are
> remaining.
> Possibly BIT_INSERT_EXPR helps here.

With gcc 6.2.0 and

gcc -msse4.1 -mtune=core2 -O3 -S hadd.c -Wall -Wextra -fno-strict-aliasing -fwrapv -o hadd.s

The resulting assembler output is almost perfect, but adding -mtune=core2 kinda
makes the code optimal only for Intel processors.

---
...
    pxor %xmm1, %xmm1
    movl $1, %edi
    movd %eax, %xmm0
    pshufb %xmm1, %xmm0
    pextrb $1, %xmm0, %edx
    pextrb $0, %xmm0, %eax
    addl %edx, %eax
    pextrb $2, %xmm0, %edx
    addl %edx, %eax
    pextrb $4, %xmm0, %ecx
    pextrb $3, %xmm0, %edx
    addl %eax, %edx
    pextrb $5, %xmm0, %eax
    addl %eax, %ecx
    pextrb $6, %xmm0, %eax
    addl %eax, %ecx
    pextrb $9, %xmm0, %esi
    pextrb $7, %xmm0, %eax
    addl %eax, %ecx
    pextrb $8, %xmm0, %eax
    addl %esi, %eax
    pextrb $10, %xmm0, %esi
    addl %esi, %eax
    pextrb $11, %xmm0, %esi
    addl %esi, %eax
    pextrb $13, %xmm0, %esi
    movd %eax, %xmm1
    pextrb $12, %xmm0, %eax
    addl %esi, %eax
    pextrb $14, %xmm0, %esi
    addl %eax, %esi
    pextrb $15, %xmm0, %eax
    movd %edx, %xmm0
    addl %esi, %eax
    pinsrd $1, %ecx, %xmm0
    movl $.LC0, %esi
    pinsrd $1, %eax, %xmm1
    xorl %eax, %eax
    punpcklqdq %xmm1, %xmm0
...