http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50829
Bug #: 50829
Summary: avx extra copy for _mm256_insertf128_pd
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: marc.gli...@normalesup.org
Target: x86_64-linux-gnu

With -Ofast -mavx (or -Os -mavx), this code:

#include <immintrin.h>

__m256d concat(__m128d x)
{
  __m256d z = _mm256_castpd128_pd256(x);
  return _mm256_insertf128_pd(z, x, 1);
}

is compiled (by a snapshot from Oct 10) to:

        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        vmovapd %xmm0, %xmm1
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-32, %rsp
        addq    $16, %rsp
        vinsertf128     $0x1, %xmm0, %ymm1, %ymm0
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

Apart from all the fun with stack manipulation, this boils down to:

        vmovapd %xmm0, %xmm1
        vinsertf128     $0x1, %xmm0, %ymm1, %ymm0

when it looks like this would be enough (and I tested it):

        vinsertf128     $0x1, %xmm0, %ymm0, %ymm0

I am not sure whether gcc thinks that vinsertf128 shouldn't use the same register for everything, or whether it doesn't realize that it doesn't need to zero the upper 128 bits of the ymm register before the insert. I understand that the AVX support is young, but avxintrin.h contains a comment saying that _mm256_castpd128_pd256 "shouldn't generate any extra moves". (I am not using broadcast because going through memory looks like a waste.)
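
For comparison, here is a minimal sketch of the broadcast alternative mentioned in the last sentence (the function name concat_broadcast is mine, not part of the original test case); it illustrates why it was rejected: _mm256_broadcast_pd takes a memory operand, so the xmm argument has to be spilled to the stack first.

#include <immintrin.h>

/* Sketch only: duplicates the 128-bit value into both lanes of a ymm
   register via VBROADCASTF128, which reads from memory, so x must be
   stored before it can be broadcast. */
__m256d concat_broadcast(__m128d x)
{
  return _mm256_broadcast_pd(&x);
}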