------- Comment #5 from jose dot r dot fonseca at gmail dot com 2007-08-07 14:01 -------

Note that this problem is actually more general. I ran into it while using a very common pattern in MMX/SSE2 programming: a union of a vector type and an array of integers:

    union I16x8 {
        __m128i m;
        short v[8];
    };

For example, this code:

    #include <emmintrin.h>

    union I16x8 {
        __m128i m;
        short v[8];
    };

    void test(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
    }

generates unnecessary copying in the body of the function:

    movl    8(%ebp), %edx
    movl    (%edx), %eax
    movl    %eax, -24(%ebp)
    movl    4(%edx), %eax
    movl    %eax, -20(%ebp)
    movl    8(%edx), %eax
    movl    %eax, -16(%ebp)
    movl    12(%edx), %eax
    movl    %eax, -12(%ebp)
    movdqa  -24(%ebp), %xmm0
    paddw   %xmm0, %xmm0
    movdqa  %xmm0, -40(%ebp)
    movl    -40(%ebp), %eax
    movl    %eax, (%edx)
    movl    -36(%ebp), %eax
    movl    %eax, 4(%edx)
    movl    -32(%ebp), %eax
    movl    %eax, 8(%edx)
    movl    -28(%ebp), %eax
    movl    %eax, 12(%edx)

Stranger still, removing the array member from the union, as follows:

    union I16x8 {
        __m128i m;
    };

also generates *exactly* the same redundant code:

    movl    8(%ebp), %edx
    movl    (%edx), %eax
    movl    %eax, -24(%ebp)
    movl    4(%edx), %eax
    movl    %eax, -20(%ebp)
    movl    8(%edx), %eax
    movl    %eax, -16(%ebp)
    movl    12(%edx), %eax
    movl    %eax, -12(%ebp)
    movdqa  -24(%ebp), %xmm0
    paddw   %xmm0, %xmm0
    movdqa  %xmm0, -40(%ebp)
    movl    -40(%ebp), %eax
    movl    %eax, (%edx)
    movl    -36(%ebp), %eax
    movl    %eax, 4(%edx)
    movl    -32(%ebp), %eax
    movl    %eax, 8(%edx)
    movl    -28(%ebp), %eax
    movl    %eax, 12(%edx)

However, overloading the assignment operator as:

    union I16x8 {
        __m128i m;
        short v[8];

        I16x8 & operator =(I16x8 &o) {
            m = o.m;
            return *this;
        }
    };

generates the right assembly code for the function above:

    movl    8(%ebp), %eax
    movdqa  (%eax), %xmm0
    paddw   %xmm0, %xmm0
    movdqa  %xmm0, (%eax)

Also strange is that a dummy structure, as follows:

    struct I16x8 {
        __m128i m;
    };

also generates the right code (exactly as above):

    movl    8(%ebp), %eax
    movdqa  (%eax), %xmm0
    paddw   %xmm0, %xmm0
    movdqa  %xmm0, (%eax)

The union of a vector type with an array of integers is an example used in almost every tutorial on the SIMD intrinsics out there.
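
For anyone who wants to try the workaround, here it is collected into one compilable unit. This is only a minimal sketch under a few assumptions: an x86 target with SSE2 enabled (e.g. -msse2 on 32-bit), an operator that takes its operand by const reference (my variation; the operator in the test case takes a non-const reference), and a hypothetical demo() helper that is not part of the original test case. Reading v after writing m relies on union type punning, which gcc documents as supported. Whether a particular gcc version then emits the short movdqa sequence should be checked with -S; the sketch only verifies the arithmetic.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_add_epi16

union I16x8 {
    __m128i m;
    short v[8];

    // Workaround: copy through the vector member, so the union is
    // assigned as a single 128-bit value.
    // Assumption: const reference instead of the non-const one in the report.
    I16x8 &operator=(const I16x8 &o) {
        m = o.m;
        return *this;
    }
};

// Same function as in the test case: doubles each 16-bit lane in place.
void test(I16x8 *p) {
    I16x8 a, c;
    a = *p;
    c.m = _mm_add_epi16(a.m, a.m);
    *p = c;
}

// Hypothetical helper: fills the lanes with 1..8 through the array member,
// runs test(), and checks that every lane was doubled.
void demo() {
    I16x8 x;
    for (int i = 0; i < 8; ++i)
        x.v[i] = static_cast<short>(i + 1);
    test(&x);
    for (int i = 0; i < 8; ++i)
        assert(x.v[i] == 2 * (i + 1));
}
```

Compiling this with something like "g++ -O2 -msse2 -S" and looking at the body of test() is the quickest way to see which of the two instruction sequences a given compiler produces.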

This bug caused gcc to perform poorly on my code compared with the Microsoft Visual C++ and Intel C++ compilers, but after working around it, gcc generated faster code than both.

-- 

jose dot r dot fonseca at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jose dot r dot fonseca at
                   |                            |gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29881