--- Comment #5 from jose dot r dot fonseca at gmail dot com  2007-08-07 14:01 ---
Note that this problem is actually more general. I ran into it while using a
very common pattern in MMX/SSE2 programming: a union of a vector type and an
array of integers:
union I16x8 {
        __m128i m;
        short v[8];
};
For example, this code:
#include <emmintrin.h>
union I16x8 {
        __m128i m;
        short v[8];
};
void test(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
}
generates unnecessary copying in the body of the function:
        movl    8(%ebp), %edx
        movl    (%edx), %eax
        movl    %eax, -24(%ebp)
        movl    4(%edx), %eax
        movl    %eax, -20(%ebp)
        movl    8(%edx), %eax
        movl    %eax, -16(%ebp)
        movl    12(%edx), %eax
        movl    %eax, -12(%ebp)
        movdqa  -24(%ebp), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    %eax, (%edx)
        movl    -36(%ebp), %eax
        movl    %eax, 4(%edx)
        movl    -32(%ebp), %eax
        movl    %eax, 8(%edx)
        movl    -28(%ebp), %eax
        movl    %eax, 12(%edx)
Stranger still, eliminating the array member of the union, as follows:
union I16x8 {
        __m128i m;
};
also generates *exactly* the same redundant code:
        movl    8(%ebp), %edx
        movl    (%edx), %eax
        movl    %eax, -24(%ebp)
        movl    4(%edx), %eax
        movl    %eax, -20(%ebp)
        movl    8(%edx), %eax
        movl    %eax, -16(%ebp)
        movl    12(%edx), %eax
        movl    %eax, -12(%ebp)
        movdqa  -24(%ebp), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    %eax, (%edx)
        movl    -36(%ebp), %eax
        movl    %eax, 4(%edx)
        movl    -32(%ebp), %eax
        movl    %eax, 8(%edx)
        movl    -28(%ebp), %eax
        movl    %eax, 12(%edx)
However, overloading the assignment operator so that it copies through the
vector member:
union I16x8 {
        __m128i m;
        short v[8];
        I16x8 operator =(I16x8 o) {
                m = o.m;
                return *this;
        }
};
generates the right assembly code for the function above:
        movl    8(%ebp), %eax
        movdqa  (%eax), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, (%eax)
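To make the workaround easy to try, here is a minimal self-contained sketch
that puts the pieces above into one file. The check_doubling helper and its
input values are hypothetical additions for illustration only, and reading v
after writing m relies on gcc's documented support for type punning through
unions.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128i, _mm_add_epi16)

union I16x8 {
        __m128i m;
        short v[8];
        // Workaround: copy through the vector member so the whole union
        // is moved as one 128-bit value instead of word by word.
        I16x8 operator =(I16x8 o) {
                m = o.m;
                return *this;
        }
};

// Same function as in the test case above: doubles each 16-bit lane.
void test(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
}

// Hypothetical helper (not part of the original test case): fills the
// lanes with 1..8, calls test(), and checks that every lane was doubled.
bool check_doubling() {
        I16x8 x;
        for (int i = 0; i < 8; ++i)
                x.v[i] = (short)(i + 1);
        test(&x);
        for (int i = 0; i < 8; ++i)
                if (x.v[i] != 2 * (i + 1))
                        return false;
        return true;
}
```

Compiling this with something like g++ -O2 -msse2 -S and looking at the body
of test should show whether the extra copies are gone on a given gcc version.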
It is also strange that a dummy structure such as:
struct I16x8 {
        __m128i m;
};
also generates the right code (exactly as above):
        movl    8(%ebp), %eax
        movdqa  (%eax), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, (%eax)
The union of a vector type with an array of integers is an example used in
almost every SIMD intrinsics tutorial out there. This bug was causing gcc to
perform poorly on my code compared with the Microsoft Visual C++ Compiler and
the Intel C++ Compiler, but after working around it, gcc generated faster code
than both.
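For reference, the tutorial-style use of this union usually looks something
like the sketch below: do the arithmetic on the vector member, then read
individual lanes back through the array member. The sum_lanes helper is a
hypothetical example of mine, not taken from any particular tutorial or from
the code that triggered this bug.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128i)

union I16x8 {
        __m128i m;
        short v[8];
};

// Hypothetical helper: horizontal sum of the eight signed 16-bit lanes,
// reading them back through the array member of the union.
int sum_lanes(__m128i x) {
        I16x8 u;
        u.m = x;
        int s = 0;
        for (int i = 0; i < 8; ++i)
                s += u.v[i];
        return s;
}
```

Copies of such a union into and out of local variables are where the
word-by-word moves shown above can come from.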
--
jose dot r dot fonseca at gmail dot com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jose dot r dot fonseca at
                   |                            |gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29881