[Bug middle-end/29881] union causes inefficient code

2007-08-07 Thread jose dot r dot fonseca at gmail dot com


--- Comment #5 from jose dot r dot fonseca at gmail dot com  2007-08-07 14:01 ---
Note that this problem is actually more general. I ran into it while using a
very common pattern in MMX/SSE2 programming, namely a union of a vector type
and an array of integers:

union I16x8 {
    __m128i m;
    short v[8];
};

For example, this code:

#include <emmintrin.h>

union I16x8 {
    __m128i m;
    short v[8];
};

void test(I16x8 *p) {
    I16x8 a, c;
    a = *p;
    c.m = _mm_add_epi16(a.m, a.m);
    *p = c;
}

Generates unnecessary copying in the body of the function:

movl    8(%ebp), %edx
movl    (%edx), %eax
movl    %eax, -24(%ebp)
movl    4(%edx), %eax
movl    %eax, -20(%ebp)
movl    8(%edx), %eax
movl    %eax, -16(%ebp)
movl    12(%edx), %eax
movl    %eax, -12(%ebp)
movdqa  -24(%ebp), %xmm0
paddw   %xmm0, %xmm0
movdqa  %xmm0, -40(%ebp)
movl    -40(%ebp), %eax
movl    %eax, (%edx)
movl    -36(%ebp), %eax
movl    %eax, 4(%edx)
movl    -32(%ebp), %eax
movl    %eax, 8(%edx)
movl    -28(%ebp), %eax
movl    %eax, 12(%edx)

Stranger still, eliminating the array member of the union, as follows:

union I16x8 {
    __m128i m;
};

Also generates *exactly* the same redundant code:

movl    8(%ebp), %edx
movl    (%edx), %eax
movl    %eax, -24(%ebp)
movl    4(%edx), %eax
movl    %eax, -20(%ebp)
movl    8(%edx), %eax
movl    %eax, -16(%ebp)
movl    12(%edx), %eax
movl    %eax, -12(%ebp)
movdqa  -24(%ebp), %xmm0
paddw   %xmm0, %xmm0
movdqa  %xmm0, -40(%ebp)
movl    -40(%ebp), %eax
movl    %eax, (%edx)
movl    -36(%ebp), %eax
movl    %eax, 4(%edx)
movl    -32(%ebp), %eax
movl    %eax, 8(%edx)
movl    -28(%ebp), %eax
movl    %eax, 12(%edx)

However, overloading the assignment operator as:

union I16x8 {
    __m128i m;
    short v[8];

    I16x8 & operator =(I16x8 o) {
        m = o.m;
        return *this;
    }
};

Generates the right assembly code for the function above:

movl    8(%ebp), %eax
movdqa  (%eax), %xmm0
paddw   %xmm0, %xmm0
movdqa  %xmm0, (%eax)

Equally strange, a dummy structure such as the following:

struct I16x8 {
    __m128i m;
};

Also generates the right code (exactly as above):

movl    8(%ebp), %eax
movdqa  (%eax), %xmm0
paddw   %xmm0, %xmm0
movdqa  %xmm0, (%eax)

The union of a vector type with an array of integers is an example used in
almost every tutorial on the SIMD intrinsics out there. This bug was causing
gcc to perform poorly on my code compared with the Microsoft Visual C++ and
Intel C++ compilers, but after working around it, gcc generated faster code
than both.
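
As a minimal sketch of the workaround described above, the following puts the
union, the user-defined assignment operator, and test() together in one
compilable unit. Two assumptions are worth flagging: the operator here takes
its argument by const reference, which may not be the exact signature used in
the comment, and reading v[] after writing m relies on GCC's documented
support for type punning through unions rather than on anything the C++
standard guarantees.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128i, _mm_add_epi16)

// Union of a vector type and an array of integers, plus the user-defined
// assignment operator used as a workaround.
// Assumption: const-reference parameter; the original signature may differ.
union I16x8 {
    __m128i m;
    short v[8];

    I16x8 &operator=(const I16x8 &o) {
        m = o.m;  // copy the whole 128-bit value through the vector member
        return *this;
    }
};

// Same function as in the report: doubles each 16-bit lane in place.
void test(I16x8 *p) {
    I16x8 a, c;
    a = *p;
    c.m = _mm_add_epi16(a.m, a.m);
    *p = c;
}
```

Calling test() on a value whose lanes are 1 to 8 should leave 2 to 16 in v[].
Whether a given gcc version actually emits the short four-instruction body
for this variant has to be checked with -S, as in the report.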


-- 

jose dot r dot fonseca at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jose dot r dot fonseca at
                   |                            |gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29881




--- Comment #6 from jose dot r dot fonseca at gmail dot com  2007-08-07 14:18 ---
Created an attachment (id=14031)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14031&action=view)
Example code

This is the source code for my example above. To get the assembly, run:

gcc -S -DCASE=0 -O3 -msse2 -o sse2-union-0.s sse2-union.cpp
gcc -S -DCASE=1 -O3 -msse2 -o sse2-union-1.s sse2-union.cpp
gcc -S -DCASE=2 -O3 -msse2 -o sse2-union-2.s sse2-union.cpp
gcc -S -DCASE=3 -O3 -msse2 -o sse2-union-3.s sse2-union.cpp

This was run on (gcc -v):

Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v
--enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.1.3 --program-suffix=-4.1
--enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug
--enable-mpfr --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.3 20070718 (prerelease) (Debian 4.1.2-14)

But I actually first discovered this in an unofficial build of gcc-4.2 for
MinGW.
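
The attachment itself is not reproduced in this thread. As a rough sketch of
how a single file could be driven by the -DCASE=n commands above, the four
type definitions from comment #5 can be selected with the preprocessor. The
mapping of CASE values to variants, the default of 0, and the const-reference
operator signature are all assumptions made for illustration, not the
contents of the actual attachment.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

#ifndef CASE
#define CASE 0  // assumed default; the commands above always pass -DCASE=n
#endif

// Assumed mapping of CASE values to the variants from comment #5.
#if CASE == 0
union I16x8 {            // vector plus array member
    __m128i m;
    short v[8];
};
#elif CASE == 1
union I16x8 {            // vector member only
    __m128i m;
};
#elif CASE == 2
union I16x8 {            // user-defined assignment operator
    __m128i m;
    short v[8];

    I16x8 &operator=(const I16x8 &o) {
        m = o.m;
        return *this;
    }
};
#else
struct I16x8 {           // plain struct wrapper
    __m128i m;
};
#endif

// Function under test, identical for every variant.
void test(I16x8 *p) {
    I16x8 a, c;
    a = *p;
    c.m = _mm_add_epi16(a.m, a.m);
    *p = c;
}
```

Keeping test() identical across the variants means any difference in the
generated assembly comes from the type definition alone.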

