------- Comment #5 from jose dot r dot fonseca at gmail dot com  2007-08-07 14:01 -------
Note that this problem is actually more general. I ran into it while using a
very common pattern in MMX/SSE2 programming: a union of a vector type and an
array of integers:

union I16x8 {
        __m128i m;
        short v[8];
};
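
To show why this union is so convenient, here is a minimal sketch of how it is
typically used: arithmetic goes through the vector member, while individual
lanes are read and written through the array member. The helpers double_lanes
and sum_lanes are hypothetical names used only for illustration. Reading v[]
after writing m relies on union type punning, which GCC documents as supported
but which ISO C++ does not guarantee.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Same union as above: one 128-bit vector viewed as eight 16-bit lanes.
union I16x8 {
        __m128i m;
        short v[8];
};

// Hypothetical helper: doubles every 16-bit lane of *p using the vector view.
void double_lanes(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
}

// Hypothetical helper: sums the eight lanes using the array view.
// Assumes GCC-style union type punning (see the note above).
int sum_lanes(const I16x8 &x) {
        int s = 0;
        for (int i = 0; i < 8; ++i)
                s += x.v[i];
        return s;
}
```

Filling v[] with 1..8, calling double_lanes, and then reading v[] back should
give 2, 4, ..., 16 on a compiler that supports this kind of punning.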

For example, this code:

#include <emmintrin.h>

union I16x8 {
        __m128i m;
        short v[8];
};

void test(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
}

Generates unnecessary copying in the body of the function:

        movl    8(%ebp), %edx
        movl    (%edx), %eax
        movl    %eax, -24(%ebp)
        movl    4(%edx), %eax
        movl    %eax, -20(%ebp)
        movl    8(%edx), %eax
        movl    %eax, -16(%ebp)
        movl    12(%edx), %eax
        movl    %eax, -12(%ebp)
        movdqa  -24(%ebp), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    %eax, (%edx)
        movl    -36(%ebp), %eax
        movl    %eax, 4(%edx)
        movl    -32(%ebp), %eax
        movl    %eax, 8(%edx)
        movl    -28(%ebp), %eax
        movl    %eax, 12(%edx)

Stranger still, removing the array member from the union, as follows:

union I16x8 {
        __m128i m;
};

Also generates *exactly* the same redundant code:

        movl    8(%ebp), %edx
        movl    (%edx), %eax
        movl    %eax, -24(%ebp)
        movl    4(%edx), %eax
        movl    %eax, -20(%ebp)
        movl    8(%edx), %eax
        movl    %eax, -16(%ebp)
        movl    12(%edx), %eax
        movl    %eax, -12(%ebp)
        movdqa  -24(%ebp), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, -40(%ebp)
        movl    -40(%ebp), %eax
        movl    %eax, (%edx)
        movl    -36(%ebp), %eax
        movl    %eax, 4(%edx)
        movl    -32(%ebp), %eax
        movl    %eax, 8(%edx)
        movl    -28(%ebp), %eax
        movl    %eax, 12(%edx)

However, overloading the assignment operator as follows:

union I16x8 {
        __m128i m;
        short v[8];

        I16x8 & operator =(I16x8 &o) {
                m = o.m;
                return *this;
        }
};

Generates the right assembly code for the function above:

        movl    8(%ebp), %eax
        movdqa  (%eax), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, (%eax)

Also strange is that a dummy structure like the following:

struct I16x8 {
        __m128i m;
};

Also generates the right code (exactly as above):

        movl    8(%ebp), %eax
        movdqa  (%eax), %xmm0
        paddw   %xmm0, %xmm0
        movdqa  %xmm0, (%eax)
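
The struct wrapper gives up the v[] view of the lanes. One possible way to get
lane access back without a union is to copy the lanes through explicit
unaligned loads and stores. This is only a sketch: load_lanes and store_lanes
are hypothetical helpers, not something from this report, and I have not
checked what assembly any particular gcc version emits for them, so inspect
the output of -S before relying on it.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// The dummy structure from above: a plain wrapper around the vector type.
struct I16x8 {
        __m128i m;
};

// Hypothetical helper: builds an I16x8 from eight shorts in memory.
// The unaligned load means `in` needs no particular alignment.
inline I16x8 load_lanes(const short in[8]) {
        I16x8 r;
        r.m = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
        return r;
}

// Hypothetical helper: copies the eight lanes out to a plain array.
inline void store_lanes(const I16x8 &x, short out[8]) {
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out), x.m);
}

// The function from the report, unchanged.
void test(I16x8 *p) {
        I16x8 a, c;
        a = *p;
        c.m = _mm_add_epi16(a.m, a.m);
        *p = c;
}
```

The design trade-off is that lane access becomes an explicit copy rather than
an in-place view, which also avoids depending on union type punning.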

A union of a vector type with an array of integers is an example used in almost
every SIMD intrinsics tutorial out there. This bug caused gcc to perform poorly
on my code compared with the Microsoft Visual C++ and Intel C++ compilers, but
after working around it, gcc generated faster code than both.


-- 

jose dot r dot fonseca at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jose dot r dot fonseca at
                   |                            |gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29881
