http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60826

            Bug ID: 60826
           Summary: inefficient code for vector xor on SSE2
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sunfish at mozilla dot com

On the following C testcase:

#include <stdint.h>

typedef double v2f64 __attribute__((__vector_size__(16), may_alias));
typedef int64_t v2i64 __attribute__((__vector_size__(16), may_alias));

static inline v2f64 f_and   (v2f64 l, v2f64 r) { return (v2f64)((v2i64)l & (v2i64)r); }
static inline v2f64 f_xor   (v2f64 l, v2f64 r) { return (v2f64)((v2i64)l ^ (v2i64)r); }
static inline double vector_to_scalar(v2f64 v) { return v[0]; }

double test(v2f64 w, v2f64 x, v2f64 z)
{
    v2f64 y = f_and(w, x);

    return vector_to_scalar(f_xor(z, y));
}

GCC emits this code:

    andpd    %xmm1, %xmm0
    movdqa    %xmm0, %xmm3
    pxor    %xmm2, %xmm3
    movdqa    %xmm3, -24(%rsp)
    movsd    -24(%rsp), %xmm0
    ret

GCC should produce the result of the xor in the return register directly instead of
spilling it to the stack and reloading it with movsd. It should also avoid the first
movdqa, which is an unnecessary copy.
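
For comparison, a minimal sketch of the desired sequence (assuming the standard
x86-64 SysV calling convention, so w, x and z arrive in %xmm0, %xmm1 and %xmm2 and
the scalar double result is returned in the low element of %xmm0) needs no stack
traffic and no extra register copy:

    andpd    %xmm1, %xmm0
    pxor    %xmm2, %xmm0
    ret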

Also, this should ideally use xorpd instead of pxor, to avoid a domain-crossing
penalty on Nehalem and other microarchitectures (or xorps if domain crossing
doesn't matter, since it's smaller).
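
With that substitution, the sequence sketched above would become simply:

    andpd    %xmm1, %xmm0
    xorpd    %xmm2, %xmm0
    ret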
