http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60826
Bug ID: 60826
Summary: inefficient code for vector xor on SSE2
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: sunfish at mozilla dot com

On the following C testcase:

#include <stdint.h>

typedef double v2f64 __attribute__((__vector_size__(16), may_alias));
typedef int64_t v2i64 __attribute__((__vector_size__(16), may_alias));

static inline v2f64 f_and(v2f64 l, v2f64 r) {
  return (v2f64)((v2i64)l & (v2i64)r);
}

static inline v2f64 f_xor(v2f64 l, v2f64 r) {
  return (v2f64)((v2i64)l ^ (v2i64)r);
}

static inline double vector_to_scalar(v2f64 v) {
  return v[0];
}

double test(v2f64 w, v2f64 x, v2f64 z) {
  v2f64 y = f_and(w, x);
  return vector_to_scalar(f_xor(z, y));
}

GCC emits this code:

        andpd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm3
        pxor    %xmm2, %xmm3
        movdqa  %xmm3, -24(%rsp)
        movsd   -24(%rsp), %xmm0
        ret

GCC should move the result of the xor into the return register directly,
instead of spilling it to the stack and reloading it. It should also avoid
the first movdqa, which is an unnecessary copy.

In addition, this should ideally use xorpd instead of pxor, to avoid a
domain-crossing penalty on Nehalem and other microarchitectures (or xorps
if domain crossing doesn't matter, since it's smaller).
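For comparison, here is a hand-written sketch of the code one might hope for
(assuming the x86-64 SysV calling convention, where w, x, and z arrive in
%xmm0-%xmm2 and the double result is returned in the low lane of %xmm0):

        andpd   %xmm1, %xmm0    # y = w & x, computed in place in %xmm0
        xorpd   %xmm2, %xmm0    # z ^ y; xor is commutative, so the result
                                # lands directly in the return register
        ret                     # v[0] is already the low lane of %xmm0

Two instructions instead of five, with no stack traffic, no extra copy, and
no domain crossing.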
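On the size point, for reference, the legacy (non-VEX) register-to-register
encodings of the three candidate instructions:

        pxor    %xmm2, %xmm0    # 66 0F EF C2  (4 bytes, integer domain)
        xorpd   %xmm2, %xmm0    # 66 0F 57 C2  (4 bytes, double FP domain)
        xorps   %xmm2, %xmm0    # 0F 57 C2     (3 bytes, single FP domain)

All three compute the same bitwise result; xorps omits the 66 operand-size
prefix, which is where the one-byte saving comes from.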