Hello.

The autovectorizer is enabled by default at -O3 in g++ 4.3 and does a fine job most of the time. Except it gets mightily pissed off if you dare to tweak the alignment, and after much experimentation I haven't yet figured out how to plug all the holes.

This silly example shows where things start to get ugly:

    # cat autovec.cc
    enum { N = 4, align_to = 16/sizeof(char) };
    typedef float scalar_type;

    // Plain array member, natural alignment.
    struct foo_t {
        scalar_type m[N];
        foo_t operator +(const foo_t &rhs) const {
            foo_t v(*this);
            for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
            return v;
        }
    };

    // Same thing, but the member is explicitly aligned to 16 bytes.
    struct bar_t {
        scalar_type __attribute__((aligned(sizeof(char)*align_to))) m[N];
        bar_t operator +(const bar_t &rhs) const {
            bar_t v(*this);
            for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
            return v;
        }
    };

    template<typename T>
    __attribute__((noinline)) void foobar(T &dst, const T *src) {
        T v = {{ 0 }};
        for (unsigned i=0; i<64; ++i) v = v + src[i];
        dst = v;
    }

    // The casts from argv are only there to keep the compiler from
    // folding everything away; this is not meant to be run.
    int main(int argc, char *argv[]) {
        foo_t *p((foo_t*) argv);
        bar_t *q((bar_t*) argv);
        foobar(*p, p + 1);
        foobar(*q, q + 1);
        return 0;
    }

    # g++ -O3 -march=native autovec.cc
    # g++ 4.3.1, x86_64

There's not much to say about foobar<foo_t>, and the addition in foobar<bar_t> gets somewhat vectorized, but the loop body looks like this:

    400620: 89 54 24 f4          mov    %edx,-0xc(%rsp)
    400624: 89 4c 24 f0          mov    %ecx,-0x10(%rsp)
    400628: 44 89 44 24 ec       mov    %r8d,-0x14(%rsp)
    40062d: 44 89 4c 24 e8       mov    %r9d,-0x18(%rsp)
    400632: 0f 28 c1             movaps %xmm1,%xmm0
    400635: 0f 12 04 06          movlps (%rsi,%rax,1),%xmm0
    400639: 0f 16 44 06 08       movhps 0x8(%rsi,%rax,1),%xmm0
    40063e: 48 83 c0 10          add    $0x10,%rax
    400642: 41 0f 58 02          addps  (%r10),%xmm0
    400646: 48 3d 00 04 00 00    cmp    $0x400,%rax
    40064c: 41 0f 29 02          movaps %xmm0,(%r10)
    400650: 8b 54 24 f4          mov    -0xc(%rsp),%edx
    400654: 8b 4c 24 f0          mov    -0x10(%rsp),%ecx
    400658: 44 8b 44 24 ec       mov    -0x14(%rsp),%r8d
    40065d: 44 8b 4c 24 e8       mov    -0x18(%rsp),%r9d
    400662: 75 bc                jne    400620 <void foobar<bar_t>(bar_t&, bar_t const*)+0x20>

As you can see, there's a lot of undue load/store traffic: four 32-bit values go out to the stack and back through general-purpose registers on every iteration, on top of the movaps store of the sum. And that's for a POD (or something really looking like one).

So you start fixing that with some looping copy ctor/operator (surely losing the POD property in the process) and so on. Doing that I can fix most reload issues, but stores are much more elusive (note that it depends on the underlying type and its natural alignment).

Ideally I'd like PODs to remain PODs, and synthesized ctors/operators to be efficient (i.e. not falling back to a GPR-based memcpy when everything is in an XMM register already), or at least to have a consistent way of writing such ctors/operators (and getting dead stores removed).

Briefly: how am I supposed to decorate my structures with larger alignment and not royally piss off the autovectorizer (and g++ in general)?
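
For reference, here is a minimal sketch of the "looping copy ctor/operator" kind of workaround mentioned above. It is not the exact code from my experiments: baz_t and sum64 are hypothetical names made up for illustration, and the default constructor is only there because the struct is no longer an aggregate once it has user-declared constructors. It only shows the shape of the change; whether it removes the reloads (let alone the stores) will depend on the compiler version and the underlying type.

```cpp
// Sketch only: baz_t is a hypothetical variant of bar_t with
// hand-written element-wise copies instead of the synthesized ones.
enum { N = 4, align_to = 16/sizeof(char) };
typedef float scalar_type;

struct baz_t {
    scalar_type __attribute__((aligned(sizeof(char)*align_to))) m[N];

    // User-declared constructors: baz_t is no longer a POD/aggregate,
    // so "T v = {{ 0 }}" does not work anymore and we zero explicitly.
    baz_t() {
        for (unsigned i=0; i<N; ++i) m[i] = 0;
    }
    // Element-wise copy, written as a loop the vectorizer can see.
    baz_t(const baz_t &rhs) {
        for (unsigned i=0; i<N; ++i) m[i] = rhs.m[i];
    }
    // Element-wise assignment, same idea.
    baz_t &operator =(const baz_t &rhs) {
        for (unsigned i=0; i<N; ++i) m[i] = rhs.m[i];
        return *this;
    }
    baz_t operator +(const baz_t &rhs) const {
        baz_t v(*this);
        for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
        return v;
    }
};

// Hypothetical stand-in for foobar<T>: accumulates 64 elements.
__attribute__((noinline)) baz_t sum64(const baz_t *src) {
    baz_t v;
    for (unsigned i=0; i<64; ++i) v = v + src[i];
    return v;
}
```

Build it with the same flags as above (g++ -O3 -march=native) and compare the loop in sum64 against the foobar<bar_t> listing with objdump -d. The price is exactly the one I complained about: the struct stops being a POD.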