http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46006
Summary: vectorization outside of loops
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ada
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: ja...@gcc.gnu.org
CC: i...@gcc.gnu.org

Are there any plans to try to vectorize parts of code like:

struct A { double x, y, z; };
struct B { struct A a, b; };
struct C { struct A c; double d; };

__attribute__((noinline, noclone)) int
foo (const struct C *u, struct B v)
{
  double a, b, c, d;
  a = v.b.x * v.b.x + v.b.y * v.b.y + v.b.z * v.b.z;
  b = 2.0 * v.b.x * (v.a.x - u->c.x)
      + 2.0 * v.b.y * (v.a.y - u->c.y)
      + 2.0 * v.b.z * (v.a.z - u->c.z);
  c = u->c.x * u->c.x + u->c.y * u->c.y + u->c.z * u->c.z
      + v.a.x * v.a.x + v.a.y * v.a.y + v.a.z * v.a.z
      + 2.0 * (-u->c.x * v.a.x - u->c.y * v.a.y - u->c.z * v.a.z)
      - u->d * u->d;
  if ((d = b * b - 4.0 * a * c) < 0.0)
    return 0;
  return d;
}

int
main (void)
{
  int i, j;
  struct C c = { { 1.0, 1.0, 1.0 }, 1.0 };
  struct B b = { { 1.0, 1.0, 1.0 }, { 1.0, 1.0, 1.0 } };
  for (i = 0; i < 100000000; i++)
    {
      asm volatile ("" : : "r" (&c), "r" (&b) : "memory");
      j = foo (&c, b);
      asm volatile ("" : : "r" (j));
    }
  return 0;
}

(this is the hot spot from c-ray benchmark, the function is actually larger
but at least according to callgrind in most cases the early return on < 0.0
happens; as the function is large and called from multiple spots, it isn't
inlined).

I'd say (though, haven't tried to code it by hand using intrinsics) that by
doing many of the multiplications/additions in parallel (especially for AVX)
there could be significant speedups (-O3 -ffast-math).
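
For illustration only, here is a rough sketch of what a hand-vectorized form of the straight-line part of foo might look like. It is not what GCC emits and has not been benchmarked. It uses GCC's generic vector extension (vector_size) instead of AVX intrinsics so that it compiles without target flags; the names foo_vec, v4df, dir, org, cen and two are made up for this sketch. Each three-component struct is loaded into a 4-lane double vector with the fourth lane padded with zero, the products are formed lane by lane, and the lanes are summed at the end. Summing lanes changes the order of the additions relative to the scalar code, which is only acceptable under something like -ffast-math.

```c
struct A { double x, y, z; };
struct B { struct A a, b; };
struct C { struct A c; double d; };

/* Four doubles per vector (256 bits, i.e. one AVX register when available). */
typedef double v4df __attribute__ ((vector_size (32)));

/* Hypothetical hand-vectorized variant of foo from the report. */
int
foo_vec (const struct C *u, struct B v)
{
  /* Pad each 3-component struct to 4 lanes; lane 3 is zero so it
     contributes nothing to the horizontal sums below.  */
  v4df dir = { v.b.x, v.b.y, v.b.z, 0.0 };
  v4df org = { v.a.x, v.a.y, v.a.z, 0.0 };
  v4df cen = { u->c.x, u->c.y, u->c.z, 0.0 };
  v4df two = { 2.0, 2.0, 2.0, 2.0 };

  /* Lane-wise terms of the three sums in the scalar code.  */
  v4df ta = dir * dir;
  v4df tb = two * dir * (org - cen);
  v4df tc = cen * cen + org * org - two * cen * org;

  /* Horizontal reductions (reassociates the scalar additions).  */
  double a = ta[0] + ta[1] + ta[2] + ta[3];
  double b = tb[0] + tb[1] + tb[2] + tb[3];
  double c = tc[0] + tc[1] + tc[2] + tc[3] - u->d * u->d;

  double d = b * b - 4.0 * a * c;
  if (d < 0.0)
    return 0;
  return (int) d;   /* same double-to-int truncation as the original */
}
```

With the inputs from main above (all components 1.0), this gives a = 3, b = 0, c = -1 and therefore d = 12, the same as the scalar foo. Whether the padding, the horizontal sums and the struct-to-vector loads leave any net gain is exactly the open question of this report.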