http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46006

           Summary: vectorization outside of loops
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ada
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: ja...@gcc.gnu.org
                CC: i...@gcc.gnu.org


Are there any plans to try to vectorize parts of code like:
struct A
{
  double x, y, z;
};

struct B
{
  struct A a, b;
};

struct C
{
  struct A c;
  double d;
};

__attribute__((noinline, noclone)) int
foo (const struct C *u, struct B v)
{
  double a, b, c, d;

  a = v.b.x * v.b.x + v.b.y * v.b.y + v.b.z * v.b.z;
  b = 2.0 * v.b.x * (v.a.x - u->c.x)
      + 2.0 * v.b.y * (v.a.y - u->c.y) + 2.0 * v.b.z * (v.a.z - u->c.z);
  c = u->c.x * u->c.x + u->c.y * u->c.y + u->c.z * u->c.z
      + v.a.x * v.a.x + v.a.y * v.a.y + v.a.z * v.a.z
      + 2.0 * (-u->c.x * v.a.x - u->c.y * v.a.y - u->c.z * v.a.z)
      - u->d * u->d;
  if ((d = b * b - 4.0 * a * c) < 0.0)
    return 0;
  return d;
}

int
main (void)
{
  int i, j;
  struct C c = { { 1.0, 1.0, 1.0 }, 1.0 };
  struct B b = { { 1.0, 1.0, 1.0 }, { 1.0, 1.0, 1.0 } };
  for (i = 0; i < 100000000; i++)
    {
      asm volatile ("" : : "r" (&c), "r" (&b) : "memory");
      j = foo (&c, b);
      asm volatile ("" : : "r" (j));
    }
  return 0;
}
(this is the hot spot from c-ray benchmark, the function is actually larger but
at least according to callgrind in most cases the early return on < 0.0
happens;
as the function is large and called from multiple spots, it isn't inlined).
I'd say (though, haven't tried to code it by hand using intrinsics) that by
doing many of the multiplications/additions in parallel (especially for AVX)
there could be significant speedups (-O3 -ffast-math).

Reply via email to