http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50728

             Bug #: 50728
           Summary: Inefficient vector loads from aggregates passed by
                    value
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: rgue...@gcc.gnu.org
            Target: x86_64-*-*


For

typedef float Value;
struct A {
  Value a[4];
} __attribute__ ((aligned(16)));

A sum(A a, A b)
{
  a.a[0]+=b.a[0];
  a.a[1]+=b.a[1];
  a.a[2]+=b.a[2];
  a.a[3]+=b.a[3];
  return a;
}

GCC generates horribly inefficient code at -O3, because of the way the x86_64
ABI passes A by value, once the vectorizer emits vector loads/stores from/to
a and b.

Initial RTL expansion for the load from a is

(insn 2 11 3 2 (set (reg:DI 64)
        (reg:DI 21 xmm0 [ a ])) t.C:7 -1
     (nil))

(insn 3 2 6 2 (set (reg:DI 65)
        (reg:DI 22 xmm1 [ a+8 ])) t.C:7 -1
     (nil))

(insn 4 7 5 2 (set (mem/s/c:DI (plus:DI (reg/f:DI 54 virtual-stack-vars)
                (const_int -32 [0xffffffffffffffe0])) [2 a+0 S8 A128])
        (reg:DI 64)) t.C:7 -1
     (nil))

(insn 5 4 8 2 (set (mem/s/c:DI (plus:DI (reg/f:DI 54 virtual-stack-vars)
                (const_int -24 [0xffffffffffffffe8])) [2 a+8 S8 A64])
        (reg:DI 65)) t.C:7 -1
     (nil))

(insn 14 13 15 3 (parallel [
            (set (reg:DI 69)
                (plus:DI (reg/f:DI 54 virtual-stack-vars)
                    (const_int -32 [0xffffffffffffffe0])))
            (clobber (reg:CC 17 flags))
        ]) t.C:8 -1
     (nil))

(insn 15 14 16 3 (set (reg:V4SF 71)
        (mem/c:V4SF (reg:DI 69) [2 MEM[(struct A *)&a]+0 S16 A128])) t.C:8 -1
     (nil))

so the value is forced first through general regs and then through memory,
instead of being assembled with a simple sequence of mov[hl]ps.
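
As a rough illustration of the register-only sequence meant here, the
sketch below models each incoming argument register as an __m128 whose low
two lanes hold half of the struct (as the RTL above shows for xmm0 and
xmm1) and joins the halves with _mm_movelh_ps, which normally maps to
movlhps. The helper names join_halves and sum_from_halves are hypothetical
and exist only for this example; this is not what GCC emits, only what the
desired data flow could look like under those assumptions.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86/x86_64 targets only

typedef float Value;
struct A {
  Value a[4];
} __attribute__ ((aligned(16)));

// Hypothetical helper: rebuild one 4-float vector from two register
// halves. Only the low two lanes of lo and hi are assumed meaningful.
static inline __m128 join_halves(__m128 lo, __m128 hi)
{
  // Result lanes are { lo[0], lo[1], hi[0], hi[1] }.
  return _mm_movelh_ps(lo, hi);
}

// Hypothetical stand-in for sum(): the operands arrive as register
// halves and never take a detour through general regs or the stack
// before the vector add.
A sum_from_halves(__m128 a_lo, __m128 a_hi, __m128 b_lo, __m128 b_hi)
{
  __m128 va = join_halves(a_lo, a_hi);
  __m128 vb = join_halves(b_lo, b_hi);
  A r;
  _mm_store_ps(r.a, _mm_add_ps(va, vb));  // aligned store; A is 16-byte aligned
  return r;
}
```

Calling sum_from_halves with halves built by _mm_setr_ps (for example
{1, 2, 0, 0} and {3, 4, 0, 0} for a) gives the same element-wise sums as
sum() in the test case.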

As this is probably hard to fix at RTL expansion time, something later
should be able to fix this up. And as this is all memory, the only
candidate seems to be (g)cse.
