https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116312

            Bug ID: 116312
           Summary: Use LDP instead of LD2 for Advanced SIMD when
                    possible
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

The testcase:
#define N 32000
float a[N],b[N],c[N];

void foo(struct args_t * func_args)
{
    for (int i = 0; i < N; i+=2) {
        c[i] = a[i] + b[i];
    }
}

Compiled with, for example, -O3 -mcpu=neoverse-v2, this generates the loop:
.L2:
        ld2     {v29.4s - v30.4s}, [x4], 32
        mov     x3, x0
        add     x2, x0, 16
        add     x1, x0, 24
        add     x0, x0, 32
        ld2     {v30.4s - v31.4s}, [x5], 32
        fadd    v30.4s, v30.4s, v29.4s
        str     s30, [x3], 8
        st1     {v30.s}[1], [x3]
        st1     {v30.s}[2], [x2]
        st1     {v30.s}[3], [x1]
        cmp     x4, x6
        bne     .L2

LLVM generates:
.LBB0_1:
        ldp     q0, q1, [x9], #32
        ldp     q2, q3, [x11], #32
        sub     x12, x8, #8
        add     x13, x8, #8
        subs    x10, x10, #4
        fadd    v1.4s, v1.4s, v3.4s
        fadd    v0.4s, v0.4s, v2.4s
        st1     { v1.s }[2], [x13]
        stur    s0, [x8, #-16]
        str     s1, [x8], #32
        st1     { v0.s }[2], [x12]
        b.ne    .LBB0_1

LDP is known to be more efficient than LD2. For example, from the Neoverse V2
SWOG the throughput/latency of the LD2 here is 8 and 3/2, whereas for the
equivalent LDP it's 6 and 3/2.
It's not a huge improvement to be honest, but we could implement it as a simple
final assembly output template change with minimal invasiveness.
Though if we wanted to take advantage of the wider post-index immediates
available to LDP, we could make it more elaborate.