https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116312
Bug ID: 116312
Summary: Use LDP instead of LD2 for Advanced SIMD when possible
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

The testcase:

#define N 32000
float a[N], b[N], c[N];

struct args_t;

void foo(struct args_t *func_args)
{
    for (int i = 0; i < N; i += 2)
    {
        c[i] = a[i] + b[i];
    }
}

compiled with, for example, -O3 -mcpu=neoverse-v2, generates the loop:

.L2:
        ld2     {v29.4s - v30.4s}, [x4], 32
        mov     x3, x0
        add     x2, x0, 16
        add     x1, x0, 24
        add     x0, x0, 32
        ld2     {v30.4s - v31.4s}, [x5], 32
        fadd    v30.4s, v30.4s, v29.4s
        str     s30, [x3], 8
        st1     {v30.s}[1], [x3]
        st1     {v30.s}[2], [x2]
        st1     {v30.s}[3], [x1]
        cmp     x4, x6
        bne     .L2

LLVM generates:

.LBB0_1:
        ldp     q0, q1, [x9], #32
        ldp     q2, q3, [x11], #32
        sub     x12, x8, #8
        add     x13, x8, #8
        subs    x10, x10, #4
        fadd    v1.4s, v1.4s, v3.4s
        fadd    v0.4s, v0.4s, v2.4s
        st1     { v1.s }[2], [x13]
        stur    s0, [x8, #-16]
        str     s1, [x8], #32
        st1     { v0.s }[2], [x12]
        b.ne    .LBB0_1

LDP is known to be more efficient than LD2. For example, per the Neoverse V2 SWOG, the latency/throughput of the LD2 here are 8 and 3/2, while for the equivalent LDP they are 6 and 3/2. It is not a huge improvement, to be honest, but we could implement it as a simple change to the final assembly output template, keeping the patch minimally invasive. Though if we wanted to take advantage of the wider post-index immediates available to LDP, we could make it more elaborate.
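To make the equivalence of the two codegen strategies concrete, here is a portable scalar C sketch (no intrinsics; function names are illustrative, not part of the testcase). The first loop models what LD2 does semantically: deinterleave the 2-strided stream into even-lane vectors before adding. The second models the LDP approach seen in LLVM's output: load two contiguous 4-lane blocks, do full-width adds, and store only the even-lane sums, discarding the odd ones.

```c
#include <stddef.h>

#define N 32000
static float a[N], b[N], c_ld2[N], c_ldp[N];

/* LD2-style: deinterleave even lanes of a/b, add, scatter results
   back to the even indices. Models the semantics of the ld2 loop. */
static void foo_ld2_style(void)
{
    for (size_t i = 0; i < N; i += 8) {
        float ae[4], be[4];          /* even lanes after deinterleave */
        for (int l = 0; l < 4; l++) {
            ae[l] = a[i + 2 * (size_t)l];
            be[l] = b[i + 2 * (size_t)l];
        }
        for (int l = 0; l < 4; l++)
            c_ld2[i + 2 * (size_t)l] = ae[l] + be[l];
    }
}

/* LDP-style: contiguous 8-lane loads and full-width adds (like the
   two fadd v.4s), then lane stores of the even results only; the
   odd-lane sums are computed but never stored. */
static void foo_ldp_style(void)
{
    for (size_t i = 0; i < N; i += 8) {
        float sum[8];
        for (int l = 0; l < 8; l++)
            sum[l] = a[i + (size_t)l] + b[i + (size_t)l];
        for (int l = 0; l < 8; l += 2)
            c_ldp[i + (size_t)l] = sum[l];
    }
}
```

Both routines write identical values to the even indices, which is why the strided loop can legally be lowered with plain contiguous LDPs plus lane stores instead of a deinterleaving LD2.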