https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95967
            Bug ID: 95967
           Summary: Poor aarch64 vector constructor code when using arm_neon.h
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
        Depends on: 95962
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

We generate poor code for the attached functions:

f1:
        movi    v4.4s, 0
        ins     v4.s[0], v0.s[0]
        ins     v4.s[1], v1.s[0]
        ins     v4.s[2], v2.s[0]
        mov     v0.16b, v4.16b
        ins     v0.s[3], v3.s[0]
        ret
f2:
        dup     v0.4s, v0.s[0]
        ins     v0.s[1], v1.s[0]
        ins     v0.s[2], v2.s[0]
        ins     v0.s[3], v3.s[0]
        ret
f3:
        sub     sp, sp, #16
        stp     s0, s1, [sp]
        stp     s2, s3, [sp, 8]
        ldr     q0, [sp]
        add     sp, sp, 16
        ret
g1:
        movi    v0.4s, 0
        ld1     {v0.s}[0], [x0]
        ld1     {v0.s}[1], [x1]
        ld1     {v0.s}[2], [x2]
        ld1     {v0.s}[3], [x3]
        ret
g2:
        ld1r    {v0.4s}, [x0]
        ld1     {v0.s}[1], [x1]
        ld1     {v0.s}[2], [x2]
        ld1     {v0.s}[3], [x3]
        ret
g3:
        sub     sp, sp, #16
        ldr     s0, [x3]
        ldr     s3, [x0]
        ldr     s2, [x1]
        ldr     s1, [x2]
        stp     s3, s2, [sp]
        stp     s1, s0, [sp, 8]
        ldr     q0, [sp]
        add     sp, sp, 16
        ret

All three f functions should generate:

        mov     v0.s[1], v1.s[0]
        mov     v0.s[2], v2.s[0]
        mov     v0.s[3], v3.s[0]
        ret

and all three g functions should generate:

        ldr     s0, [x0]
        ld1     { v0.s }[1], [x1]
        ld1     { v0.s }[2], [x2]
        ld1     { v0.s }[3], [x3]
        ret

which is what current Clang does.

Getting the right code for f3 and g3 depends on the fix for PR95962.
There's a reasonable chance that PR95962 will be enough on its own
to fix f3 and g3, but I included them just in case it isn't.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
[Bug 95962] Inefficient code for simple arm_neon.h iota operation
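
The attached functions are not reproduced in this message. Judging only from the assembly in the report (zero-initialise then insert lanes for f1, dup then insert lanes for f2, store to the stack then load for f3), the f variants might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual attachment: the element type (float32_t), the parameter names, and the choice of intrinsics are all assumptions. The `#else` branch only provides stand-ins so that the sketch also builds on non-AArch64 hosts; it is not part of arm_neon.h.

```c
/* Hypothetical reconstruction of f1-f3; NOT the actual attachment. */
#if defined(__aarch64__)
#include <arm_neon.h>
#else
/* Stand-ins built on GCC/Clang vector extensions so the sketch also
   compiles on non-AArch64 hosts. They mimic, but are not, the real
   arm_neon.h intrinsics. */
typedef float float32_t;
typedef float float32x4_t __attribute__((vector_size(16)));
static inline float32x4_t vdupq_n_f32(float32_t x)
{ return (float32x4_t){x, x, x, x}; }
static inline float32x4_t vsetq_lane_f32(float32_t x, float32x4_t v, int lane)
{ v[lane] = x; return v; }
static inline float32_t vgetq_lane_f32(float32x4_t v, int lane)
{ return v[lane]; }
static inline float32x4_t vld1q_f32(const float32_t *p)
{ return (float32x4_t){p[0], p[1], p[2], p[3]}; }
#endif

/* f1 (assumed): start from an all-zero vector, then set every lane. */
float32x4_t f1(float32_t a, float32_t b, float32_t c, float32_t d)
{
    float32x4_t v = vdupq_n_f32(0.0f);
    v = vsetq_lane_f32(a, v, 0);
    v = vsetq_lane_f32(b, v, 1);
    v = vsetq_lane_f32(c, v, 2);
    v = vsetq_lane_f32(d, v, 3);
    return v;
}

/* f2 (assumed): broadcast the first element, then set lanes 1-3. */
float32x4_t f2(float32_t a, float32_t b, float32_t c, float32_t d)
{
    float32x4_t v = vdupq_n_f32(a);
    v = vsetq_lane_f32(b, v, 1);
    v = vsetq_lane_f32(c, v, 2);
    v = vsetq_lane_f32(d, v, 3);
    return v;
}

/* f3 (assumed): build a temporary array and load it as one vector. */
float32x4_t f3(float32_t a, float32_t b, float32_t c, float32_t d)
{
    float32_t vals[4] = {a, b, c, d};
    return vld1q_f32(vals);
}
```

All three functions return the vector {a, b, c, d}, which is why the report expects identical code for them. On an AArch64 toolchain, compiling a file like this with `gcc -O2 -S` should allow the output to be compared against the listings above. The g variants would presumably take four `const float32_t *` arguments and use `vld1q_lane_f32` and `vld1q_dup_f32` in place of the set-lane and dup calls, but that is likewise an assumption.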