https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92665
Bug ID: 92665
Summary: [AArch64] low lanes select not optimized out for vmlal intrinsics
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
Target Milestone: ---

With gcc as of today I see dup instructions that could be optimized out:

$ cat red.c
#include "arm_neon.h"

int32x4_t fun(int32x4_t a, int16x8_t b, int16x8_t c) {
  a = vmlal_s16(a, vget_low_s16(b), vget_low_s16(c));
  a = vmlal_high_s16(a, b, c);
  return a;
}

$ gcc -O3 -S -o- red.c
fun:
        dup     d3, v1.d[0]
        dup     d4, v2.d[0]
        smlal   v0.4s,v3.4h,v4.4h
        smlal2  v0.4s,v1.8h,v2.8h
        ret

$ clang -O3 -S -o- red.c
fun:
        smlal   v0.4s, v1.4h, v2.4h
        smlal2  v0.4s, v1.8h, v2.8h
        ret
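
To illustrate why the two dup instructions look redundant, here is a minimal portable scalar sketch of the two sequences. It is not GCC code and does not use arm_neon.h: the types acc4_t and vec8_t and all model_* functions are hypothetical stand-ins introduced only for this illustration. It relies on smlal reading only the low four 16-bit lanes of its source registers and smlal2 reading only the high four, so selecting the low half first should not change the result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for int32x4_t and int16x8_t (not ACLE types). */
typedef struct { int32_t lane[4]; } acc4_t;
typedef struct { int16_t lane[8]; } vec8_t;

/* Wrapping 32-bit multiply-accumulate of two sign-extended 16-bit values. */
static int32_t mla_wrap(int32_t acc, int16_t x, int16_t y) {
    int32_t prod = (int32_t)x * (int32_t)y;   /* always fits in 32 bits */
    return (int32_t)((uint32_t)acc + (uint32_t)prod);
}

/* Models "smlal vd.4s, vn.4h, vm.4h": reads lanes 0..3 only. */
static acc4_t model_smlal(acc4_t a, vec8_t b, vec8_t c) {
    for (int i = 0; i < 4; i++)
        a.lane[i] = mla_wrap(a.lane[i], b.lane[i], c.lane[i]);
    return a;
}

/* Models "smlal2 vd.4s, vn.8h, vm.8h": reads lanes 4..7 only. */
static acc4_t model_smlal2(acc4_t a, vec8_t b, vec8_t c) {
    for (int i = 0; i < 4; i++)
        a.lane[i] = mla_wrap(a.lane[i], b.lane[4 + i], c.lane[4 + i]);
    return a;
}

/* Models "dup dN, vM.d[0]": keeps the low 64 bits, zeroes the rest. */
static vec8_t model_low_half(vec8_t v) {
    for (int i = 4; i < 8; i++)
        v.lane[i] = 0;
    return v;
}

/* Shape of the gcc output: low halves copied out before smlal. */
static acc4_t model_fun_with_dup(acc4_t a, vec8_t b, vec8_t c) {
    a = model_smlal(a, model_low_half(b), model_low_half(c));
    return model_smlal2(a, b, c);
}

/* Shape of the clang output: smlal applied to the full registers. */
static acc4_t model_fun_without_dup(acc4_t a, vec8_t b, vec8_t c) {
    a = model_smlal(a, b, c);
    return model_smlal2(a, b, c);
}

/* Checks that both sequences agree lane by lane for one input. */
static void model_check_equal(acc4_t a, vec8_t b, vec8_t c) {
    acc4_t x = model_fun_with_dup(a, b, c);
    acc4_t y = model_fun_without_dup(a, b, c);
    for (int i = 0; i < 4; i++)
        assert(x.lane[i] == y.lane[i]);
}
```

Calling model_check_equal on arbitrary inputs should never trip the assertion, since model_smlal ignores lanes 4..7 regardless of what model_low_half leaves there. This only models the lane arithmetic; it says nothing about where in GCC the simplification would belong.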