https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406
Bug ID: 109406
Summary: Missing use of aarch64 SVE2 unpredicated integer multiply
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

For the testcase

#define N 1024
long long res[N];
long long in1[N];
long long in2[N];
void
mult (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in1[i] * in2[i];
}

With -O3 -march=armv8.5-a+sve2 we generate the loop:

        ptrue   p1.b, all
        whilelo p0.d, wzr, w2
.L2:
        ld1d    z0.d, p0/z, [x4, x0, lsl 3]
        ld1d    z1.d, p0/z, [x3, x0, lsl 3]
        mul     z0.d, p1/m, z0.d, z1.d
        st1d    z0.d, p0, [x1, x0, lsl 3]
        incd    x0
        whilelo p0.d, w0, w2
        b.any   .L2
        ret

SVE2 supports the MUL (vectors, unpredicated) instruction that would allow us
to eliminate the use of p1.
Clang manages to do this (though it has other inefficiencies) in
https://godbolt.org/z/7xj6xEchx
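
For illustration, a sketch of the loop one might hope for. This is hand-edited from the GCC output above, not actual compiler output. It assumes everything else in the loop stays as is and only the ptrue and the governing predicate on the multiply are dropped:

```asm
        whilelo p0.d, wzr, w2
.L2:
        ld1d    z0.d, p0/z, [x4, x0, lsl 3]
        ld1d    z1.d, p0/z, [x3, x0, lsl 3]
        mul     z0.d, z0.d, z1.d        // SVE2 MUL (vectors, unpredicated): no p1 needed
        st1d    z0.d, p0, [x1, x0, lsl 3]
        incd    x0
        whilelo p0.d, w0, w2
        b.any   .L2
        ret
```

The unpredicated multiply operates on all lanes, which should be harmless here: the lanes that are inactive under p0 were zeroed by the ld1d loads (p0/z) and are never written back, because the st1d is still governed by p0.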