https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192
Bug ID: 114192
Summary: scalar code left around following early break vectorization of reduction
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: acoplan at gcc dot gnu.org
Target Milestone: ---

For the following testcase:

    int a[1024];
    int f4(int *x, int n) {
      int sum = 0;
      for (int i = 0; i < n; i++) {
        sum += a[i];
        if (a[i] == 42)
          break;
      }
      return sum;
    }

at -O3 on aarch64 we vectorize it and get the following vector loop:

    .L4:
            cmp     x7, x2
            beq     .L23
    .L6:
            ubfiz   x3, x2, 4, 32
            ldr     w6, [x4, x2, lsl 2]     // scalar load
            mov     v27.16b, v30.16b
            mov     w0, w5
            add     v30.4s, v30.4s, v25.4s
            add     w5, w5, w6              // scalar add
            ldr     q29, [x4, x3]
            add     x2, x2, 1
            cmeq    v31.4s, v29.4s, v26.4s
            add     v28.4s, v28.4s, v29.4s
            umaxp   v31.4s, v31.4s, v31.4s
            fmov    x3, d31
            cbz     x3, .L4

but here the old scalar code has been left around. If we remove the early exit from the loop, then although we still leave the scalar code around in the vectorizer, it gets optimized away immediately by the following DCE pass.

Without the early exit, in the vectorizer dump we have:

    <bb 3> [local count: 860067200]:
    # sum_10 = PHI <sum_6(6), 0(9)>
    # i_12 = PHI <i_7(6), 0(9)>
    # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
    # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
    # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
    vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
    _1 = a[i_12];    // scalar load
    vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
    sum_6 = _1 + sum_10;
    i_7 = i_12 + 1;
    vectp_a.9_27 = vectp_a.9_26 + 16;
    ivtmp_33 = ivtmp_32 + 1;
    if (ivtmp_33 < bnd.5_22)
      goto <bb 6>; [89.00%]
    else
      goto <bb 11>; [11.00%]

i.e. the scalar load is left around, but it seems to get cleaned up by the (immediately following) dce pass:

    <bb 3> [local count: 860067200]:
    # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
    # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
    # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
    vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
    vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
    vectp_a.9_27 = vectp_a.9_26 + 16;
    ivtmp_33 = ivtmp_32 + 1;
    if (ivtmp_33 < bnd.5_22)
      goto <bb 6>; [89.00%]
    else
      goto <bb 11>; [11.00%]

perhaps the dce needs improving to clean up the dead scalar code in the early exit case, too.
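
For anyone wanting to compare the two cases side by side, the following is a minimal sketch that puts the original testcase next to a variant with the early exit removed. The name `f4_no_break` is hypothetical and does not appear in the report; it is only assumed here to correspond to the "without the early exit" loop whose dumps are quoted above.

```c
/* Sketch for comparing the two loops discussed in the report.
   f4 is the testcase as filed; f4_no_break is a hypothetical name
   for the same loop with the early exit removed. */
int a[1024];

/* Reduction with an early exit: stops after adding the first 42. */
int f4(int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum += a[i];
      if (a[i] == 42)
        break;
    }
  return sum;
}

/* Same reduction without the early exit: always sums n elements. */
int f4_no_break(int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```

Building both with -O3 and a dump option such as -fdump-tree-vect should, if the behaviour matches the report, show the scalar statements surviving only in the f4 loop after the later cleanup passes.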