https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

            Bug ID: 111770
           Summary: predicated loads inactive lane values not modelled
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

For this example:

int foo(int n, char *a, char *b) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

we generate with -O3 -march=armv8-a+sve

.L3:
        ld1b    z29.b, p7/z, [x1, x3]
        ld1b    z31.b, p7/z, [x2, x3]
        add     x3, x3, x4
        sel     z31.b, p7, z31.b, z28.b
        whilelo p7.b, w3, w0
        udot    z30.s, z29.b, z31.b
        b.any   .L3
        uaddv   d30, p6, z30.s
        fmov    w0, s30
        ret

Which is pretty good, but we completely ruin it with the SEL. In gimple this
is:

  vect__7.12_81 = .MASK_LOAD (_21, 8B, loop_mask_77);
  masked_op1_82 = .VCOND_MASK (loop_mask_77, vect__7.12_81, { 0, ... });
  vect_patt_33.13_83 = DOT_PROD_EXPR <vect__3.9_78, masked_op1_82,
vect_sum_19.6_74>;

The missed optimization here is that we don't model what happens to the
inactive lanes of predicated operations that zero them. I.e. in this case the
.MASK_LOAD (a zeroing ld1b on SVE) already zeroes the inactive lanes, so the
.VCOND_MASK that selects zero for those same lanes under the same mask is
completely superfluous.

I'm not entirely sure how we should go about fixing this generally.
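
To make the redundancy concrete, here is a minimal scalar sketch of the two operations. It is not GCC code: the lane count VL and the helper names mask_load_zeroing, vcond_mask and select_is_redundant are all hypothetical, and the load is modelled on the zeroing behaviour of the SVE ld1b ... /z form as described above, not on any guarantee made by the .MASK_LOAD internal function itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed lane count for illustration; real SVE is
   vector-length agnostic. */
#define VL 16

/* Models a zeroing predicated byte load (ld1b ..., p/z, ...):
   active lanes read memory, inactive lanes are set to 0 and the
   corresponding memory is not read. */
void mask_load_zeroing(unsigned char dst[VL], const unsigned char *src,
                       const bool mask[VL])
{
    assert(dst && src && mask);
    for (size_t i = 0; i < VL; ++i)
        dst[i] = mask[i] ? src[i] : 0;
}

/* Models .VCOND_MASK (mask, a, b): lane i is a[i] if mask[i], else b[i]. */
void vcond_mask(unsigned char dst[VL], const bool mask[VL],
                const unsigned char a[VL], const unsigned char b[VL])
{
    assert(dst && mask && a && b);
    for (size_t i = 0; i < VL; ++i)
        dst[i] = mask[i] ? a[i] : b[i];
}

/* Returns 1 if selecting against zero under the same mask leaves the
   result of the zeroing load unchanged, i.e. the select is redundant. */
int select_is_redundant(const unsigned char *src, const bool mask[VL])
{
    unsigned char loaded[VL], selected[VL];
    const unsigned char zeros[VL] = {0};

    mask_load_zeroing(loaded, src, mask);
    vcond_mask(selected, mask, loaded, zeros);
    return memcmp(loaded, selected, VL) == 0;
}
```

Under this model select_is_redundant holds for any mask, because every lane the select would replace with zero was already zeroed by the load. The identity depends on both operations using the same mask and on the else value being zero; with a different mask or a non-zero else value the select would still be needed.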