Tamar Christina <tamar.christ...@arm.com> writes:
> Hi Jennifer,
>
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, October 1, 2024 12:20 PM
>> To: Jennifer Schmitz <jschm...@nvidia.com>
>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector reductions
>>
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > This patch implements the optabs reduc_and_scal_<mode>,
>> > reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
>> > integers for TARGET_SVE in order to use the SVE instructions ANDV,
>> > ORV, and EORV for fixed-width bitwise reductions.
>> > For example, the test case
>> >
>> > int32_t foo (int32_t *a)
>> > {
>> >   int32_t b = -1;
>> >   for (int i = 0; i < 4; ++i)
>> >     b &= a[i];
>> >   return b;
>> > }
>> >
>> > was previously compiled to
>> > (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
>> > foo:
>> >     ldp     w2, w1, [x0]
>> >     ldp     w3, w0, [x0, 8]
>> >     and     w1, w1, w3
>> >     and     w0, w0, w2
>> >     and     w0, w1, w0
>> >     ret
>> >
>> > With the patch, it is compiled to:
>> > foo:
>> >     ldr     q31, [x0]
>> >     ptrue   p7.b, all
>> >     andv    s31, p7, z31.s
>> >     fmov    w0, s31
>> >     ret
>> >
>> > Test cases were added to check the produced assembly for use of SVE
>> > instructions.
>>
>> I would imagine that in this particular case, the scalar version is
>> better.  But I agree it's a useful feature for other cases.
>>
>
> Yeah, I'm concerned because ANDV and other reductions are extremely
> expensive.  But assuming the reductions are done outside of a loop, it
> should be ok.
>
> The issue is that the reduction latency grows with VL, so e.g. compare
> the latencies and throughput for Neoverse V1 and Neoverse V2.  So I
> think we want to gate this on VL128.
>
> As an aside, is the sequence correct?  With an ORR reduction ptrue makes
> sense, but for VL > 128 ptrue doesn't work, as the top bits would be
> zero.  So an ANDV over zero-valued lanes would result in zero.
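To make the predication point concrete, here is a minimal ACLE sketch
(not part of the thread; the function name is made up, and using
svptrue_pat_b32/SV_VL4 is just one possible way to express a fixed
128-bit predicate) of an AND reduction of four 32-bit lanes that stays
correct when the runtime VL exceeds 128 bits:

  #include <arm_sve.h>
  #include <stdint.h>

  int32_t
  and_reduce_128 (const int32_t *a)
  {
    /* Predicate covering only the first four 32-bit lanes (the fixed
       128-bit part), whatever the runtime vector length is.  */
    svbool_t p128 = svptrue_pat_b32 (SV_VL4);

    /* Load the four elements; lanes outside the predicate are zeroed.  */
    svint32_t v = svld1_s32 (p128, a);

    /* AND reduction over the active lanes only.  With svptrue_b32 ()
       instead, any upper (zero) lanes on a VL > 128 target would also
       be active and would force the result to 0.  */
    return svandv_s32 (p128, v);
  }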
Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(

> You'd want to predicate the ANDV with the size of the vector being
> reduced.  The same is true for SMIN and SMAX.
>
> I do wonder whether we need to split the pattern into two, where w->w
> uses the SVE instructions but w->r uses Adv SIMD.
>
> In the w->r case, as in the example above,
>
>     ext     v1.16b, v0.16b, v0.16b, #8
>     and     v0.8b, v0.8b, v1.8b
>     fmov    x8, d0
>     lsr     x9, x8, #32
>     and     w0, w8, w9
>
> would beat the ANDV on pretty much every uarch.
>
> But I'll leave it up to the maintainers.

Also a good point.  And since these are integer reductions, an r result
is more probable than a w result.  w would typically only be used if the
result is stored directly to memory.  (A short illustration of the two
cases follows at the end of this message.)

At which point, the question (which you might have been implying) is
whether it's worth doing this at all, given the limited cases for which
it's beneficial, and the complication that's needed to (a) detect those
cases and (b) make them work.

Thanks,
Richard

> Thanks for the patch though!
>
> Cheers,
> Tamar
>
>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
>> Taking that into account:
>>
>> > [...]
>> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>> > index bfa28849adf..0d9e5cebef0 100644
>> > --- a/gcc/config/aarch64/aarch64-sve.md
>> > +++ b/gcc/config/aarch64/aarch64-sve.md
>> > @@ -8927,6 +8927,28 @@
>> >    "<su>addv\t%d0, %1, %2.<Vetype>"
>> > )
>> >
>> > +;; Unpredicated logical integer reductions for Advanced SIMD modes.
>> > +(define_expand "reduc_<optab>_scal_<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand")
>> > +        (unspec:<VEL> [(match_dup 2)
>> > +                       (match_operand:VQ_I 1 "register_operand")]
>> > +                      SVE_INT_REDUCTION_LOGICAL))]
>> > +  "TARGET_SVE"
>> > +  {
>> > +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>> > +  }
>> > +)
>> > +
>> > +;; Predicated logical integer reductions for Advanced SIMD modes.
>> > +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +        (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
>> > +                       (match_operand:VQ_I 2 "register_operand" "w")]
>> > +                      SVE_INT_REDUCTION_LOGICAL))]
>> > +  "TARGET_SVE"
>> > +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
>> > +)
>> > +
>>
>> ...I think we should avoid adding more patterns, and instead extend
>> the existing:
>>
>>   (define_expand "reduc_<optab>_scal_<mode>"
>>     [(set (match_operand:<VEL> 0 "register_operand")
>>           (unspec:<VEL> [(match_dup 2)
>>                          (match_operand:SVE_FULL_I 1 "register_operand")]
>>                         SVE_INT_REDUCTION))]
>>     "TARGET_SVE"
>>     {
>>       operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>     }
>>   )
>>
>> to all Advanced SIMD integer modes except V1DI.  This would involve
>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
>> (SVE_FULL_I_SIMD?).
>>
>> Then we could change the name of:
>>
>>   (define_expand "reduc_<optab>_scal_<mode>"
>>     [(match_operand:<VEL> 0 "register_operand")
>>      (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
>>                       MAXMINV)]
>>     "TARGET_SIMD"
>>     {
>>       rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
>>       rtx scratch = gen_reg_rtx (<MODE>mode);
>>       emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
>>                                                            operands[1]));
>>       emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
>>       DONE;
>>     }
>>   )
>>
>> to something like:
>>
>>   (define_expand "@aarch64_advsimd_<optab><mode>"
>>
>> The expander above would then be something like:
>>
>>   (define_expand "reduc_<optab>_scal_<mode>"
>>     [(set (match_operand:<VEL> 0 "register_operand")
>>           (unspec:<VEL> [(match_dup 2)
>>                          (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
>>                         SVE_INT_REDUCTION))]
>>     "TARGET_SVE
>>      || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
>>          != CODE_FOR_nothing)"
>>     {
>>       if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>,
>>                                                <MODE>mode, operands[0],
>>                                                operands[1]))
>>         {
>>           emit_insn (pat);
>>           DONE;
>>         }
>>       operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>     }
>>   )
>>
>> which acts as a kind of union pattern for SVE and Advanced SIMD.
>>
>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV
>> and EORV cases.
>>
>> Not tested, so let me know if it doesn't work :)
>>
>> Thanks,
>> Richard
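As an aside on the r-result vs. w-result distinction discussed above,
here are two hypothetical C functions (the names are made up for
illustration).  In the first, the reduced value is consumed in a general
(r) register, where the Advanced SIMD ext/and/fmov/lsr/and sequence is
likely the better choice; in the second, it is stored straight back to
memory, so it can stay in a SIMD (w) register and a suitably predicated
SVE ANDV avoids the move to a general register:

  #include <stdint.h>

  /* r result: the reduction is returned in a general register.  */
  int32_t
  and_reduce_to_r (const int32_t *a)
  {
    int32_t b = -1;
    for (int i = 0; i < 4; ++i)
      b &= a[i];
    return b;
  }

  /* w result: the reduction goes straight back to memory, so it can
     stay in a SIMD register.  */
  void
  and_reduce_to_w (const int32_t *a, int32_t *out)
  {
    int32_t b = -1;
    for (int i = 0; i < 4; ++i)
      b &= a[i];
    *out = b;
  }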