> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Wednesday, October 2, 2024 1:09 PM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Jennifer Schmitz
> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
> <ktkac...@exchange.nvidia.com>
> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector
> reductions
>
>
> On 2 Oct 2024, at 13:43, Richard Sandiford <richard.sandif...@arm.com>
> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > Tamar Christina <tamar.christ...@arm.com> writes:
> >> Hi Jennifer,
> >>
> >>> -----Original Message-----
> >>> From: Richard Sandiford <richard.sandif...@arm.com>
> >>> Sent: Tuesday, October 1, 2024 12:20 PM
> >>> To: Jennifer Schmitz <jschm...@nvidia.com>
> >>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
> >>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for
> >>> vector reductions
> >>>
> >>> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>> This patch implements the optabs reduc_and_scal_<mode>,
> >>>> reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
> >>>> integers for TARGET_SVE in order to use the SVE instructions ANDV, ORV,
> >>>> and EORV for fixed-width bitwise reductions.
> >>>> For example, the test case
> >>>>
> >>>> int32_t foo (int32_t *a)
> >>>> {
> >>>>   int32_t b = -1;
> >>>>   for (int i = 0; i < 4; ++i)
> >>>>     b &= a[i];
> >>>>   return b;
> >>>> }
> >>>>
> >>>> was previously compiled to
> >>>> (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
> >>>> foo:
> >>>>         ldp     w2, w1, [x0]
> >>>>         ldp     w3, w0, [x0, 8]
> >>>>         and     w1, w1, w3
> >>>>         and     w0, w0, w2
> >>>>         and     w0, w1, w0
> >>>>         ret
> >>>>
> >>>> With the patch, it is compiled to:
> >>>> foo:
> >>>>         ldr     q31, [x0]
> >>>>         ptrue   p7.b, all
> >>>>         andv    s31, p7, z31.s
> >>>>         fmov    w0, s31
> >>>>         ret
> >>>>
> >>>> Test cases were added to check the produced assembly for use of SVE
> >>>> instructions.
> >>>
> >>> I would imagine that in this particular case, the scalar version is
> >>> better.  But I agree it's a useful feature for other cases.
> >>>
> >>
> >> Yeah, I'm concerned because ANDV and other reductions are extremely
> >> expensive.  But assuming the reductions are done outside of a loop,
> >> it should be ok.
> >>
> >> The issue is that the reduction latency grows with VL, so e.g. compare
> >> the latencies and throughput for Neoverse V1 and Neoverse V2.  So I
> >> think we want to gate this on VL128.
> >>
> >> As an aside, is the sequence correct?  With ORR reduction ptrue makes
> >> sense, but for VL > 128 ptrue doesn't work as the top bits would be zero.
> >> So an ANDV over zero-valued lanes would result in zero.
> >
> > Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(
> >
> >> You'd want to predicate the ANDV with the size of the vector being
> >> reduced.  The same is true for SMIN and SMAX.
> >>
> >> I do wonder whether we need to split the pattern into two, where w->w
> >> uses the SVE instructions but w->r uses Adv SIMD.
> >>
> >> In the case of w->r, as in the example above,
> >>
> >>         ext     v1.16b, v0.16b, v0.16b, #8
> >>         and     v0.8b, v0.8b, v1.8b
> >>         fmov    x8, d0
> >>         lsr     x9, x8, #32
> >>         and     w0, w8, w9
> >>
> >> would beat the ANDV on pretty much every uarch.
> >>
> >> But I'll leave it up to the maintainers.
> >
> > Also a good point.
> > And since these are integer reductions, an r result is more probable
> > than a w result.  w would typically only be used if the result is
> > stored directly to memory.
> >
> > At which point, the question (which you might have been implying)
> > is whether it's worth doing this at all, given the limited cases
> > for which it's beneficial, and the complication that's needed to
> > (a) detect those cases and (b) make them work.
>
> These are good points in the thread.  Maybe it makes sense to do this only
> for V16QI reductions?
> Maybe a variant of Tamar’s w->r sequence wins out even there.
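
For comparison, a w->r-style OR reduction for V16QI along the lines of the
AND sequence above might look roughly like the sketch below (untested and
purely illustrative, not taken from the thread; register choices are
arbitrary, and the reduction result ends up in the low byte of w0):

        ldr     q0, [x0]                      // load 16 bytes
        ext     v1.16b, v0.16b, v0.16b, #8    // high 8 bytes into v1
        orr     v0.8b, v0.8b, v1.8b           // fold 16 bytes -> 8 bytes
        fmov    x1, d0                        // move to a GPR
        lsr     x2, x1, #32
        orr     x1, x1, x2                    // fold 64 bits -> 32 bits
        lsr     x2, x1, #16
        orr     x1, x1, x2                    // fold 32 bits -> 16 bits
        lsr     x2, x1, #8
        orr     w0, w1, w2                    // fold 16 bits -> 8 bits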
I do agree that they're worth implementing, and also for 64-bit vectors
(there you skip the first reduction and just fmov the value to a GPR, since
you don't have the initial 128 -> 64 bit reduction step).  But I think at the
moment they're possibly not modelled as reductions in our cost model.  Like
Richard mentioned, I don't think the low iteration count cases should
vectorize; they should instead just unroll.

> Originally I had hoped that we’d tackle the straight-line case from
> PR113816, but it seems that GCC didn’t even try to create a reduction op
> for the code there.  Maybe that’s something to look into separately.

Yeah, I think unrolled scalar is going to beat the ORV there, as you can get
better throughput doing the reductions in pairs.

> Also, for the alternative test case that we tried to use as motivation:
>
> char sior_loop (char *a)
> {
>   char b = 0;
>   for (int i = 0; i < 16; ++i)
>     b |= a[i];
>   return b;
> }
>
> GCC generates some terrible code: https://godbolt.org/z/a68jodKca
> So it feels that we should do something to improve the bitwise reductions
> for AArch64 regardless.
> Maybe we need to agree on the optimal sequences for the various modes and
> implement them.

Yeah, I do think they're worth having, because the vectorizer's fallback
method of doing these reductions without the optab is element-wise but on
the SIMD side.  I think the optimal sequence for all of them is a mix of
SIMD + GPR.

Cheers,
Tamar

> Thanks,
> Kyrill
>
> >
> > Thanks,
> > Richard
> >
> >>
> >> Thanks for the patch though!
> >>
> >> Cheers,
> >> Tamar
> >>
> >>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
> >>> Taking that into account:
> >>>
> >>>> [...]
> >>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> >>>> index bfa28849adf..0d9e5cebef0 100644
> >>>> --- a/gcc/config/aarch64/aarch64-sve.md
> >>>> +++ b/gcc/config/aarch64/aarch64-sve.md
> >>>> @@ -8927,6 +8927,28 @@
> >>>>    "<su>addv\t%d0, %1, %2.<Vetype>"
> >>>> )
> >>>>
> >>>> +;; Unpredicated logical integer reductions for Advanced SIMD modes.
> >>>> +(define_expand "reduc_<optab>_scal_<mode>"
> >>>> +  [(set (match_operand:<VEL> 0 "register_operand")
> >>>> +        (unspec:<VEL> [(match_dup 2)
> >>>> +                       (match_operand:VQ_I 1 "register_operand")]
> >>>> +                      SVE_INT_REDUCTION_LOGICAL))]
> >>>> +  "TARGET_SVE"
> >>>> +  {
> >>>> +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>> +  }
> >>>> +)
> >>>> +
> >>>> +;; Predicated logical integer reductions for Advanced SIMD modes.
> >>>> +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
> >>>> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> >>>> +        (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
> >>>> +                       (match_operand:VQ_I 2 "register_operand" "w")]
> >>>> +                      SVE_INT_REDUCTION_LOGICAL))]
> >>>> +  "TARGET_SVE"
> >>>> +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
> >>>> +)
> >>>> +
> >>>
> >>> ...I think we should avoid adding more patterns, and instead extend
> >>> the existing:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(set (match_operand:<VEL> 0 "register_operand")
> >>>         (unspec:<VEL> [(match_dup 2)
> >>>                        (match_operand:SVE_FULL_I 1 "register_operand")]
> >>>                       SVE_INT_REDUCTION))]
> >>>   "TARGET_SVE"
> >>>   {
> >>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>   }
> >>> )
> >>>
> >>> to all Advanced SIMD integer modes except V1DI.  This would involve
> >>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
> >>> (SVE_FULL_I_SIMD?).
> >>>
> >>> Then we could change the name of:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(match_operand:<VEL> 0 "register_operand")
> >>>    (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
> >>>                     MAXMINV)]
> >>>   "TARGET_SIMD"
> >>>   {
> >>>     rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
> >>>     rtx scratch = gen_reg_rtx (<MODE>mode);
> >>>     emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
> >>>                                                          operands[1]));
> >>>     emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
> >>>     DONE;
> >>>   }
> >>> )
> >>>
> >>> to something like:
> >>>
> >>> (define_expand "@aarch64_advsimd_<optab><mode>"
> >>>
> >>> The expander above would then be something like:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(set (match_operand:<VEL> 0 "register_operand")
> >>>         (unspec:<VEL> [(match_dup 2)
> >>>                        (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
> >>>                       SVE_INT_REDUCTION))]
> >>>   "TARGET_SVE
> >>>    || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
> >>>        != CODE_FOR_nothing)"
> >>>   {
> >>>     if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>,
> >>>                                              <MODE>mode))
> >>>       {
> >>>         emit_insn (pat);
> >>>         DONE;
> >>>       }
> >>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>   }
> >>> )
> >>>
> >>> which acts as a kind of union pattern for SVE and Advanced SIMD.
> >>>
> >>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV
> >>> and EORV cases.
> >>>
> >>> Not tested, so let me know if it doesn't work :)
> >>>
> >>> Thanks,
> >>> Richard
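
To illustrate the predication point raised above: for a fixed-width
reduction over a 128-bit Advanced SIMD vector, the predicate needs to cover
only the Advanced SIMD lanes rather than being a full ptrue.  For the
int32_t foo example earlier in the thread, the correct shape would be
something like the following (a hand-written sketch, not compiler output):

        ldr     q31, [x0]            // load the four .s lanes
        ptrue   p7.s, vl4            // predicate limited to the low 128 bits
        andv    s31, p7, z31.s       // AND reduction over active lanes only
        fmov    w0, s31
        ret

With vl4 the upper part of the SVE vector stays inactive and does not
contribute to the ANDV, so the result remains correct for any VL > 128,
unlike the "ptrue p7.b, all" form in the original patch description.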