Tamar Christina <tamar.christ...@arm.com> writes:
> Hi Jennifer,
>
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, October 1, 2024 12:20 PM
>> To: Jennifer Schmitz <jschm...@nvidia.com>
>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector
>> reductions
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> > This patch implements the optabs reduc_and_scal_<mode>,
>> > reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
>> > integers for TARGET_SVE in order to use the SVE instructions ANDV, ORV, and
>> > EORV for fixed-width bitwise reductions.
>> > For example, the test case
>> >
>> > int32_t foo (int32_t *a)
>> > {
>> >   int32_t b = -1;
>> >   for (int i = 0; i < 4; ++i)
>> >     b &= a[i];
>> >   return b;
>> > }
>> >
>> > was previously compiled to
>> > (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
>> > foo:
>> >         ldp     w2, w1, [x0]
>> >         ldp     w3, w0, [x0, 8]
>> >         and     w1, w1, w3
>> >         and     w0, w0, w2
>> >         and     w0, w1, w0
>> >         ret
>> >
>> > With the patch, it is compiled to:
>> > foo:
>> >         ldr     q31, [x0]
>> >         ptrue   p7.b, all
>> >         andv    s31, p7, z31.s
>> >         fmov    w0, s31
>> >         ret
>> >
>> > Test cases were added to check the produced assembly for use of SVE
>> > instructions.
>> 
>> I would imagine that in this particular case, the scalar version is
>> better.  But I agree it's a useful feature for other cases.
>> 
>
> Yeah, I'm concerned because ANDV and other reductions are extremely expensive.
> But assuming the reductions are done outside of a loop, it should be ok.
>
> The issue is that the reduction latency grows with VL; compare e.g. the
> latencies and throughputs for Neoverse V1 and Neoverse V2.  So I think we
> want to gate this on VL128.
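>
> Purely as an untested sketch (whether BITS_PER_SVE_VECTOR is the right
> thing to test here is an assumption), that gate could be expressed by
> tightening the expander condition from plain "TARGET_SVE" to something
> like:
>
>   "TARGET_SVE && known_eq (BITS_PER_SVE_VECTOR, 128)"
>
> so the optab only triggers when the SVE vector length is known to be
> 128 bits.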
>
> As an aside, is the sequence correct?  With an ORR reduction ptrue makes
> sense, but for VL > 128 ptrue doesn't work, as the top bits would be zero.
> So an ANDV over zero-valued lanes would result in zero.

Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(

> You'd want to predicate the ANDV with the size of the vector being reduced.
> The same is true for SMIN and SMAX.
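>
> I.e. for the V4SI case above, something along these lines (an untested
> sketch; register numbers just follow the earlier dump) should be correct
> for any VL, since only the four 32-bit lanes that actually hold data are
> active:
>
>         ldr     q31, [x0]
>         ptrue   p7.s, vl4
>         andv    s31, p7, z31.s
>         fmov    w0, s31
>         ret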
>
> I do wonder whether we need to split the pattern into two, where w->w uses
> the SVE instructions but w->r uses Adv SIMD.
>
> In the case of w->r, for the example above a sequence like
>
>         ext     v1.16b, v0.16b, v0.16b, #8
>         and     v0.8b, v0.8b, v1.8b
>         fmov    x8, d0
>         lsr     x9, x8, #32
>         and     w0, w8, w9
>
> would beat the ANDV on pretty much every uarch.
>
> But I'll leave it up to the maintainers.

Also a good point.  And since these are integer reductions, an r
result is more probable than a w result.  w would typically only
be used if the result is stored directly to memory.
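
For instance, a variant of the original testcase that stores the result
(function name made up, purely for illustration):

void bar (int32_t *a, int32_t *res)
{
  int32_t b = -1;
  for (int i = 0; i < 4; ++i)
    b &= a[i];
  *res = b;
}

could in principle keep the reduction result in the SIMD register file and
store it directly, without ever needing a move to an r register.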

At which point, the question (which you might have been implying)
is whether it's worth doing this at all, given the limited cases
for which it's beneficial, and the complication that's needed to
(a) detect those cases and (b) make them work.

Thanks,
Richard

>
> Thanks for the patch though!
>
> Cheers,
> Tamar
>
>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
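>>
>> For example, a V2DI max reduction along these lines (function name made
>> up, purely for illustration):
>>
>> int64_t maxred (int64_t *a)
>> {
>>   int64_t b = a[0];
>>   for (int i = 1; i < 2; ++i)
>>     b = b > a[i] ? b : a[i];
>>   return b;
>> }
>>
>> has no single-instruction Advanced SIMD reduction (SMAXV has no .2d form
>> there), whereas SVE's SMAXV could handle it directly.
>>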
>> Taking that into account:
>> 
>> > [...]
>> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>> > index bfa28849adf..0d9e5cebef0 100644
>> > --- a/gcc/config/aarch64/aarch64-sve.md
>> > +++ b/gcc/config/aarch64/aarch64-sve.md
>> > @@ -8927,6 +8927,28 @@
>> >    "<su>addv\t%d0, %1, %2.<Vetype>"
>> >  )
>> >
>> > +;; Unpredicated logical integer reductions for Advanced SIMD modes.
>> > +(define_expand "reduc_<optab>_scal_<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand")
>> > +        (unspec:<VEL> [(match_dup 2)
>> > +                       (match_operand:VQ_I 1 "register_operand")]
>> > +                      SVE_INT_REDUCTION_LOGICAL))]
>> > +  "TARGET_SVE"
>> > +  {
>> > +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>> > +  }
>> > +)
>> > +
>> > +;; Predicated logical integer reductions for Advanced SIMD modes.
>> > +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
>> > +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > +        (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
>> > +                       (match_operand:VQ_I 2 "register_operand" "w")]
>> > +                      SVE_INT_REDUCTION_LOGICAL))]
>> > +  "TARGET_SVE"
>> > +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
>> > +)
>> > +
>> 
>> ...I think we should avoid adding more patterns, and instead extend
>> the existing:
>> 
>> (define_expand "reduc_<optab>_scal_<mode>"
>>   [(set (match_operand:<VEL> 0 "register_operand")
>>      (unspec:<VEL> [(match_dup 2)
>>                     (match_operand:SVE_FULL_I 1 "register_operand")]
>>                    SVE_INT_REDUCTION))]
>>   "TARGET_SVE"
>>   {
>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>   }
>> )
>> 
>> to all Advanced SIMD integer modes except V1DI.  This would involve
>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
>> (SVE_FULL_I_SIMD?).
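>>
>> As a rough, untested sketch (the exact mode list is a guess and would
>> need checking against how SVE_FULL_SDI_SIMD is defined), that iterator
>> might look like:
>>
>> (define_mode_iterator SVE_FULL_I_SIMD
>>   [SVE_FULL_I V8QI V16QI V4HI V8HI V2SI V4SI V2DI])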
>> 
>> Then we could change the name of:
>> 
>> (define_expand "reduc_<optab>_scal_<mode>"
>>   [(match_operand:<VEL> 0 "register_operand")
>>    (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
>>                  MAXMINV)]
>>   "TARGET_SIMD"
>>   {
>>     rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
>>     rtx scratch = gen_reg_rtx (<MODE>mode);
>>     emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
>>                                                       operands[1]));
>>     emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
>>     DONE;
>>   }
>> )
>> 
>> to something like:
>> 
>> (define_expand "@aarch64_advsimd_<optab><mode>"
>> 
>> The expander above would then be something like:
>> 
>> (define_expand "reduc_<optab>_scal_<mode>"
>>   [(set (match_operand:<VEL> 0 "register_operand")
>>      (unspec:<VEL> [(match_dup 2)
>>                     (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
>>                    SVE_INT_REDUCTION))]
>>   "TARGET_SVE
>>    || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
>>        != CODE_FOR_nothing)"
>>   {
>>     if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>,
>>                                              <MODE>mode))
>>       {
>>         emit_insn (pat);
>>         DONE;
>>       }
>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>   }
>> )
>> 
>> which acts as a kind of union pattern for SVE and Advanced SIMD.
>> 
>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV
>> and EORV cases.
>> 
>> Not tested, so let me know if it doesn't work :)
>> 
>> Thanks,
>> Richard
