> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Wednesday, October 2, 2024 1:09 PM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Jennifer Schmitz
> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
> <ktkac...@exchange.nvidia.com>
> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector
> reductions
>
>
> On 2 Oct 2024, at 13:43, Richard Sandiford <richard.sandif...@arm.com>
> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > Tamar Christina <tamar.christ...@arm.com> writes:
> >> Hi Jennifer,
> >>
> >>> -----Original Message-----
> >>> From: Richard Sandiford <richard.sandif...@arm.com>
> >>> Sent: Tuesday, October 1, 2024 12:20 PM
> >>> To: Jennifer Schmitz <jschm...@nvidia.com>
> >>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
> >>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for
> >>> vector reductions
> >>>
> >>> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>> This patch implements the optabs reduc_and_scal_<mode>,
> >>>> reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
> >>>> integers for TARGET_SVE in order to use the SVE instructions ANDV, ORV,
> >>>> and EORV for fixed-width bitwise reductions.
> >>>> For example, the test case
> >>>>
> >>>> int32_t foo (int32_t *a)
> >>>> {
> >>>>   int32_t b = -1;
> >>>>   for (int i = 0; i < 4; ++i)
> >>>>     b &= a[i];
> >>>>   return b;
> >>>> }
> >>>>
> >>>> was previously compiled to
> >>>> (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
> >>>> foo:
> >>>>         ldp     w2, w1, [x0]
> >>>>         ldp     w3, w0, [x0, 8]
> >>>>         and     w1, w1, w3
> >>>>         and     w0, w0, w2
> >>>>         and     w0, w1, w0
> >>>>         ret
> >>>>
> >>>> With the patch, it is compiled to:
> >>>> foo:
> >>>>         ldr     q31, [x0]
> >>>>         ptrue   p7.b, all
> >>>>         andv    s31, p7, z31.s
> >>>>         fmov    w0, s31
> >>>>         ret
> >>>>
> >>>> Test cases were added to check the produced assembly for use of SVE
> >>>> instructions.
> >>>
> >>> I would imagine that in this particular case, the scalar version is
> >>> better.  But I agree it's a useful feature for other cases.
> >>>
> >>
> >> Yeah, I'm concerned because ANDV and other reductions are extremely
> >> expensive.  But assuming the reductions are done outside of a loop,
> >> it should be ok.
> >>
> >> The issue is that the reduction latency grows with VL, so e.g. compare
> >> the latencies and throughput for Neoverse V1 and Neoverse V2.  So I
> >> think we want to gate this on VL128.
> >>
> >> As an aside, is the sequence correct?  With ORR reduction ptrue makes
> >> sense, but for VL > 128 ptrue doesn't work as the top bits would be zero.
> >> So an ANDV over zero-valued lanes would result in zero.
> >
> > Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(
> >
> >> You'd want to predicate the ANDV with the size of the vector being
> >> reduced.  The same is true for SMIN and SMAX.
> >>
> >> I do wonder whether we need to split the pattern into two, where w->w
> >> uses the SVE instructions but w->r uses Adv SIMD.
> >>
> >> In the case of w->r, as in the example above,
> >>
> >>         ext     v1.16b, v0.16b, v0.16b, #8
> >>         and     v0.8b, v0.8b, v1.8b
> >>         fmov    x8, d0
> >>         lsr     x9, x8, #32
> >>         and     w0, w8, w9
> >>
> >> would beat the ANDV on pretty much every uarch.
> >>
> >> But I'll leave it up to the maintainers.
> >
> > Also a good point.
> > And since these are integer reductions, an r result is more probable
> > than a w result.  w would typically only be used if the result is
> > stored directly to memory.
> >
> > At which point, the question (which you might have been implying)
> > is whether it's worth doing this at all, given the limited cases
> > for which it's beneficial, and the complication that's needed to
> > (a) detect those cases and (b) make them work.
>
> These are good points in the thread.  Maybe it makes sense to do this only
> for V16QI reductions?
> Maybe a variant of Tamar’s w->r sequence wins out even there.
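
For comparison, a w->r-style OR reduction for V16QI along the lines of the
AND sequence above might look roughly like the sketch below (untested and
purely illustrative, not taken from the thread; register choices are
arbitrary, and the reduction result ends up in the low byte of w0):

        ldr     q0, [x0]                      // load 16 bytes
        ext     v1.16b, v0.16b, v0.16b, #8    // high 8 bytes into v1
        orr     v0.8b, v0.8b, v1.8b           // fold 16 bytes -> 8 bytes
        fmov    x1, d0                        // move to a GPR
        lsr     x2, x1, #32
        orr     x1, x1, x2                    // fold 64 bits -> 32 bits
        lsr     x2, x1, #16
        orr     x1, x1, x2                    // fold 32 bits -> 16 bits
        lsr     x2, x1, #8
        orr     w0, w1, w2                    // fold 16 bits -> 8 bits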
I do agree that they're worth implementing, and also for 64-bit vectors
(there you skip the first reduction and just fmov the value to a GPR, since
you don't have the initial 128 -> 64 bit reduction step).  But I think at the
moment they're possibly not modelled as reductions in our cost model.  Like
Richard mentioned, I don't think the low iteration count cases should
vectorize; they should instead just unroll.

> Originally I had hoped that we’d tackle the straight-line case from
> PR113816, but it seems that GCC didn’t even try to create a reduction op
> for the code there.  Maybe that’s something to look into separately.

Yeah, I think unrolled scalar is going to beat the ORV there, as you can get
better throughput doing the reductions in pairs.

> Also, for the alternative test case that we tried to use as motivation:
>
> char sior_loop (char *a)
> {
>   char b = 0;
>   for (int i = 0; i < 16; ++i)
>     b |= a[i];
>   return b;
> }
>
> GCC generates some terrible code: https://godbolt.org/z/a68jodKca
> So it feels that we should do something to improve the bitwise reductions
> for AArch64 regardless.
> Maybe we need to agree on the optimal sequences for the various modes and
> implement them.

Yeah, I do think they're worth having, because the vectorizer's fallback
method of doing these reductions without the optab is element-wise but on
the SIMD side.  I think the optimal sequence for all of them is a mix of
SIMD + GPR.

Cheers,
Tamar

> Thanks,
> Kyrill
>
> >
> > Thanks,
> > Richard
> >
> >>
> >> Thanks for the patch though!
> >>
> >> Cheers,
> >> Tamar
> >>
> >>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
> >>> Taking that into account:
> >>>
> >>>> [...]
> >>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> >>>> index bfa28849adf..0d9e5cebef0 100644
> >>>> --- a/gcc/config/aarch64/aarch64-sve.md
> >>>> +++ b/gcc/config/aarch64/aarch64-sve.md
> >>>> @@ -8927,6 +8927,28 @@
> >>>>    "<su>addv\t%d0, %1, %2.<Vetype>"
> >>>> )
> >>>>
> >>>> +;; Unpredicated logical integer reductions for Advanced SIMD modes.
> >>>> +(define_expand "reduc_<optab>_scal_<mode>"
> >>>> +  [(set (match_operand:<VEL> 0 "register_operand")
> >>>> +        (unspec:<VEL> [(match_dup 2)
> >>>> +                       (match_operand:VQ_I 1 "register_operand")]
> >>>> +                      SVE_INT_REDUCTION_LOGICAL))]
> >>>> +  "TARGET_SVE"
> >>>> +  {
> >>>> +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>> +  }
> >>>> +)
> >>>> +
> >>>> +;; Predicated logical integer reductions for Advanced SIMD modes.
> >>>> +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
> >>>> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> >>>> +        (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
> >>>> +                       (match_operand:VQ_I 2 "register_operand" "w")]
> >>>> +                      SVE_INT_REDUCTION_LOGICAL))]
> >>>> +  "TARGET_SVE"
> >>>> +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
> >>>> +)
> >>>> +
> >>>
> >>> ...I think we should avoid adding more patterns, and instead extend
> >>> the existing:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(set (match_operand:<VEL> 0 "register_operand")
> >>>         (unspec:<VEL> [(match_dup 2)
> >>>                        (match_operand:SVE_FULL_I 1 "register_operand")]
> >>>                       SVE_INT_REDUCTION))]
> >>>   "TARGET_SVE"
> >>>   {
> >>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>   }
> >>> )
> >>>
> >>> to all Advanced SIMD integer modes except V1DI.  This would involve
> >>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
> >>> (SVE_FULL_I_SIMD?).
> >>>
> >>> Then we could change the name of:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(match_operand:<VEL> 0 "register_operand")
> >>>    (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
> >>>                     MAXMINV)]
> >>>   "TARGET_SIMD"
> >>>   {
> >>>     rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
> >>>     rtx scratch = gen_reg_rtx (<MODE>mode);
> >>>     emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
> >>>                                                          operands[1]));
> >>>     emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
> >>>     DONE;
> >>>   }
> >>> )
> >>>
> >>> to something like:
> >>>
> >>> (define_expand "@aarch64_advsimd_<optab><mode>"
> >>>
> >>> The expander above would then be something like:
> >>>
> >>> (define_expand "reduc_<optab>_scal_<mode>"
> >>>   [(set (match_operand:<VEL> 0 "register_operand")
> >>>         (unspec:<VEL> [(match_dup 2)
> >>>                        (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
> >>>                       SVE_INT_REDUCTION))]
> >>>   "TARGET_SVE
> >>>    || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
> >>>        != CODE_FOR_nothing)"
> >>>   {
> >>>     if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>,
> >>>                                              <MODE>mode))
> >>>       {
> >>>         emit_insn (pat);
> >>>         DONE;
> >>>       }
> >>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
> >>>   }
> >>> )
> >>>
> >>> which acts as a kind of union pattern for SVE and Advanced SIMD.
> >>>
> >>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV
> >>> and EORV cases.
> >>>
> >>> Not tested, so let me know if it doesn't work :)
> >>>
> >>> Thanks,
> >>> Richard
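
To illustrate the predication point raised above: for a fixed-width
reduction over a 128-bit Advanced SIMD vector, the predicate needs to cover
only the Advanced SIMD lanes rather than being a full ptrue.  For the
int32_t foo example earlier in the thread, the correct shape would be
something like the following (a hand-written sketch, not compiler output):

        ldr     q31, [x0]            // load the four .s lanes
        ptrue   p7.s, vl4            // predicate limited to the low 128 bits
        andv    s31, p7, z31.s       // AND reduction over active lanes only
        fmov    w0, s31
        ret

With vl4 the upper part of the SVE vector stays inactive and does not
contribute to the ANDV, so the result remains correct for any VL > 128,
unlike the "ptrue p7.b, all" form in the original patch description.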