Re: [RFC PATCH 1/5] vect: Force alignment peeling to vectorize more early break loops

Richard Biener Wed, 30 Oct 2024 01:58:07 -0700

On Tue, 29 Oct 2024, Alex Coplan wrote:

> On 29/10/2024 13:39, Richard Biener wrote:
> > On Mon, 28 Oct 2024, Alex Coplan wrote:
> > 
> > > This allows us to vectorize more loops with early exits by forcing
> > > peeling for alignment to make sure that we're guaranteed to be able to
> > > safely read an entire vector iteration without crossing a page boundary.
> > > 
> > > To make this work for VLA architectures we have to allow compile-time
> > > non-constant target alignments.  We also have to override the result of
> > > the target's preferred_vector_alignment hook if it isn't a power-of-two
> > > multiple of the TYPE_SIZE of the chosen vector type.
> > > 
> > > There is currently an implicit assumption that the TYPE_SIZE of the
> > > vector type is itself a power of two.  For non-VLA types this
> > > could be checked directly in the vectorizer.  For VLA types I
> > > had discussed offline with Richard S about adding a target hook to allow
> > > the vectorizer to query the backend to confirm that a given VLA type
> > > is known to have a power-of-two size at runtime.
> > 
> > GCC assumes all vectors have power-of-two size, so I don't think we
> > need to check anything but we'd instead have to make sure the
> > target constrains the hardware when this assumption doesn't hold
> > in silicon.
> 
> Ah, I didn't realise this was already assumed to be the case (even for
> VLA in a target-independent way).  In that case it sounds like we might
> not need the check/hook.  Thanks.
> 
> > 
> > >  I thought we
> > > might be able to do this check in vector_alignment_reachable_p.  Any
> > > thoughts on that, richi?
> > 
> > For the purpose of alignment peeling yeah, I guess this would be
> > a possible place to check this.  The hook is currently used for
> > the case where the element has a lower alignment than its
> > size and thus vector alignment cannot be reached by peeling.
> > 
> > Btw, I thought we can already apply peeling for alignment for
> > VLA vectors ...
> 
> Yes.  However, without this patch we don't attempt alignment peeling by
> a compile-time unknown quantity (e.g. to get aligned to the size of a
> VLA vector).  As things stand we take whatever
> targetm.preferred_vector_alignment says we should align to, and for SVE
> that is typically a compile-time constant, I think (see
> aarch64_vectorize_preferred_vector_alignment).
> 
> > 
> > > gcc/ChangeLog:
> > > 
> > >   * tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
> > >   Set need_peeling_for_alignment flag on read DRs instead of
> > >   failing vectorization.  Punt on gathers.
> > >   (dr_misalignment): Handle non-constant target alignments.
> > >   (vect_compute_data_ref_alignment): If need_peeling_for_alignment
> > >   flag is set on the DR, then override the target alignment chosen
> > >   by the preferred_vector_alignment hook to choose a safe
> > >   alignment.
> > >   (vect_supportable_dr_alignment): Override
> > >   support_vector_misalignment hook if need_peeling_for_alignment
> > >   is set on the DR: in this case we must return
> > >   dr_unaligned_unsupported in order to force peeling.
> > >   * tree-vect-loop-manip.cc (vect_do_peeling): Allow prolog
> > >   peeling by a compile-time non-constant amount.
> > >   * tree-vectorizer.h (dr_vec_info): Add new flag
> > >   need_peeling_for_alignment.
> > > ---
> > >  gcc/tree-vect-data-refs.cc  | 77 ++++++++++++++++++++++++++++++-------
> > >  gcc/tree-vect-loop-manip.cc |  6 ---
> > >  gcc/tree-vectorizer.h       |  5 +++
> > >  3 files changed, 68 insertions(+), 20 deletions(-)
> > 
> > Eh, where's the inline copy ...
> 
> Hmm, I suppose I should be passing --inline to git format-patch rather
> than --attach.
> 
> > 
> > @@ -739,15 +739,22 @@ vect_analyze_early_break_dependences (loop_vec_info 
> > loop_vinfo)
> >           if (DR_IS_READ (dr_ref)
> >               && !ref_within_array_bound (stmt, DR_REF (dr_ref)))
> >             {
> > +             if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
> > +               {
> > +                 const char *msg
> > 
> > you want to add STMT_VINFO_STRIDED_P as well.
> 
> Ack, thanks.
> 
> > 
> >       /* Vector size in bytes.  */
> > +      poly_uint64 safe_align
> > +       = exact_div (tree_to_poly_uint64 (TYPE_SIZE (vectype)), 
> > BITS_PER_UNIT);
> > 
> > safe_align = TYPE_SIZE_UNIT (vectype);
> 
> Nice, thanks.
> 
> > 
> > +      /* Multiply by the unroll factor to get the number of bytes read
> > +        per vector iteration.  */
> > +      if (loop_vinfo)
> > +       {
> > +         auto num_copies = vect_get_num_copies (loop_vinfo, vectype);
> > +         gcc_checking_assert (pow2p_hwi (num_copies));
> > +         safe_align *= num_copies;
> > 
> > the unroll factor is the vectorization factor
> 
> I'm probably being dense here, but I thought that the VF was the number
> of scalar iterations we perform per vector iteration.  Is that not the
> case?


Yes.

> > - I think the above goes
> > wrong for grouped accesses like an early break condition
> > 
> >  if (a[2*i] == a[2*i+1])
> > 
> > or so.  Thus, multiply by LOOP_VINFO_VECT_FACTOR (loop_vinfo).
> 
> If my understanding of VF above is correct, then won't this end up being
> too large a number?  E.g. (AIUI) if we vectorize with a single DR with
> vectype having TYPE_MODE of V16QI, then isn't the VF = 16?  So we'd take
> the safe_align=16 alignment (size of the vectype) we actually need,
> multiply it by VF = 16, and try to align unnecessarily to 64 bytes?
> 
> With the above logic (multiplying by num_copies) I was trying to handle
> cases like:
> 
>   short a[256];
>   void f(unsigned *b)
>   {
>       for (int i = 0; i < 128; i++) {
>         if (b[i] == 42)
>           break;
>         a[2*i] = b[i];
>         a[2*i+1] = b[i];
>       }
>   }
> 
> where for every vector iteration we have to load two vectors for b (and
> only store one vector's worth for a).  Hence if we need to peel for b,
> we need to align to (num_copies = 2) * TREE_SIZE (vectype) bytes.

Yeah, I understood that.  The issue is that vect_get_num_copies is
only correct for non-SLP.  So for STMT_VINFO_GROUPED_ACCESS we
are accessing GROUP_SIZE scalar elements per iteration, times
VF and that divided by TYPE_VECTOR_SUBPARTS of the vector type gives
the number of vectors accessed.  So you could use

vect_get_num_vectors (VF * GROUP_SIZE, vectype)

> > Note this number doesn't need to be a power of two (and num_copies
> > above neither)
> 
> To clarify, do you mean that there is no guarantee that the VF (or
> indeed num_copies above) will be a power of two, and as such we
> shouldn't assert it but rather fail vectorization if this is the case?

Yes (you do that below already I think).

> The other interpretation of the above is that we don't need to require
> this for correctness, but I think we do (perhaps I'm missing something,
> though).

I think we do require it for correctness - the accesses may not cross
a page boundary.

Richard.

> Thanks a lot for the reviews.
> 
> Alex
> 
> > 
> > The rest of the patch looks good to me.
> > 
> > Richard.
> > 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [RFC PATCH 1/5] vect: Force alignment peeling to vectorize more early break loops

Reply via email to