On Tue, 29 Oct 2024, Alex Coplan wrote:
> On 29/10/2024 13:39, Richard Biener wrote:
> > On Mon, 28 Oct 2024, Alex Coplan wrote:
> >
> > > This allows us to vectorize more loops with early exits by forcing
> > > peeling for alignment to make sure that we're guaranteed to be able to
> > > safely read an entire vector iteration without crossing a page boundary.
> > >
> > > To make this work for VLA architectures we have to allow compile-time
> > > non-constant target alignments. We also have to override the result of
> > > the target's preferred_vector_alignment hook if it isn't a power-of-two
> > > multiple of the TYPE_SIZE of the chosen vector type.
> > >
> > > There is currently an implicit assumption that the TYPE_SIZE of the
> > > vector type is itself a power of two. For non-VLA types this
> > > could be checked directly in the vectorizer. For VLA types I
> > > had discussed offline with Richard S about adding a target hook to allow
> > > the vectorizer to query the backend to confirm that a given VLA type
> > > is known to have a power-of-two size at runtime.
> >
> > GCC assumes all vectors have power-of-two size, so I don't think we
> > need to check anything but we'd instead have to make sure the
> > target constrains the hardware when this assumption doesn't hold
> > in silicon.
>
> Ah, I didn't realise this was already assumed to be the case (even for
> VLA in a target-independent way). In that case it sounds like we might
> not need the check/hook. Thanks.
>
> >
> > > I thought we
> > > might be able to do this check in vector_alignment_reachable_p. Any
> > > thoughts on that, richi?
> >
> > For the purpose of alignment peeling yeah, I guess this would be
> > a possible place to check this. The hook is currently used for
> > the case where the element has a lower alignment than its
> > size and thus vector alignment cannot be reached by peeling.
> >
> > Btw, I thought we can already apply peeling for alignment for
> > VLA vectors ...
>
> Yes. However, without this patch we don't attempt alignment peeling by
> a compile-time unknown quantity (e.g. to get aligned to the size of a
> VLA vector). As things stand we take whatever
> targetm.preferred_vector_alignment says we should align to, and for SVE
> that is typically a compile-time constant, I think (see
> aarch64_vectorize_preferred_vector_alignment).
>
> >
> > > gcc/ChangeLog:
> > >
> > > * tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
> > > Set need_peeling_for_alignment flag on read DRs instead of
> > > failing vectorization. Punt on gathers.
> > > (dr_misalignment): Handle non-constant target alignments.
> > > (vect_compute_data_ref_alignment): If need_peeling_for_alignment
> > > flag is set on the DR, then override the target alignment chosen
> > > by the preferred_vector_alignment hook to choose a safe
> > > alignment.
> > > (vect_supportable_dr_alignment): Override
> > > support_vector_misalignment hook if need_peeling_for_alignment
> > > is set on the DR: in this case we must return
> > > dr_unaligned_unsupported in order to force peeling.
> > > * tree-vect-loop-manip.cc (vect_do_peeling): Allow prolog
> > > peeling by a compile-time non-constant amount.
> > > * tree-vectorizer.h (dr_vec_info): Add new flag
> > > need_peeling_for_alignment.
> > > ---
> > > gcc/tree-vect-data-refs.cc | 77 ++++++++++++++++++++++++++++++-------
> > > gcc/tree-vect-loop-manip.cc | 6 ---
> > > gcc/tree-vectorizer.h | 5 +++
> > > 3 files changed, 68 insertions(+), 20 deletions(-)
> >
> > Eh, where's the inline copy ...
>
> Hmm, I suppose I should be passing --inline to git format-patch rather
> than --attach.
>
> >
> > @@ -739,15 +739,22 @@ vect_analyze_early_break_dependences (loop_vec_info
> > loop_vinfo)
> > if (DR_IS_READ (dr_ref)
> > && !ref_within_array_bound (stmt, DR_REF (dr_ref)))
> > {
> > + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
> > + {
> > + const char *msg
> >
> > you want to add STMT_VINFO_STRIDED_P as well.
>
> Ack, thanks.
>
> >
> > /* Vector size in bytes. */
> > + poly_uint64 safe_align
> > + = exact_div (tree_to_poly_uint64 (TYPE_SIZE (vectype)),
> > BITS_PER_UNIT);
> >
> > safe_align = TYPE_SIZE_UNIT (vectype);
>
> Nice, thanks.
>
> >
> > + /* Multiply by the unroll factor to get the number of bytes read
> > + per vector iteration. */
> > + if (loop_vinfo)
> > + {
> > + auto num_copies = vect_get_num_copies (loop_vinfo, vectype);
> > + gcc_checking_assert (pow2p_hwi (num_copies));
> > + safe_align *= num_copies;
> >
> > the unroll factor is the vectorization factor
>
> I'm probably being dense here, but I thought that the VF was the number
> of scalar iterations we perform per vector iteration. Is that not the
> case?
Yes.
> > - I think the above goes
> > wrong for grouped accesses like an early break condition
> >
> > if (a[2*i] == a[2*i+1])
> >
> > or so. Thus, multiply by LOOP_VINFO_VECT_FACTOR (loop_vinfo).
>
> If my understanding of VF above is correct, then won't this end up being
> too large a number? E.g. (AIUI) if we vectorize with a single DR with
> vectype having TYPE_MODE of V16QI, then isn't the VF = 16? So we'd take
> the safe_align=16 alignment (size of the vectype) we actually need,
> multiply it by VF = 16, and try to align unnecessarily to 64 bytes?
>
> With the above logic (multiplying by num_copies) I was trying to handle
> cases like:
>
> short a[256];
> void f(unsigned *b)
> {
> for (int i = 0; i < 128; i++) {
> if (b[i] == 42)
> break;
> a[2*i] = b[i];
> a[2*i+1] = b[i];
> }
> }
>
> where for every vector iteration we have to load two vectors for b (and
> only store one vector's worth for a). Hence if we need to peel for b,
> we need to align to (num_copies = 2) * TREE_SIZE (vectype) bytes.
Yeah, I understood that. The issue is that vect_get_num_copies is
only correct for non-SLP. So for STMT_VINFO_GROUPED_ACCESS we
are accessing GROUP_SIZE scalar elements per iteration, times
VF and that divided by TYPE_VECTOR_SUBPARTS of the vector type gives
the number of vectors accessed. So you could use
vect_get_num_vectors (VF * GROUP_SIZE, vectype)
> > Note this number doesn't need to be a power of two (and num_copies
> > above neither)
>
> To clarify, do you mean that there is no guarantee that the VF (or
> indeed num_copies above) will be a power of two, and as such we
> shouldn't assert it but rather fail vectorization if this is the case?
Yes (you do that below already I think).
> The other interpretation of the above is that we don't need to require
> this for correctness, but I think we do (perhaps I'm missing something,
> though).
I think we do require it for correctness - the accesses may not cross
a page boundary.
Richard.
> Thanks a lot for the reviews.
>
> Alex
>
> >
> > The rest of the patch looks good to me.
> >
> > Richard.
> >
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)