On Wed, 29 May 2024, Richard Biener wrote:

> On Wed, 29 May 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > Code generation for contiguous load vectorization can already avoid
> > > loading from a gap in more general situations.  The following extends
> > > the detection of the peeling-for-gaps requirement to match, gets rid
> > > of the old special casing of a half load, and makes sure that when we
> > > do access the gap we have peeling for gaps enabled.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > This is the first patch in a series to improve peeling for gaps;
> > > it turned into a code improvement rather than just the (delayed
> > > from stage3) removal of the "old" half-vector code path.
> > >
> > > I'll wait for the pre-CI testing before pushing, so you also have
> > > time for some comments.
> > 
> > LGTM FWIW (some trivia below).
> > 
> > Out of interest, how far are we off being able to load:
> > 
> >     a[i*8+0]
> >     a[i*8+1]
> >     a[i*8+3]
> >     a[i*8+4]
> > 
> > as two half vectors?  It doesn't look like we're quite there yet,
> > but I might have misread.
> 
> The code in vectorizable_load that eventually would do this only
> triggers when we run into the final "gap" part.  We do not look
> at the intermediate gaps at all (if the above is what we see
> in the loop body).  Extending the code to handle the case
> where the intermediate gap is produced because of unrolling (VF > 1)
> should be possible - we'd simply need to check whether the currently
> loaded elements have unused ones at the end.
> 
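For concreteness, a loop with the access pattern Richard S. asks about
might look like the sketch below (made up for illustration, not a
testcase from the patch).  Each group of eight shorts has a gap in the
middle (element 2) and at the end (elements 5-7), and the two accessed
pairs are the candidates for two half-vector loads:

  /* Elements 0, 1, 3 and 4 of each group of 8 are used, the rest
     form gaps in the middle and at the end of the group.  */
  void
  sum_pairs (short *__restrict out, const short *a, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = a[i * 8 + 0] + a[i * 8 + 1] + a[i * 8 + 3] + a[i * 8 + 4];
  }
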
> > It would be nice if we could eventually integrate the overrun_p checks
> > with the vectorizable_load code whose behaviour they are trying to
> > predict.  E.g. we could run through the vectorizable_load code during the
> > analysis phase and record overruns, similarly to Kewen's costing
> > patches.  As it stands, it seems difficult to make sure that the two
> > checks are exactly in sync, especially when the structure is so
> > different.
> 
> Yeah - that's why I put the assert in now (which I do expect to
> trigger - also thanks to poly-ints may vs. must...)

I quickly looked, and while it should be possible to set
LOOP_VINFO_PEELING_FOR_GAPS from the loops generating the actual
accesses (we run through those now after the costing refactoring),
there are quite a lot of paths that would (possibly) need to be
covered.  We could possibly set a stmt-local
LOOP_VINFO_PEELING_FOR_GAPS flag conservatively and clear it
in the few places we handle the gap, only updating the global
LOOP_VINFO_PEELING_FOR_GAPS at the end, but it's still going to
be tricky not to forget a path here.
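
In schematic form the idea would be something like the sketch below
(made-up names and helpers, not actual vectorizer interfaces):

  #include <stdbool.h>

  /* Per-stmt state: assume the worst, clear it only on the paths
     that prove the gap is handled locally.  */
  struct stmt_gap_state
  {
    bool needs_gap_peeling;
  };

  static void
  analyze_one_access (struct stmt_gap_state *s, bool gap_handled_locally)
  {
    s->needs_gap_peeling = true;      /* conservative default */
    if (gap_handled_locally)          /* e.g. a smaller or aligned load */
      s->needs_gap_peeling = false;
  }

  static void
  finish_analysis (const struct stmt_gap_state *stmts, unsigned n,
                   bool *loop_needs_gap_peeling)
  {
    /* Update the global flag only once all accesses were processed.  */
    for (unsigned i = 0; i < n; ++i)
      if (stmts[i].needs_gap_peeling)
        *loop_needs_gap_peeling = true;
  }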

I've amended my TODO accordingly.

Richard.

> Richard.
> 
> > > Richard.
> > >
> > >   PR tree-optimization/115252
> > >   * tree-vect-stmts.cc (get_group_load_store_type): Enhance
> > >   detection of the cases where we can avoid accessing a gap
> > >   during code generation.
> > >   (vectorizable_load): Remove the old half-vector code for
> > >   avoiding peeling for gaps, which is now redundant.  Add the
> > >   gap-aligned case where it's OK to access the gap.  Add an
> > >   assert that we have peeling for gaps enabled when we access
> > >   a gap.
> > >
> > >   * gcc.dg/vect/slp-gap-1.c: New testcase.
> > > ---
> > >  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
> > >  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
> > >  2 files changed, 46 insertions(+), 30 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > new file mode 100644
> > > index 00000000000..36463ca22c5
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3" } */
> > > +
> > > +typedef unsigned char uint8_t;
> > > +typedef short int16_t;
> > > +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> > > +  for (int y = 0; y < 4; y++) {
> > > +    for (int x = 0; x < 4; x++)
> > > +      diff[x + y * 4] = pix1[x] - pix2[x];
> > > +    pix1 += 16;
> > > +    pix2 += 32;
> > > +  }
> > > +}
> > > +
> > > +/* We can vectorize this without peeling for gaps and thus without epilogue,
> > > +   but the only thing we can reliably scan is the zero-padding trick for the
> > > +   partial loads.  */
> > > +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index a01099d3456..b26cc74f417 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >     dr_alignment_support alss;
> > >     int misalign = dr_misalignment (first_dr_info, vectype);
> > >     tree half_vtype;
> > > +   poly_uint64 remain;
> > > +   unsigned HOST_WIDE_INT tem, num;
> > >     if (overrun_p
> > >         && !masked_p
> > >         && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> > >                                                     vectype, misalign)))
> > >              == dr_aligned
> > >             || alss == dr_unaligned_supported)
> > > -       && known_eq (nunits, (group_size - gap) * 2)
> > > -       && known_eq (nunits, group_size)
> > > -       && (vector_vector_composition_type (vectype, 2, &half_vtype)
> > > -           != NULL_TREE))
> > > +       && can_div_trunc_p (group_size
> > > +                           * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> > > +                           nunits, &tem, &remain)
> > > +       && (known_eq (remain, 0u)
> > > +           || (constant_multiple_p (nunits, remain, &num)
> > > +               && (vector_vector_composition_type (vectype, num,
> > > +                                                   &half_vtype)
> > > +                   != NULL_TREE))))
> > >       overrun_p = false;
> > 
> > Might be worth renaming half_vtype now that it isn't necessarily
> > a strict half.
> > 
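To make the new check concrete with made-up numbers (not taken from the
patch): with group_size = 16, gap = 12, VF = 1 and nunits = 16 the
scalar code accesses group_size * VF - gap = 4 elements, so
can_div_trunc_p gives remain = 4; nunits is then a constant multiple of
remain (num = 4), so if vector_vector_composition_type can build the
vector from four pieces the load can fetch just one quarter and
zero-pad the rest, and overrun_p is cleared - a case the old
known_eq (nunits, (group_size - gap) * 2) condition did not allow.
With remain = 0 the loads cover exactly the accessed elements and no
composition is needed at all.
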
> > >  
> > >     if (overrun_p && !can_overrun_p)
> > > @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
> > >               unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
> > >               unsigned int vect_align
> > >                 = vect_known_alignment_in_bytes (first_dr_info, vectype);
> > > -             unsigned int scalar_dr_size
> > > -               = vect_get_scalar_dr_size (first_dr_info);
> > > -             /* If there's no peeling for gaps but we have a gap
> > > -                with slp loads then load the lower half of the
> > > -                vector only.  See get_group_load_store_type for
> > > -                when we apply this optimization.  */
> > > -             if (slp
> > > -                 && loop_vinfo
> > > -                 && !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> > > -                 && known_eq (nunits, (group_size - gap) * 2)
> > > -                 && known_eq (nunits, group_size)
> > > -                 && gap >= (vect_align / scalar_dr_size))
> > > -               {
> > > -                 tree half_vtype;
> > > -                 new_vtype
> > > -                   = vector_vector_composition_type (vectype, 2,
> > > -                                                     &half_vtype);
> > > -                 if (new_vtype != NULL_TREE)
> > > -                   ltype = half_vtype;
> > > -               }
> > >               /* Try to use a single smaller load when we are about
> > >                  to load excess elements compared to the unrolled
> > > -                scalar loop.
> > > -                ???  This should cover the above case as well.  */
> > > -             else if (known_gt ((vec_num * j + i + 1) * nunits,
> > > +                scalar loop.  */
> > > +             if (known_gt ((vec_num * j + i + 1) * nunits,
> > >                                  (group_size * vf - gap)))
> > 
> > Missing reindentation.
> > 
> > Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
> > We can only skip doing it if we know for sure that the load won't cross
> > the gap.  (Not sure whether the difference can trigger in practice.)
> > 
> > Thanks,
> > Richard
> > 
> > >                 {
> > > +                 poly_uint64 remain = ((group_size * vf - gap)
> > > +                                       - (vec_num * j + i) * nunits);
> > >                   if (known_ge ((vec_num * j + i + 1) * nunits
> > >                                 - (group_size * vf - gap), nunits))
> > >                     /* DR will be unused.  */
> > > @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
> > >                        at least one element is accessed in the
> > >                        scalar loop.  */
> > >                     ;
> > > +                 else if (known_gt (vect_align,
> > > +                                    ((nunits - remain)
> > > +                                     * vect_get_scalar_dr_size
> > > +                                         (first_dr_info))))
> > > +                   /* Aligned access to the gap area when there's
> > > +                      at least one element in it is OK.  */
> > > +                   ;
> > >                   else
> > >                     {
> > > -                     auto remain
> > > -                       = ((group_size * vf - gap)
> > > -                          - (vec_num * j + i) * nunits);
> > >                       /* remain should now be > 0 and < nunits.  */
> > >                       unsigned num;
> > >                       if (constant_multiple_p (nunits, remain, &num))
> > > @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
> > >                             ltype = ptype;
> > >                         }
> > >                       /* Else use multiple loads or a masked load?  */
> > > +                     /* For loop vectorization we now should have
> > > +                        an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> > > +                        set.  */
> > > +                     if (loop_vinfo)
> > > +                       gcc_assert (new_vtype
> > > +                                   || LOOP_VINFO_PEELING_FOR_GAPS
> > > +                                        (loop_vinfo));
> > >                     }
> > >                 }
> > >               tree offset
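
Similarly, on the code generation side, with made-up numbers: if nunits
is 16 char-size elements, the last vector still covers remain = 12
accessed elements and vect_align is 16 bytes, then
(nunits - remain) * scalar_dr_size = 4 bytes of the gap are read, which
is less than the known 16-byte alignment, so the new aligned-gap branch
keeps the full-vector load.  If the known alignment were only 4 bytes,
that branch does not apply and constant_multiple_p (16, 12, &num) fails
as well, so no smaller ltype is available and the new assert then
requires LOOP_VINFO_PEELING_FOR_GAPS to be set for loop vectorization.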
> > 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
