On Wed, 29 May 2024, Richard Biener wrote:

> On Wed, 29 May 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > Code generation for contiguous load vectorization can already deal
> > > with generalized avoidance of loading from a gap.  The following
> > > extends the detection of the peeling-for-gaps requirement with that,
> > > gets rid of the old special-casing of a half load and makes sure
> > > that when we do access the gap we have peeling for gaps enabled.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > This is the first patch in a series to improve peeling for gaps;
> > > it turned into an improvement of the generated code rather than just
> > > doing the (delayed from stage3) removal of the "old" half-vector
> > > codepath.
> > >
> > > I'll wait for the pre-CI testing before pushing so you also have time
> > > for some comments.
> > 
> > LGTM FWIW (some trivia below).
> > 
> > Out of interest, how far are we off being able to load:
> > 
> >   a[i*8+0]
> >   a[i*8+1]
> >   a[i*8+3]
> >   a[i*8+4]
> > 
> > as two half vectors?  It doesn't look like we're quite there yet,
> > but I might have misread.
> 
> The code in vectorizable_load that would eventually do this only
> triggers when we run into the final "gap" part.  We do not look
> at the intermediate gaps at all (if the above is what we see
> in the loop body).  Extending the code to handle the case
> where the intermediate gap is produced because of unrolling (VF > 1)
> should be possible - we'd simply need to check whether the currently
> loaded elements have unused ones at the end.
> 
> > It would be nice if we could eventually integrate the overrun_p checks
> > with the vectorizable_load code that the code is trying to predict.
> > E.g. we could run through the vectorizable_load code during the
> > analysis phase and record overruns, similarly to Kewen's costing
> > patches.  As it stands, it seems difficult to make sure that the two
> > checks are exactly in sync, especially when the structure is so
> > different.
> 
> Yeah - that's why I put the assert in now (which I do expect to
> trigger - also thanks to poly-ints may vs. must...)
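To make the access pattern in question concrete: a scalar loop with
such an intermediate gap could look like the following.  This is a
hypothetical illustration only - it is not part of the patch or its
testsuite, and foo is a made-up name.

  void
  foo (int *__restrict out, int *a, int n)
  {
    for (int i = 0; i < n; i++)
      {
        /* Each iteration reads a group of 8 ints but skips element 2
           (an intermediate gap) and elements 5-7 (the final gap).  */
        out[i * 4 + 0] = a[i * 8 + 0];
        out[i * 4 + 1] = a[i * 8 + 1];
        out[i * 4 + 2] = a[i * 8 + 3];
        out[i * 4 + 3] = a[i * 8 + 4];
      }
  }

Loading {a[i*8+0], a[i*8+1]} and {a[i*8+3], a[i*8+4]} as two half
vectors and composing them would avoid touching either gap.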
I quickly looked, and while it should be possible to set
LOOP_VINFO_PEELING_FOR_GAPS from the loops generating the actual
accesses (we run through those now after the costing refactoring),
there are quite a lot of paths that would (possibly) need to be
covered.  We could possibly set a stmt-local LOOP_VINFO_PEELING_FOR_GAPS
flag conservatively and clear it in the few places where we handle the
gap, only updating the global LOOP_VINFO_PEELING_FOR_GAPS at the end,
but it's still going to be tricky not to forget a path here.  I've
amended my TODO accordingly.

Richard.

> Richard.
> 
> > Richard.
> > 
> > > 	PR tree-optimization/115252
> > > 	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
> > > 	detecting the number of cases where we can avoid accessing a gap
> > > 	during code generation.
> > > 	(vectorizable_load): Remove old half-vector peeling for gap
> > > 	avoidance which is now redundant.  Add gap-aligned case where
> > > 	it's OK to access the gap.  Add assert that we have peeling for
> > > 	gaps enabled when we access a gap.
> > > 
> > > 	* gcc.dg/vect/slp-gap-1.c: New testcase.
> > > ---
> > >  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
> > >  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
> > >  2 files changed, 46 insertions(+), 30 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > 
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > new file mode 100644
> > > index 00000000000..36463ca22c5
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3" } */
> > > +
> > > +typedef unsigned char uint8_t;
> > > +typedef short int16_t;
> > > +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> > > +  for (int y = 0; y < 4; y++) {
> > > +    for (int x = 0; x < 4; x++)
> > > +      diff[x + y * 4] = pix1[x] - pix2[x];
> > > +    pix1 += 16;
> > > +    pix2 += 32;
> > > +  }
> > > +}
> > > +
> > > +/* We can vectorize this without peeling for gaps and thus without epilogue,
> > > +   but the only thing we can reliably scan is the zero-padding trick for the
> > > +   partial loads.  */
> > > +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index a01099d3456..b26cc74f417 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >        dr_alignment_support alss;
> > >        int misalign = dr_misalignment (first_dr_info, vectype);
> > >        tree half_vtype;
> > > +      poly_uint64 remain;
> > > +      unsigned HOST_WIDE_INT tem, num;
> > >        if (overrun_p
> > >            && !masked_p
> > >            && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> > >                                                        vectype, misalign)))
> > >                 == dr_aligned
> > >                || alss == dr_unaligned_supported)
> > > -          && known_eq (nunits, (group_size - gap) * 2)
> > > -          && known_eq (nunits, group_size)
> > > -          && (vector_vector_composition_type (vectype, 2, &half_vtype)
> > > -              != NULL_TREE))
> > > +          && can_div_trunc_p (group_size
> > > +                              * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> > > +                              nunits, &tem, &remain)
> > > +          && (known_eq (remain, 0u)
> > > +              || (constant_multiple_p (nunits, remain, &num)
> > > +                  && (vector_vector_composition_type (vectype, num,
> > > +                                                      &half_vtype)
> > > +                      != NULL_TREE))))
> > >          overrun_p = false;
> > 
> > Might be worth renaming half_vtype now that it isn't necessarily
> > a strict half.
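As a reading aid, the effect of the new condition can be modeled with
plain unsigned integers roughly as below.  This is a hand-written
sketch (gap_access_avoidable_p is a made-up name); the real code uses
poly_uint64 with can_div_trunc_p and constant_multiple_p, and
additionally requires vector_vector_composition_type to find a
suitable piece type.

  /* Model of the new overrun_p check: the gap need not be touched if
     the last vector of the group is either fully used or its used
     'remain' lanes can be covered by a smaller piece that divides
     the vector evenly.  */
  static bool
  gap_access_avoidable_p (unsigned group_size, unsigned vf,
                          unsigned gap, unsigned nunits)
  {
    unsigned used = group_size * vf - gap; /* elements actually loaded */
    unsigned remain = used % nunits;       /* used lanes of last vector */
    if (remain == 0)
      return true;                         /* last vector fully used */
    return nunits % remain == 0;           /* num = nunits / remain pieces */
  }

For example, group_size 4, VF 4, gap 2 and nunits 8 gives used = 14
and remain = 6, which does not divide 8, so the gap access stays an
overrun; with gap 4 instead, remain is 4 and a half-vector load can
be used.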
> > 
> > >  
> > >        if (overrun_p && !can_overrun_p)
> > > @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
> > >                      unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
> > >                      unsigned int vect_align
> > >                        = vect_known_alignment_in_bytes (first_dr_info, vectype);
> > > -                    unsigned int scalar_dr_size
> > > -                      = vect_get_scalar_dr_size (first_dr_info);
> > > -                    /* If there's no peeling for gaps but we have a gap
> > > -                       with slp loads then load the lower half of the
> > > -                       vector only.  See get_group_load_store_type for
> > > -                       when we apply this optimization.  */
> > > -                    if (slp
> > > -                        && loop_vinfo
> > > -                        && !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> > > -                        && known_eq (nunits, (group_size - gap) * 2)
> > > -                        && known_eq (nunits, group_size)
> > > -                        && gap >= (vect_align / scalar_dr_size))
> > > -                      {
> > > -                        tree half_vtype;
> > > -                        new_vtype
> > > -                          = vector_vector_composition_type (vectype, 2,
> > > -                                                            &half_vtype);
> > > -                        if (new_vtype != NULL_TREE)
> > > -                          ltype = half_vtype;
> > > -                      }
> > >                      /* Try to use a single smaller load when we are about
> > >                         to load excess elements compared to the unrolled
> > > -                       scalar loop.
> > > -                       ??? This should cover the above case as well.  */
> > > -                    else if (known_gt ((vec_num * j + i + 1) * nunits,
> > > +                       scalar loop.  */
> > > +                    if (known_gt ((vec_num * j + i + 1) * nunits,
> > >                                        (group_size * vf - gap)))
> > 
> > Missing reindentation.
> > 
> > Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
> > We can only skip doing it if we know for sure that the load won't cross
> > the gap.  (Not sure whether the difference can trigger in practice.)
> > 
> > Thanks,
> > Richard
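For readers less familiar with poly-int: known_gt (a, b) requires
a > b to hold for every possible value of the runtime indeterminate,
while maybe_gt (a, b) only requires that it can hold for some value.
A minimal sketch of the difference, assuming a GCC build with
NUM_POLY_INT_COEFFS == 2 (e.g. aarch64 with SVE):

  poly_int64 a (2, 2);      /* represents 2 + 2*x for runtime x >= 0 */
  bool k = known_gt (a, 4); /* false: at x == 0, a evaluates to 2 */
  bool m = maybe_gt (a, 4); /* true: at x == 2, a evaluates to 6 */

So using known_gt here skips the smaller-load handling whenever
crossing the gap is merely possible rather than certain, which is the
"may vs. must" distinction mentioned above.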
> > >                        {
> > > +                        poly_uint64 remain = ((group_size * vf - gap)
> > > +                                              - (vec_num * j + i) * nunits);
> > >                          if (known_ge ((vec_num * j + i + 1) * nunits
> > >                                        - (group_size * vf - gap), nunits))
> > >                            /* DR will be unused.  */
> > > @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
> > >                               at least one element is accessed in the
> > >                               scalar loop.  */
> > >                            ;
> > > +                        else if (known_gt (vect_align,
> > > +                                           ((nunits - remain)
> > > +                                            * vect_get_scalar_dr_size
> > > +                                                (first_dr_info))))
> > > +                          /* Aligned access to the gap area when there's
> > > +                             at least one element in it is OK.  */
> > > +                          ;
> > >                          else
> > >                            {
> > > -                            auto remain
> > > -                              = ((group_size * vf - gap)
> > > -                                 - (vec_num * j + i) * nunits);
> > >                              /* remain should now be > 0 and < nunits.  */
> > >                              unsigned num;
> > >                              if (constant_multiple_p (nunits, remain, &num))
> > > @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
> > >                                  ltype = ptype;
> > >                                }
> > >                              /* Else use multiple loads or a masked load?  */
> > > +                            /* For loop vectorization we now should have
> > > +                               an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> > > +                               set.  */
> > > +                            if (loop_vinfo)
> > > +                              gcc_assert (new_vtype
> > > +                                          || LOOP_VINFO_PEELING_FOR_GAPS
> > > +                                               (loop_vinfo));
> > >                            }
> > >                        }
> > >                      tree offset

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)