Code generation for contiguous load vectorization can already deal with a generalized form of avoiding loads from a gap. The following extends the detection of the peeling-for-gaps requirement with that, gets rid of the old special-casing of a half load, and makes sure that when we do access the gap we have peeling for gaps enabled.
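To illustrate the access pattern involved, here is a minimal sketch (not part of the patch; the function name and sizes are made up) of a grouped load with a gap: each iteration uses 4 consecutive bytes out of a 16-byte stride, so for the last group a full-vector load would read into the trailing gap unless the vectorizer uses a smaller load, proves the access safe, or peels for gaps:

```c
#include <stdint.h>

/* Hypothetical example: each group uses 4 of 16 elements, leaving a
   gap of 12.  Vectorizing the inner loop with full-vector loads would
   read past the last used element of the final group into the gap;
   the patch teaches the vectorizer more cases where it can instead
   use a smaller (zero-padded) load and so avoid peeling for gaps.  */
void sum_groups (int16_t *restrict out, const uint8_t *in, int ngroups)
{
  for (int g = 0; g < ngroups; g++)
    {
      for (int x = 0; x < 4; x++)	/* 4 used elements per group */
	out[g * 4 + x] = in[x];
      in += 16;				/* stride 16 -> gap of 12 */
    }
}
```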
Bootstrapped and tested on x86_64-unknown-linux-gnu.

This is the first patch in a series to improve peeling for gaps; it
turned into an improvement of the generated code rather than just the
(delayed from stage3) removal of the "old" half-vector code path.
I'll wait for the pre-CI testing before pushing so you also have time
for some comments.

Richard.

	PR tree-optimization/115252
	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
	detecting the number of cases where we can avoid accessing a gap
	during code generation.
	(vectorizable_load): Remove old half-vector peeling for gap
	avoidance which is now redundant.  Add gap-aligned case where
	it's OK to access the gap.  Add assert that we have peeling for
	gaps enabled when we access a gap.

	* gcc.dg/vect/slp-gap-1.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
 gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
 2 files changed, 46 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
new file mode 100644
index 00000000000..36463ca22c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+
+typedef unsigned char uint8_t;
+typedef short int16_t;
+void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
+  for (int y = 0; y < 4; y++) {
+    for (int x = 0; x < 4; x++)
+      diff[x + y * 4] = pix1[x] - pix2[x];
+    pix1 += 16;
+    pix2 += 32;
+  }
+}
+
+/* We can vectorize this without peeling for gaps and thus without epilogue,
+   but the only thing we can reliably scan is the zero-padding trick for the
+   partial loads.  */
+/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a01099d3456..b26cc74f417 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
       dr_alignment_support alss;
       int misalign = dr_misalignment (first_dr_info, vectype);
       tree half_vtype;
+      poly_uint64 remain;
+      unsigned HOST_WIDE_INT tem, num;
       if (overrun_p
	  && !masked_p
	  && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
						      vectype, misalign)))
	       == dr_aligned
	      || alss == dr_unaligned_supported)
-	  && known_eq (nunits, (group_size - gap) * 2)
-	  && known_eq (nunits, group_size)
-	  && (vector_vector_composition_type (vectype, 2, &half_vtype)
-	      != NULL_TREE))
+	  && can_div_trunc_p (group_size
+			      * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
+			      nunits, &tem, &remain)
+	  && (known_eq (remain, 0u)
+	      || (constant_multiple_p (nunits, remain, &num)
+		  && (vector_vector_composition_type (vectype, num,
+						      &half_vtype)
+		      != NULL_TREE))))
	overrun_p = false;

       if (overrun_p && !can_overrun_p)
@@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
		    unsigned int vect_align
		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
-		    unsigned int scalar_dr_size
-		      = vect_get_scalar_dr_size (first_dr_info);
-		    /* If there's no peeling for gaps but we have a gap
-		       with slp loads then load the lower half of the
-		       vector only.  See get_group_load_store_type for
-		       when we apply this optimization.  */
-		    if (slp
-			&& loop_vinfo
-			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
-			&& known_eq (nunits, (group_size - gap) * 2)
-			&& known_eq (nunits, group_size)
-			&& gap >= (vect_align / scalar_dr_size))
-		      {
-			tree half_vtype;
-			new_vtype
-			  = vector_vector_composition_type (vectype, 2,
-							    &half_vtype);
-			if (new_vtype != NULL_TREE)
-			  ltype = half_vtype;
-		      }
		    /* Try to use a single smaller load when we are about
		       to load excess elements compared to the unrolled
-		       scalar loop.
-		       ??? This should cover the above case as well.  */
-		    else if (known_gt ((vec_num * j + i + 1) * nunits,
+		       scalar loop.  */
+		    if (known_gt ((vec_num * j + i + 1) * nunits,
				       (group_size * vf - gap)))
		      {
+			poly_uint64 remain = ((group_size * vf - gap)
+					      - (vec_num * j + i) * nunits);
			if (known_ge ((vec_num * j + i + 1) * nunits
				      - (group_size * vf - gap), nunits))
			  /* DR will be unused.  */
@@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
			     at least one element is accessed in the
			     scalar loop.  */
			  ;
+			else if (known_gt (vect_align,
+					   ((nunits - remain)
+					    * vect_get_scalar_dr_size
+						(first_dr_info))))
+			  /* Aligned access to the gap area when there's
+			     at least one element in it is OK.  */
+			  ;
			else
			  {
-			    auto remain
-			      = ((group_size * vf - gap)
-				 - (vec_num * j + i) * nunits);
			    /* remain should now be > 0 and < nunits.  */
			    unsigned num;
			    if (constant_multiple_p (nunits, remain, &num))
@@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
				ltype = ptype;
			      }
			    /* Else use multiple loads or a masked load?  */
+			    /* For loop vectorization we now should have
+			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
+			       set.  */
+			    if (loop_vinfo)
+			      gcc_assert (new_vtype
+					  || LOOP_VINFO_PEELING_FOR_GAPS
+					       (loop_vinfo));
			  }
		      }
		    tree offset
-- 
2.35.3