Code generation for contiguous load vectorization can already deal with a generalized form of avoiding loads from a gap. The following extends the detection of the peeling-for-gaps requirement with that, gets rid of the old special-casing of a half load, and makes sure that when we do access the gap we have peeling for gaps enabled.
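To illustrate the access pattern involved, here is a minimal sketch (not part of the patch; the function name and sizes are made up) of a grouped load with a gap: each iteration uses 4 consecutive bytes out of a 16-byte stride, so for the last group a full-vector load would read into the trailing gap unless the vectorizer uses a smaller load, proves the access safe, or peels for gaps:

```c
#include <stdint.h>

/* Hypothetical example: each group uses 4 of 16 elements, leaving a
   gap of 12.  Vectorizing the inner loop with full-vector loads would
   read past the last used element of the final group into the gap;
   the patch teaches the vectorizer more cases where it can instead
   use a smaller (zero-padded) load and so avoid peeling for gaps.  */
void sum_groups (int16_t *restrict out, const uint8_t *in, int ngroups)
{
  for (int g = 0; g < ngroups; g++)
    {
      for (int x = 0; x < 4; x++)	/* 4 used elements per group */
	out[g * 4 + x] = in[x];
      in += 16;				/* stride 16 -> gap of 12 */
    }
}
```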
Bootstrapped and tested on x86_64-unknown-linux-gnu.

This is the first patch in a series to improve peeling for gaps; it
turned into an improvement of the generated code rather than just the
(delayed from stage3) removal of the "old" half-vector code path.
I'll wait for the pre-CI testing before pushing so you also have time
for some comments.

Richard.

	PR tree-optimization/115252
	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
	detecting the number of cases where we can avoid accessing a gap
	during code generation.
	(vectorizable_load): Remove old half-vector peeling for gap
	avoidance which is now redundant.  Add gap-aligned case where
	it's OK to access the gap.  Add assert that we have peeling for
	gaps enabled when we access a gap.

	* gcc.dg/vect/slp-gap-1.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
 gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
 2 files changed, 46 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
new file mode 100644
index 00000000000..36463ca22c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+
+typedef unsigned char uint8_t;
+typedef short int16_t;
+void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
+  for (int y = 0; y < 4; y++) {
+    for (int x = 0; x < 4; x++)
+      diff[x + y * 4] = pix1[x] - pix2[x];
+    pix1 += 16;
+    pix2 += 32;
+  }
+}
+
+/* We can vectorize this without peeling for gaps and thus without epilogue,
+   but the only thing we can reliably scan is the zero-padding trick for the
+   partial loads.  */
+/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a01099d3456..b26cc74f417 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
       dr_alignment_support alss;
       int misalign = dr_misalignment (first_dr_info, vectype);
       tree half_vtype;
+      poly_uint64 remain;
+      unsigned HOST_WIDE_INT tem, num;
       if (overrun_p
	  && !masked_p
	  && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
						      vectype, misalign)))
	       == dr_aligned
	      || alss == dr_unaligned_supported)
-	  && known_eq (nunits, (group_size - gap) * 2)
-	  && known_eq (nunits, group_size)
-	  && (vector_vector_composition_type (vectype, 2, &half_vtype)
-	      != NULL_TREE))
+	  && can_div_trunc_p (group_size
+			      * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
+			      nunits, &tem, &remain)
+	  && (known_eq (remain, 0u)
+	      || (constant_multiple_p (nunits, remain, &num)
+		  && (vector_vector_composition_type (vectype, num,
+						      &half_vtype)
+		      != NULL_TREE))))
	overrun_p = false;

       if (overrun_p && !can_overrun_p)
@@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
		    unsigned int vect_align
		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
-		    unsigned int scalar_dr_size
-		      = vect_get_scalar_dr_size (first_dr_info);
-		    /* If there's no peeling for gaps but we have a gap
-		       with slp loads then load the lower half of the
-		       vector only.  See get_group_load_store_type for
-		       when we apply this optimization.  */
-		    if (slp
-			&& loop_vinfo
-			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
-			&& known_eq (nunits, (group_size - gap) * 2)
-			&& known_eq (nunits, group_size)
-			&& gap >= (vect_align / scalar_dr_size))
-		      {
-			tree half_vtype;
-			new_vtype
-			  = vector_vector_composition_type (vectype, 2,
-							    &half_vtype);
-			if (new_vtype != NULL_TREE)
-			  ltype = half_vtype;
-		      }
		    /* Try to use a single smaller load when we are about
		       to load excess elements compared to the unrolled
-		       scalar loop.
-		       ??? This should cover the above case as well.  */
-		    else if (known_gt ((vec_num * j + i + 1) * nunits,
+		       scalar loop.  */
+		    if (known_gt ((vec_num * j + i + 1) * nunits,
				       (group_size * vf - gap)))
		      {
+			poly_uint64 remain = ((group_size * vf - gap)
+					      - (vec_num * j + i) * nunits);
			if (known_ge ((vec_num * j + i + 1) * nunits
				      - (group_size * vf - gap), nunits))
			  /* DR will be unused.  */
@@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
			     at least one element is accessed in the
			     scalar loop.  */
			  ;
+			else if (known_gt (vect_align,
+					   ((nunits - remain)
+					    * vect_get_scalar_dr_size
+						(first_dr_info))))
+			  /* Aligned access to the gap area when there's
+			     at least one element in it is OK.  */
+			  ;
			else
			  {
-			    auto remain
-			      = ((group_size * vf - gap)
-				 - (vec_num * j + i) * nunits);
			    /* remain should now be > 0 and < nunits.  */
			    unsigned num;
			    if (constant_multiple_p (nunits, remain, &num))
@@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
				ltype = ptype;
			      }
			    /* Else use multiple loads or a masked load?  */
+			    /* For loop vectorization we now should have
+			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
+			       set.  */
+			    if (loop_vinfo)
+			      gcc_assert (new_vtype
+					  || LOOP_VINFO_PEELING_FOR_GAPS
+					       (loop_vinfo));
			  }
		      }
		    tree offset
-- 
2.35.3