Hi Richard,

I believe this is in line with what you were explaining to me earlier. The one thing I haven't quite done here is the jump around for if there is no prolog peeling. Though I think skip_vectors introduces the jumping we need.

The main gist of it is:
1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the versioning threshold check when versioning a loop at first, 2) later we allow the vectorizer to use skip_vectors to basically check niters to be able to skip the vectorized loop on to the right epilogue, 3) after vectorizing the epilogue, we check whether this was a versioned loop and whether we skipped adding the versioning threshold, if so we add a threshold based on the last VF

There is a caveat here, if we don't vectorize the epilogue, because say our architecture doesn't know how, we will end up with a regression. Let's say we have a loop for which our target can vectorize for 32x but not 16x, we get:

if (alias & threshold_for_16x ())
{
  if (niters > 31)
    vectorized_31 ();
  else
    scalar ();
}
else
 scalar ();

Imagine our iteration count is 18, all we hit is the scalar loop, but now we go through 1x alias and 2x niters. Whereas potentially we could have done with just 1x niters.

A mitigation for this could be to only add the threshold check if we actually vectorized the last loop, I believe a: 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new "else" in vect_transform_loop would do the trick. Though then we will still have that extra alias check...

I am going to hack around in the back-end to "fake" a situation like this and see what happens.

Another potential issue arises when the profitability threshold obtained from the cost model isn't guaranteed to be either LE or EQ to the versioning threshold. In that case there are two separate checks, which now we no longer will attempt to fold.

I am trying to find a way to test and benchmark these changes. Unfortunately I am having trouble doing this for AVX512 as I find that the old '--param vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the gcc and perlbench benchmarks.


Cheers,
Andre

On 19/07/2019 13:38, Andre Vieira (lists) wrote:


On 19/07/2019 12:35, Richard Biener wrote:
On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:



On 15/07/2019 11:54, Richard Biener wrote:
On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:



On 12/07/2019 11:19, Richard Biener wrote:
On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:


I have code that can split the condition and alias checks in
'vect_loop_versioning'. For this approach I am considering keeping that
bit of
code and seeing if I can patch up the checks after vectorizing the
epilogue
further. I think initially I will just go with a "hacked up" way of
passing
down the bb with the iteration check and split the false edge every time
we
vectorize it further. Will keep you posted on progress. If you have any
pointers/tips they are most welc ome!

I thought to somehow force the idea that we have a prologue loop
to the vectorizer so it creates the number-of-vectorized iterations
check and branch around the main (highest VF) vectorized loop.


Hmm I think I may have skimmed over this earlier. I am reading it now and am
not entirely sure what you mean by this. Specifically the "number of
vectorized iterations" or how a prologue loop plays a role in this.

When there is no prologue then the versioning condition currently
ensures we always enter the vector loop.  Only if there is a prologue
is there a check and branch eventually skipping right to the epilogue.
If the versioning condition (now using a lower VF) doesn't guarantee
this we have to add this jump-around.

Right, I haven't looked at the prologue path yet. I had a quick look and can't find where this branch skipping to the epilogue is constructed.  I will take a better look after I got my current example to work.


I guess this sheds some light on the comment above. And it definitely implies we would need to know the lowest VF when creating this condition. Which is
tricky.

We can simply use the smallest vector size supported by the target to
derive it from the actual VF, no?

So I could wait to introduce this check after all epilogue vectorization is done, back track to the last niter check and duplicate that in the outer loop.

What I didn't want to do was use the smallest possible vector size for the target because I was under the impression that does not necessarily correspond to the smallest VF we may have for a loop, as the vectorizer may have decided not to vectorize for that vector size because of costs? If it I can assume this never happens, that once it starts to vectorize epilogues that it will keep vectorizing them for any vector size it knows off then yeah I can use that.


I'm not sure I understand - why would you have any check not inside
the outer loop?  Yes, we now eventually hoist versioning checks
but the VF checks for the individual variants should be around
the vectorized loop itself (so not really part of the versioning check).

Yeah I agree. I was just explaining what I had done wrong now.

Cheers,
Andre

PS: I often find myself having to patch the DOMINATOR information, sometimes its easy to, but sometimes it can get pretty complicated. I wonder whether it would make sense to write something that traverses a loop and corrects this,
if it doesn't exist already.

There's iterate-fix-dominators, but unless you create new edges/blocks
manually rather than doing split-block/redirect-edge which should do
dominator updating for you.

Ah I was doing everything manually after having some bad experiences with lv_add_condition_to_bb.  I will have a look at those thanks!

Cheers,
Andre


Richard.



Richard.



diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..b1f1d2d6c6f8b48cce1a666dfbf8cf542f6857df 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -268,6 +268,10 @@ public:
      the basic-block from being collected but its index can still be
      reused.  */
   basic_block former_header;
+
+  /* Basic-block containing the checks for versioning this loop.  This is set
+     by `vect_loop_versioning'.  */
+  basic_block version_check_bb;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..47873eb76543c3bfa39c197a8f98eeaf132faf4c 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -355,6 +355,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->version_check_bb = NULL;
   return loop;
 }
 
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..b899a5cfb626d12125d0595b3bbbf161cb27d446 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2401,7 +2401,8 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 bool vect_epilogues_nomask)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2474,7 +2475,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues_nomask));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2968,7 +2970,8 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
 class loop *
 vect_loop_versioning (loop_vec_info loop_vinfo,
 		      unsigned int th, bool check_profitability,
-		      poly_uint64 versioning_threshold)
+		      poly_uint64 versioning_threshold,
+		      bool vect_epilogues_nomask)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2995,7 +2998,7 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
 					    th - 1));
-  if (maybe_ne (versioning_threshold, 0U))
+  if (maybe_ne (versioning_threshold, 0U) && !vect_epilogues_nomask)
     {
       tree expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			       build_int_cst (TREE_TYPE (scalar_loop_iters),
@@ -3199,6 +3202,9 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
       update_ssa (TODO_update_ssa);
     }
 
+  if (vect_epilogues_nomask)
+    LOOP_VINFO_LOOP (loop_vinfo)->version_check_bb = condition_bb;
+
   if (version_niter)
     {
       /* The versioned loop could be infinite, we need to clear existing
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..6a7329d5506233c6974a86bd56697eacc0bbea5c 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1852,6 +1852,40 @@ vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
     }
 }
 
+/* During peeling, we need to check if number of loop iterations is
+   enough for both peeled prolog loop and vector loop.  This check
+   can be merged along with threshold check of loop versioning, so
+   increase threshold for this case if necessary.  */
+
+static poly_uint64
+vec_calc_versioning_threshold (loop_vec_info loop_vinfo, poly_uint64 vf)
+{
+  poly_uint64 niters_th = 0;
+
+  if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    {
+      /* Niters for peeled prolog loop.  */
+      if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
+	{
+	  dr_vec_info *dr_info = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
+	  tree vectype = STMT_VINFO_VECTYPE (dr_info->stmt);
+	  niters_th += TYPE_VECTOR_SUBPARTS (vectype) - 1;
+	}
+      else
+	niters_th += LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+    }
+
+  /* Niters for at least one iteration of vectorized loop.  */
+  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    niters_th += vf;
+  /* One additional iteration because of peeling for gap.  */
+  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    niters_th += 1;
+
+  return niters_th;
+}
+
+
 /* Function vect_analyze_loop_2.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -2179,35 +2213,10 @@ start_over:
         }
     }
 
-  /* During peeling, we need to check if number of loop iterations is
-     enough for both peeled prolog loop and vector loop.  This check
-     can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
   if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
-    {
-      poly_uint64 niters_th = 0;
-
-      if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
-	{
-	  /* Niters for peeled prolog loop.  */
-	  if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
-	    {
-	      dr_vec_info *dr_info = LOOP_VINFO_UNALIGNED_DR (loop_vinfo);
-	      tree vectype = STMT_VINFO_VECTYPE (dr_info->stmt);
-	      niters_th += TYPE_VECTOR_SUBPARTS (vectype) - 1;
-	    }
-	  else
-	    niters_th += LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
-	}
-
-      /* Niters for at least one iteration of vectorized loop.  */
-      if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-	niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-      /* One additional iteration because of peeling for gap.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	niters_th += 1;
-      LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = niters_th;
-    }
+    LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo)
+      = vec_calc_versioning_threshold (loop_vinfo,
+				       LOOP_VINFO_VECT_FACTOR (loop_vinfo));
 
   gcc_assert (known_eq (vectorization_factor,
 			LOOP_VINFO_VECT_FACTOR (loop_vinfo)));
@@ -8530,7 +8539,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 	}
       class loop *sloop
 	= vect_loop_versioning (loop_vinfo, th, check_profitability,
-				versioning_threshold);
+				versioning_threshold,
+				PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK));
+
       sloop->force_vectorize = false;
       check_profitability = false;
     }
@@ -8557,7 +8568,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK));
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8865,9 +8877,99 @@ vect_transform_loop (loop_vec_info loop_vinfo)
       /* We may need to if-convert epilogue to vectorize it.  */
       if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
 	tree_if_conversion (epilogue);
+
+      return epilogue;
     }
+  /* EPILOGUE is null, let's see if we did versioning on this loop and
+     whether we need to patch up the condition checks.  */
+  else
+    {
+      gimple_seq stmts;
+      gimple_stmt_iterator stmt_it;
+      gimple * old_check;
+      tree low_niters_check, new_check;
+      basic_block version_check_bb;
+      loop_vec_info orig_loop_vinfo;
+      poly_uint64 low_niters;
+
+      /* We only exclude the threshold check if we are vectorizing the
+	 epilogues.  */
+      if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
+	return NULL;
+
+      /* Find the top loop info as this will be the one that was versioned.  */
+      orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+
+      while (orig_loop_vinfo &&
+	     LOOP_VINFO_ORIG_LOOP_INFO (orig_loop_vinfo) != NULL)
+	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (orig_loop_vinfo);
+
+      if (!orig_loop_vinfo)
+	return NULL;
+
+      version_check_bb = LOOP_VINFO_LOOP (orig_loop_vinfo)->version_check_bb;
+
+      /* If we don't have a basic block to patch up then there is nothing to
+	 do.  */
+      if (!version_check_bb)
+	return NULL;
+
+      low_niters
+	= vec_calc_versioning_threshold (orig_loop_vinfo,
+					 LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+
+      /* All conditions indicate we should add a check that the number of
+	 iterations is higher than the versioning threshold using the lowest
+	 vectorization factor to enter the vectorized version of the loop.  */
+      old_check = gsi_stmt (gsi_last_bb (version_check_bb));
+
+      /* Construct the check for the lowest threshold.  */
+      low_niters_check = LOOP_VINFO_NITERS (orig_loop_vinfo);
+      low_niters_check
+	= fold_build2 (GT_EXPR, boolean_type_node,
+		       low_niters_check,
+		       build_int_cst (TREE_TYPE (low_niters_check),
+				      low_niters - 1));
+
+      low_niters_check
+	= force_gimple_operand_1 (unshare_expr (low_niters_check),
+				  &stmts,
+				  is_gimple_condexpr,
+				  NULL_TREE);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gsi_insert_seq_before (&stmt_it, stmts, GSI_SAME_STMT);
+
+
+      /* Get the last versioning check, and merge it with the new check.  */
+      gcc_assert (gimple_code (old_check) == GIMPLE_COND);
+      new_check
+	= fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+		       low_niters_check,
+		       gimple_cond_lhs (old_check));
+
+      new_check
+	= force_gimple_operand_1 (unshare_expr (new_check),
+				  &stmts,
+				  is_gimple_condexpr,
+				  NULL_TREE);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gsi_insert_seq_before (&stmt_it, stmts, GSI_SAME_STMT);
+
+      stmt_it = gsi_last_bb (version_check_bb);
+      gimple *stmt = gsi_stmt (stmt_it);
+
+      /* Set the new check as the condition for going into the vectorized loop.
+       */
+      gimple_cond_set_condition_from_tree (as_a <gcond *> (stmt),
+					   new_check);
+
+      /* Make sure we update def/use information.  */
+      update_stmt (gsi_stmt (gsi_last_bb (version_check_bb)));
 
-  return epilogue;
+      return NULL;
+    }
 }
 
 /* The code below is trying to perform simple optimization - revert
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..be282a974a76617fbb0cf6f0e34b08d4d652c3df 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1481,9 +1481,10 @@ extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info, unsigned int, bool,
-				   poly_uint64);
+				   poly_uint64, bool);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    bool);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);

Reply via email to