On 18 March 2019 10:58:53 CET, Richard Sandiford <richard.sandif...@arm.com> 
wrote:
>This patch fixes a case in which we vectorised something with a
>fully-predicated loop even after the cost model had rejected it.
>E.g. the loop in the testcase has the costs:
>
>  Vector inside of loop cost: 27
>  Vector prologue cost: 0
>  Vector epilogue cost: 0
>  Scalar iteration cost: 7
>  Scalar outside cost: 6
>  Vector outside cost: 0
>  prologue iterations: 0
>  epilogue iterations: 0
>
>and we can see that the loop executes at most three times, but we
>decided to vectorise it anyway.
>
>(The costs here are equal for three iterations, but the same thing
>happens even when the vector code is strictly more expensive.)
>
>The problem is the handling of "/VF" in:
>
>  /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only.  The following condition
>     must hold true:
>     SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>     where
>     SIC = scalar iteration cost, VIC = vector iteration cost,
>     VOC = vector outside cost, VF = vectorization factor,
>     PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>     SOC = scalar outside cost for run time cost model check.  */
>
>We treat the "/VF" as truncating, but for fully-predicated loops, it's
>closer to a ceil division, since fractional iterations are handled by a
>full iteration with some predicate bits set to false.
>
>The easiest fix seemed to be to calculate the minimum number of vector
>iterations first, then use that to calculate the minimum number of
>scalar
>iterations.
>
>Calculating the minimum number of vector iterations might make sense
>for
>unpredicated loops too, since calculating the scalar niters directly
>doesn't take into account the fact that the VIC multiple has to be an
>integer.  But the handling of PL_ITERS and EP_ITERS for unpredicated
>loops is a bit hand-wavy anyway, so maybe vagueness here cancels out
>vagueness there?
>
>Either way, changing this for unpredicated loops would be much too
>invasive for stage 4, so the patch keeps it specific to
>fully-predicated
>loops (i.e. SVE) for now.  There's no functional change for other
>targets.
>
>Tested on aarch64-linux-gnu with and without SVE, and on
>x86_64-linux-gnu.
>This is a regression introduced by the original cost model patches for
>fully-predicated loops, so OK for GCC 9?
>
>Thanks,
>Richard
>
>
>2019-03-18  Richard Sandiford  <richard.sandif...@arm.com>
>
>gcc/
>       * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the
>       calculation of the minimum number of scalar iterations for
>       fully-predicated loops.
>
>gcc/testsuite/
>       * gcc.target/aarch64/sve/cost_model_1.c: New test.
>
>Index: gcc/tree-vect-loop.c
>===================================================================
>--- gcc/tree-vect-loop.c       2019-03-08 18:15:33.668751875 +0000
>+++ gcc/tree-vect-loop.c       2019-03-18 09:55:03.257194326 +0000
>@@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop
>  /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only.  The following condition
>      must hold true:
>-     SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>+     SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC
>      where
>      SIC = scalar iteration cost, VIC = vector iteration cost,
>      VOC = vector outside cost, VF = vectorization factor,
>-     PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>+     NPEEL = prologue iterations + epilogue iterations,
>      SOC = scalar outside cost for run time cost model check.  */
> 
>-  if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost)
>+  int saving_per_viter = (scalar_single_iter_cost * assumed_vf
>+                        - vec_inside_cost);
>+  if (saving_per_viter <= 0)
>+    {
>+      if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
>+      warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
>+                  "vectorization did not happen for a simd loop");
>+
>+      if (dump_enabled_p ())
>+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>+                       "cost model: the vector iteration cost = %d "
>+                       "divided by the scalar iteration cost = %d "
>+                       "is greater or equal to the vectorization factor = %d"
>+                         ".\n",
>+                       vec_inside_cost, scalar_single_iter_cost, assumed_vf);
>+      *ret_min_profitable_niters = -1;
>+      *ret_min_profitable_estimate = -1;
>+      return;
>+    }
>+
>+  /* ??? The "if" arm is written to handle all cases; see below for
>what
>+     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
>+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+    {

The condition above seems to contain...


>+      /* Rewriting the condition above in terms of the number of
>+       vector iterations (vniters) rather than the number of
>+       scalar iterations (niters) gives:
>+
>+       SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC
>+
>+       <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC
>+
>+       For integer N, X and Y when X > 0:
>+
>+       N * X > Y <==> N >= (Y /[floor] X) + 1.  */
>+      int outside_overhead = (vec_outside_cost
>+                            - scalar_single_iter_cost * peel_iters_prologue
>+                            - scalar_single_iter_cost * peel_iters_epilogue
>+                            - scalar_outside_cost);
>+      /* We're only interested in cases that require at least one
>+       vector iteration.  */
>+      int min_vec_niters = 1;
>+      if (outside_overhead > 0)
>+      min_vec_niters = outside_overhead / saving_per_viter + 1;
>+
>+      if (dump_enabled_p ())
>+      dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
>+                   min_vec_niters);
>+
>+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+      {

... this identical condition, AFAICS?
So this second conditions else arm should be dead, shouldn't it?

thanks,

>+        /* Now that we know the minimum number of vector iterations,
>+           find the minimum niters for which the scalar cost is larger:
>+
>+           SIC * niters > VIC * vniters + VOC - SOC
>+
>+           We know that the minimum niters is no more than
>+           vniters * VF + NPEEL, but it might be (and often is) less
>+           than that if a partial vector iteration is cheaper than the
>+           equivalent scalar code.  */
>+        int threshold = (vec_inside_cost * min_vec_niters
>+                         + vec_outside_cost
>+                         - scalar_outside_cost);
>+        if (threshold <= 0)
>+          min_profitable_iters = 1;
>+        else
>+          min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>+      }
>+      else
>+      /* Convert the number of vector iterations into a number of
>+         scalar iterations.  */
>+      min_profitable_iters = (min_vec_niters * assumed_vf
>+                              + peel_iters_prologue
>+                              + peel_iters_epilogue);
>+    }
>+  else
>     {
>       min_profitable_iters = ((vec_outside_cost - scalar_outside_cost)
>                             * assumed_vf
>@@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop
>         min_profitable_iters = 0;
>       else
>       {
>-        min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf)
>-                                 - vec_inside_cost);
>+        min_profitable_iters /= saving_per_viter;
> 
>         if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters)
>             <= (((int) vec_inside_cost * min_profitable_iters)
>@@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop
>           min_profitable_iters++;
>       }
>     }
>-  /* vector version will never be profitable.  */
>-  else
>-    {
>-      if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
>-      warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
>-                  "vectorization did not happen for a simd loop");
>-
>-      if (dump_enabled_p ())
>-        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>-                       "cost model: the vector iteration cost = %d "
>-                       "divided by the scalar iteration cost = %d "
>-                       "is greater or equal to the vectorization factor = %d"
>-                         ".\n",
>-                       vec_inside_cost, scalar_single_iter_cost, assumed_vf);
>-      *ret_min_profitable_niters = -1;
>-      *ret_min_profitable_estimate = -1;
>-      return;
>-    }
> 
>   if (dump_enabled_p ())
>     dump_printf (MSG_NOTE,
>@@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop
> 
>     Non-vectorized variant is SIC * niters and it must win over vector
>variant on the expected loop trip count.  The following condition must
>hold true:
>-     SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC 
>*/
>+     SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC  */
> 
>   if (vec_outside_cost <= 0)
>     min_profitable_estimate = 0;
>+  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+    {
>+      /* This is a repeat of the code above, but with + SOC rather
>+       than - SOC.  */
>+      int outside_overhead = (vec_outside_cost
>+                            - scalar_single_iter_cost * peel_iters_prologue
>+                            - scalar_single_iter_cost * peel_iters_epilogue
>+                            + scalar_outside_cost);
>+      int min_vec_niters = 1;
>+      if (outside_overhead > 0)
>+      min_vec_niters = outside_overhead / saving_per_viter + 1;
>+
>+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+      {
>+        int threshold = (vec_inside_cost * min_vec_niters
>+                         + vec_outside_cost
>+                         + scalar_outside_cost);
>+        min_profitable_estimate = threshold / scalar_single_iter_cost + 1;
>+      }
>+      else
>+      min_profitable_estimate = (min_vec_niters * assumed_vf
>+                                 + peel_iters_prologue
>+                                 + peel_iters_epilogue);
>+    }
>   else
>     {
>    min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost)
>Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c
>===================================================================
>--- /dev/null  2019-03-08 11:40:14.606883727 +0000
>+++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c        2019-03-18
>09:55:03.257194326 +0000
>@@ -0,0 +1,12 @@
>+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
>+
>+void
>+f (unsigned int *restrict x, unsigned int *restrict y,
>+   unsigned char *restrict z, unsigned int n)
>+{
>+  for (unsigned int i = 0; i < n % 4; ++i)
>+    x[i] = x[i] + y[i] + z[i];
>+}
>+
>+/* { dg-final { scan-tree-dump "not vectorized: estimated iteration
>count too small" vect } } */
>+/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */

Reply via email to