> Thanks,
> bin
> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
> 
>       PR tree-optimization/79347
>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>       (vect_do_peeling): Maintain profile counters during peeling.
> 
> gcc/testsuite/ChangeLog
> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
> 
>       PR tree-optimization/79347
>       * gcc.dg/vect/pr79347.c: New test.

> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
> b/gcc/testsuite/gcc.dg/vect/pr79347.c
> new file mode 100644
> index 0000000..586c638
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-fdump-tree-vect-all" } */
> +
> +short *a;
> +int c;
> +void n(void)
> +{
> +  for (int i = 0; i<c;i++)
> +    a[i]++;
> +}

Thanks for fixing the prologue.  I think there is still one extra problem in 
the vectorizer.
With the internal vectorized loop I now see:

;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
;;   Invalid sum of incoming frequencies 1961, should be 956
;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                8 [100.0%]  (FALLTHRU)
  # i_18 = PHI <i_14(10), i_38(8)>
  # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
  # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
  # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
  _2 = (long unsigned int) i_18;
  _3 = _2 * 2;
  _4 = a.0_1 + _3;
  vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
  _5 = *_4;
  vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
  _6 = (unsigned short) _5;
  vect__7.17_71 = vect__6.16_69 + vect_cst__70;
  _7 = _6 + 1;
  vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
  _8 = (short int) _7;
  MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
  i_14 = i_18 + 1;
  vectp_a.13_67 = vectp_a.13_66 + 16;
  vectp_a.19_76 = vectp_a.19_75 + 16;
  ivtmp_79 = ivtmp_78 + 1;
  if (ivtmp_79 < bnd.10_59)
    goto <bb 10>; [85.00%]
  else
    goto <bb 11>; [15.00%]

So it seems that the frequency of the loop itself is unrealistically scaled 
down.
Before vetorizing the frequency is 8500 and predicted number of iterations is
6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
exit probability exits with 15% probability and thus still has 6.6 iterations,
but by BB frequencies its body executes fewer times than the preheader.

Now this is a fragile area vectirizing loop should scale number of iterations 
down
8 times. However guessed CFG profiles are always very "flat". Of course
if loop iterated 6.6 times at the average vectorizing would not make any sense.
Making guessed profiles less flat is unrealistic, because average loop iterates 
few times,
but of course while vectorizing we make additional guess that the vectorizable 
loops
matters and the guessed profile is probably unrealistic.

GCC 6 seems however bit more consistent.
> +/* Apply probability PROB to basic block BB and its single succ edge.  */
> +
> +static void
> +apply_probability_for_bb (basic_block bb, int prob)
> +{
> +  bb->frequency = apply_probability (bb->frequency, prob);
> +  bb->count = apply_probability (bb->count, prob);
> +  gcc_assert (single_succ_p (bb));
> +  single_succ_edge (bb)->count = bb->count;
> +}
> +
>  /* Function vect_do_peeling.
>  
>     Input:
> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>       may be preferred.  */
>    basic_block anchor = loop_preheader_edge (loop)->src;
>    if (skip_vector)
> -    split_edge (loop_preheader_edge (loop));
> +    {
> +      split_edge (loop_preheader_edge (loop));
> +
> +      /* Due to the order in which we peel prolog and epilog, we first
> +      propagate probability to the whole loop.  The purpose is to
> +      avoid adjusting probabilities of both prolog and vector loops
> +      separately.  Note in this case, the probability of epilog loop
> +      needs to be scaled back later.  */
> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
> +      apply_probability_for_bb (bb_before_loop, prob_vector);
Aha, this is the bit I missed while trying to fix it myself.
I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
to do this.  I plan to revamp API for this next stage1, but lets keep this 
consistent.
Path is OK with this change and ...
> +      scale_loop_profile (loop, prob_vector, bound);
... please try to check if scaling is really done reasonably.  From the above
it seems that the vectorized loop is unrealistically scalled down that may 
prevent
further optimization for speed...

Thanks for looking into this,
Honza

Reply via email to