Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-08 Thread Jan Hubicka
> On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka  wrote:
> > Hi,
> > I benchmarked the patch moving loop header copying and it is quite 
> > noticeable win.
> >
> > Some testsuite updating is needed. In many cases it is just because the
> > optimizations are now happening earlier.
> > There are however few testusite failures I have torubles to deal with:
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c 
> > scan-tree-dump-times vrp1 "Threaded jump" 3
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c 
> > scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
> > ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c 
> > scan-tree-dump-times vect "vectorized 1 loops" 2
> > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  
> > scan-tree-dump-times vrp1 "if " 1
> > ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  
> > scan-tree-dump-times vrp1 "if " 1
> >
> > This is mostly about VRP losing its ability to thread some jumps from the
> > duplicated loop header out of the loop across the loopback edge.  This 
> > seems to
> > be due to loop updating logic.  Do we care about these?
> 
> Yes, I think so.  At least we care that the optimized result is the same.

it is not, we really lose optimization in those testcases.
The ones that are still optimized well I updated in the patch bellow.
> 
> Can you elaborate on "due to loop updating logic"?

The problem is:
  /* We do not allow VRP information to be used for jump threading
 across a back edge in the CFG.  Otherwise it becomes too
 difficult to avoid eliminating loop exit tests.  Of course
 EDGE_DFS_BACK is not accurate at this time so we have to
 recompute it.  */
  mark_dfs_back_edges ();

  /* Do not thread across edges we are about to remove.  Just marking
 them as EDGE_DFS_BACK will do.  */
  FOR_EACH_VEC_ELT (edge, to_remove_edges, i, e)
e->flags |= EDGE_DFS_BACK;

Loop header copying puts some conditional before loop and we want to thread
up to exit out of the loop (that I think it rather important optimization).
But it no longer happens before back edge is in the way.  At least that was
the case in the tree-ssa failures I analyzed.
> 
> Can you elaborate on the def_split_header_continue_p change?  Which probably
> should be tested and installed separately?

Yes, that one is latent bug.  The code is expecting that loop exit is recognized
by loop depth decreasing that is not true.
It reproduces as ICE during bootstrap with the patch.
I will regtest/bootstrap and commit it today.

Honza


Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-08 Thread Richard Guenther
On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka  wrote:
> Hi,
> I benchmarked the patch moving loop header copying and it is quite noticeable 
> win.
>
> Some testsuite updating is needed. In many cases it is just because the
> optimizations are now happening earlier.
> There are however few testusite failures I have torubles to deal with:
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times 
> vrp1 "Threaded jump" 3
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c 
> scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
> ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c 
> scan-tree-dump-times vect "vectorized 1 loops" 2
> ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  
> scan-tree-dump-times vrp1 "if " 1
> ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  
> scan-tree-dump-times vrp1 "if " 1
>
> This is mostly about VRP losing its ability to thread some jumps from the
> duplicated loop header out of the loop across the loopback edge.  This seems 
> to
> be due to loop updating logic.  Do we care about these?

Yes, I think so.  At least we care that the optimized result is the same.

Can you elaborate on "due to loop updating logic"?

Can you elaborate on the def_split_header_continue_p change?  Which probably
should be tested and installed separately?

Thanks,
Richard.

> Honza
>
> Index: tree-ssa-threadupdate.c
> ===
> *** tree-ssa-threadupdate.c (revision 192123)
> --- tree-ssa-threadupdate.c (working copy)
> *** static bool
> *** 846,854 
>   def_split_header_continue_p (const_basic_block bb, const void *data)
>   {
> const_basic_block new_header = (const_basic_block) data;
> !   return (bb != new_header
> ! && (loop_depth (bb->loop_father)
> ! >= loop_depth (new_header->loop_father)));
>   }
>
>   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
> --- 846,860 
>   def_split_header_continue_p (const_basic_block bb, const void *data)
>   {
> const_basic_block new_header = (const_basic_block) data;
> !   const struct loop *l;
> !
> !   if (bb == new_header
> !   || loop_depth (bb->loop_father) < loop_depth 
> (new_header->loop_father))
> ! return false;
> !   for (l = bb->loop_father; l; l = loop_outer (l))
> ! if (l == new_header->loop_father)
> !   return true;
> !   return false;
>   }
>
>   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
> Index: testsuite/gcc.dg/unroll_2.c
> ===
> *** testsuite/gcc.dg/unroll_2.c (revision 192123)
> --- testsuite/gcc.dg/unroll_2.c (working copy)
> ***
> *** 1,5 
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
> -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
> -fenable-rtl-loop2_unroll" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
> -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
> -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: testsuite/gcc.dg/unroll_3.c
> ===
> *** testsuite/gcc.dg/unroll_3.c (revision 192123)
> --- testsuite/gcc.dg/unroll_3.c (working copy)
> ***
> *** 1,5 
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
> -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" 
> } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> --- 1,5 
>   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
> ! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
> -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo 
> -fno-tree-dominator-opts" } */
>
>   unsigned a[100], b[100];
>   inline void bar()
> Index: testsuite/gcc.dg/torture/pr23821.c
> ===
> *** testsuite/gcc.dg/torture/pr23821.c  (revision 192123)
> --- testsuite/gcc.dg/torture/pr23821.c  (working copy)
> ***
> *** 1,9 
>   /* { dg-do compile } */
>   /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
> ! /* At -O1 DOM threads a jump in a non-optimal way which leads to
>  the bogus propagation.  */
> ! /* { dg-skip-if "" { *-*-* } { "-O1" } { "" } } */
> ! /* { dg-options "-fdump-tree-ivcanon-details" } */
>
>   int a[199];
>
> --- 1,8 
>   /* { dg-do compile } */
>   /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
> ! /* DOM threads a jump in a non-o

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-06 Thread Jan Hubicka
Hi,
I benchmarked the patch moving loop header copying and it is quite noticeable 
win.

Some testsuite updating is needed. In many cases it is just because the
optimizations are now happening earlier.
There are however few testusite failures I have torubles to deal with:
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times 
vrp1 "Threaded jump" 3
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c 
scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c 
scan-tree-dump-times vect "vectorized 1 loops" 2
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  
scan-tree-dump-times vrp1 "if " 1
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  
scan-tree-dump-times vrp1 "if " 1

This is mostly about VRP losing its ability to thread some jumps from the
duplicated loop header out of the loop across the loopback edge.  This seems to
be due to loop updating logic.  Do we care about these?

Honza

Index: tree-ssa-threadupdate.c
===
*** tree-ssa-threadupdate.c (revision 192123)
--- tree-ssa-threadupdate.c (working copy)
*** static bool
*** 846,854 
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
const_basic_block new_header = (const_basic_block) data;
!   return (bb != new_header
! && (loop_depth (bb->loop_father)
! >= loop_depth (new_header->loop_father)));
  }
  
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
--- 846,860 
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
const_basic_block new_header = (const_basic_block) data;
!   const struct loop *l;
! 
!   if (bb == new_header
!   || loop_depth (bb->loop_father) < loop_depth (new_header->loop_father))
! return false;
!   for (l = bb->loop_father; l; l = loop_outer (l))
! if (l == new_header->loop_father)
!   return true;
!   return false;
  }
  
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
Index: testsuite/gcc.dg/unroll_2.c
===
*** testsuite/gcc.dg/unroll_2.c (revision 192123)
--- testsuite/gcc.dg/unroll_2.c (working copy)
***
*** 1,5 
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
-fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
-fenable-rtl-loop2_unroll" } */
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
-fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
-fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: testsuite/gcc.dg/unroll_3.c
===
*** testsuite/gcc.dg/unroll_3.c (revision 192123)
--- testsuite/gcc.dg/unroll_3.c (working copy)
***
*** 1,5 
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
-fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" } 
*/
  
  unsigned a[100], b[100];
  inline void bar()
--- 1,5 
  /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
-fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo 
-fno-tree-dominator-opts" } */
  
  unsigned a[100], b[100];
  inline void bar()
Index: testsuite/gcc.dg/torture/pr23821.c
===
*** testsuite/gcc.dg/torture/pr23821.c  (revision 192123)
--- testsuite/gcc.dg/torture/pr23821.c  (working copy)
***
*** 1,9 
  /* { dg-do compile } */
  /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
! /* At -O1 DOM threads a jump in a non-optimal way which leads to
 the bogus propagation.  */
! /* { dg-skip-if "" { *-*-* } { "-O1" } { "" } } */
! /* { dg-options "-fdump-tree-ivcanon-details" } */
  
  int a[199];
  
--- 1,8 
  /* { dg-do compile } */
  /* { dg-skip-if "" { *-*-* } { "-O0" "-fno-fat-lto-objects" } { "" } } */
! /* DOM threads a jump in a non-optimal way which leads to
 the bogus propagation.  */
! /* { dg-options "-fdump-tree-ivcanon-details -fno-tree-dominator-opts" } */
  
  int a[199];
  
Index: testsuite/gcc.dg/tree-ssa/ivopt_1.c
===
*** testsuite/gcc.dg/tree-ssa/ivopt_1.c (revision 192123)
--- testsuite/gcc.dg/tree-ssa/ivopt_1.c (working copy)
*** void foo (int i_width, TYPE dst, TYPE sr
*** 14,18 
  }
  
  
! /* { dg-final { scan-tree-dump-times "PHI 

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-05 Thread Jan Hubicka
Hi,
this is the udpated patch I comitted after testing.  I suppose we will need to 
find way
to make SOC smaller for simple loops - it is way too overestimated currently.

* tree-vectorizer.h (vect_estimate_min_profitable_iters): Remove.
* tree-vect-loop.c (vect_estimate_min_profitable_iters): Declare here.
(vect_analyze_loop_operations): Use loop count estimate to rule out
unprofitable vectorization.
(vect_estimate_min_profitable_iters): Return 
ret_min_profitable_estimate.
Index: tree-vectorizer.h
===
*** tree-vectorizer.h   (revision 192114)
--- tree-vectorizer.h   (working copy)
*** extern bool vectorizable_live_operation 
*** 976,982 
  extern bool vectorizable_reduction (gimple, gimple_stmt_iterator *, gimple *,
  slp_tree);
  extern bool vectorizable_induction (gimple, gimple_stmt_iterator *, gimple *);
- extern int vect_estimate_min_profitable_iters (loop_vec_info);
  extern tree get_initial_def_for_reduction (gimple, tree, tree *);
  extern int vect_min_worthwhile_factor (enum tree_code);
  extern int vect_get_known_peeling_cost (loop_vec_info, int, int *, int,
--- 976,981 
Index: tree-vect-loop.c
===
*** tree-vect-loop.c(revision 192114)
--- tree-vect-loop.c(working copy)
*** along with GCC; see the file COPYING3.  
*** 140,145 
--- 140,147 
 http://gcc.gnu.org/projects/tree-ssa/vectorization.html
  */
  
+ static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+ 
  /* Function vect_determine_vectorization_factor
  
 Determine the vectorization factor (VF).  VF is the number of data elements
*** vect_analyze_loop_operations (loop_vec_i
*** 1287,1292 
--- 1289,1296 
unsigned int th;
bool only_slp_in_loop = true, ok;
HOST_WIDE_INT max_niter;
+   HOST_WIDE_INT estimated_niter;
+   int min_profitable_estimate;
  
if (dump_kind_p (MSG_NOTE))
  dump_printf_loc (MSG_NOTE, vect_location,
*** vect_analyze_loop_operations (loop_vec_i
*** 1490,1496 
   vector stmts depends on VF.  */
vect_update_slp_costs_according_to_vf (loop_vinfo);
  
!   min_profitable_iters = vect_estimate_min_profitable_iters (loop_vinfo);
LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters;
  
if (min_profitable_iters < 0)
--- 1494,1501 
   vector stmts depends on VF.  */
vect_update_slp_costs_according_to_vf (loop_vinfo);
  
!   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
! &min_profitable_estimate);
LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters;
  
if (min_profitable_iters < 0)
*** vect_analyze_loop_operations (loop_vec_i
*** 1531,1536 
--- 1537,1559 
return false;
  }
  
+   if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
+   && ((unsigned HOST_WIDE_INT) estimated_niter
+   <= MAX (th, (unsigned)min_profitable_estimate)))
+ {
+   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"not vectorized: estimated iteration count too "
+  "small.");
+   if (dump_kind_p (MSG_NOTE))
+ dump_printf_loc (MSG_NOTE, vect_location,
+"not vectorized: estimated iteration count smaller "
+  "than specified loop bound parameter or minimum "
+  "profitable iterations (whichever is more "
+  "conservative).");
+   return false;
+ }
+ 
if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
|| LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0
|| LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
*** vect_get_known_peeling_cost (loop_vec_in
*** 2603,2617 
  
 Return the number of iterations required for the vector version of the
 loop to be profitable relative to the cost of the scalar version of the
!loop.
  
!TODO: Take profile info into account before making vectorization
!decisions, if available.  */
! 
! int
! vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo)
  {
int min_profitable_iters;
int peel_iters_prologue;
int peel_iters_epilogue;
unsigned vec_inside_cost = 0;
--- 2626,2640 
  
 Return the number of iterations required for the vector version of the
 loop to be profitable relative to the cost of the scalar version of the
!loop.  */
  
! static void
! vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
!   int *ret_min_profitable_niters,
!   int *ret_min_profitable_estimate)
  {
int min_profitable_iters;
+   int min_profitab

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-04 Thread Richard Guenther
On Thu, 4 Oct 2012, Jan Hubicka wrote:

> > > So SOC cancels out in the runtime check.
> > > I still think we need two formulas - one determining if vectorization is
> > > profitable, other specifying the threshold for scalar path at runtime 
> > > (that
> > > will generally give lower values).
> > 
> > True, we want two values.  But part of the scalar path right now
> > is all the computation required for alias and alignment runtime checks
> > (because the way all the conditions are combined).
> > 
> > I'm not much into the details of what we account for in SOC (I suppose
> > it's everything we insert in the preheader of the vector loop).
> 
> Yes, it seems contain everything we insert prior the loop in unfolded form.
> > 
> > +  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> > +fprintf (vect_dump, "not vectorized: estimated iteration count 
> > too small.");
> > +  if (vect_print_dump_info (REPORT_DETAILS))
> > +fprintf (vect_dump, "not vectorized: estimated iteration count 
> > smaller than "
> > + "user specified loop bound parameter or minimum "
> > + "profitable iterations (whichever is more 
> > conservative).");
> > 
> > this won't work anymore btw - dumping infrastructure changed.
> 
> Ah, will update that.
> > 
> > I suppose your patch is a step in the right direction, but to really
> > make progress we need to re-organize the loop and predicate structure
> > produced by the vectorizer.
> 
> This reminds me what I did for string functions on x86. It gets very hard
> to get all the paths right when one starts to be really cureful to not
> output too much cruft on the short paths + do not consume too many registers.
> 
> In fact I want to re-think this for the SSE string ops patch, so I may try to
> look into that incrementally.
> > 
> > So, please update your patch, re-test and then it's ok.
> 
> Thanks.
> > > I tested enabling loop_ch in early passes with -fprofile-feedback and it 
> > > is SPEC
> > > neutral.  Given that it improves loop count estimates, I would still like 
> > > mainline
> > > doing that.  I do not like these quite important estimates to be wrong 
> > > most of time.
> > 
> > I agree.  It also helps getting rid of once rolling loops I think.
> 
> I am attaching the patch for early-ch.  Will commit it tomorrow.
> 
> Concerning jump threading, it would help to make some of it during early 
> passes
> so the profile estiamte do not get invalided.  I tried to move VRP early but 
> now it
> makes compiler to hang during bootstrap.  I will debug that.
> > 
> > > > 
> > > > Btw, I added a "similar" check in vect_analyze_loop_operations:
> > > > 
> > > >   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >&& (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
> > > >   || ((max_niter = max_stmt_executions_int (loop)) != -1
> > > >   && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> > > > {
> > > >   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > >  "not vectorized: iteration count too small.");
> > > >   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > >  "not vectorized: iteration count smaller than "
> > > >  "vectorization factor.");
> > > >   return false;
> > > > }
> > > > 
> > > > maybe you simply need to update that to also consider the profile?
> > > 
> > > Hmm, I am still getting familiar wth the code. Later we later have
> > >   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >   && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
> > > {
> > >   if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> > > fprintf (vect_dump, "not vectorized: vectorization not "
> > >  "profitable.");
> > >   if (vect_print_dump_info (REPORT_DETAILS))
> > > fprintf (vect_dump, "not vectorized: iteration count smaller than 
> > > "
> > >  "user specified loop bound parameter or minimum "
> > >  "profitable iterations (whichever is more 
> > > conservative).");
> > >   return false;
> > > }
> > > 
> > > where th is always greater or equal than vectorization_factor from the 
> > > cost model.
> > > So this test seems redundant if the max_stmt_executions_int was pushed 
> > > down
> > > to the second conditoinal?
> > 
> > Yes, sort of.  The new check was supposed to be crystal clear, and
> > even with the cost model disabled we want to not vectorize in this
> > case.  But yes, the whole cost-model stuff needs TLC.
> 
> Ah yes, without cost model we would skip it.  I suppose we do not need to
> brother  witht he profile estiamte in the case anyway. They are kind of aprt 
> of
> the cost models.
> 
>   * passes.c (init_optimization_passes): Schedule early CH.
> 

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-04 Thread Jan Hubicka
> > So SOC cancels out in the runtime check.
> > I still think we need two formulas - one determining if vectorization is
> > profitable, other specifying the threshold for scalar path at runtime (that
> > will generally give lower values).
> 
> True, we want two values.  But part of the scalar path right now
> is all the computation required for alias and alignment runtime checks
> (because the way all the conditions are combined).
> 
> I'm not much into the details of what we account for in SOC (I suppose
> it's everything we insert in the preheader of the vector loop).

Yes, it seems contain everything we insert prior the loop in unfolded form.
> 
> +  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +fprintf (vect_dump, "not vectorized: estimated iteration count 
> too small.");
> +  if (vect_print_dump_info (REPORT_DETAILS))
> +fprintf (vect_dump, "not vectorized: estimated iteration count 
> smaller than "
> + "user specified loop bound parameter or minimum "
> + "profitable iterations (whichever is more 
> conservative).");
> 
> this won't work anymore btw - dumping infrastructure changed.

Ah, will update that.
> 
> I suppose your patch is a step in the right direction, but to really
> make progress we need to re-organize the loop and predicate structure
> produced by the vectorizer.

This reminds me what I did for string functions on x86. It gets very hard
to get all the paths right when one starts to be really cureful to not
output too much cruft on the short paths + do not consume too many registers.

In fact I want to re-think this for the SSE string ops patch, so I may try to
look into that incrementally.
> 
> So, please update your patch, re-test and then it's ok.

Thanks.
> > I tested enabling loop_ch in early passes with -fprofile-feedback and it is 
> > SPEC
> > neutral.  Given that it improves loop count estimates, I would still like 
> > mainline
> > doing that.  I do not like these quite important estimates to be wrong most 
> > of time.
> 
> I agree.  It also helps getting rid of once rolling loops I think.

I am attaching the patch for early-ch.  Will commit it tomorrow.

Concerning jump threading, it would help to make some of it during early passes
so the profile estiamte do not get invalided.  I tried to move VRP early but 
now it
makes compiler to hang during bootstrap.  I will debug that.
> 
> > > 
> > > Btw, I added a "similar" check in vect_analyze_loop_operations:
> > > 
> > >   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >&& (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
> > >   || ((max_niter = max_stmt_executions_int (loop)) != -1
> > >   && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> > > {
> > >   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >  "not vectorized: iteration count too small.");
> > >   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >  "not vectorized: iteration count smaller than "
> > >  "vectorization factor.");
> > >   return false;
> > > }
> > > 
> > > maybe you simply need to update that to also consider the profile?
> > 
> > Hmm, I am still getting familiar wth the code. Later we later have
> >   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >   && LOOP_VINFO_INT_NITERS (loop_vinfo) <= th)
> > {
> >   if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> > fprintf (vect_dump, "not vectorized: vectorization not "
> >  "profitable.");
> >   if (vect_print_dump_info (REPORT_DETAILS))
> > fprintf (vect_dump, "not vectorized: iteration count smaller than "
> >  "user specified loop bound parameter or minimum "
> >  "profitable iterations (whichever is more conservative).");
> >   return false;
> > }
> > 
> > where th is always greater or equal than vectorization_factor from the cost 
> > model.
> > So this test seems redundant if the max_stmt_executions_int was pushed down
> > to the second conditoinal?
> 
> Yes, sort of.  The new check was supposed to be crystal clear, and
> even with the cost model disabled we want to not vectorize in this
> case.  But yes, the whole cost-model stuff needs TLC.

Ah yes, without cost model we would skip it.  I suppose we do not need to
brother  witht he profile estiamte in the case anyway. They are kind of aprt of
the cost models.

* passes.c (init_optimization_passes): Schedule early CH.
* tree-pass.h (pass_early_ch): Declare it.
* tree-ssa-loop-ch.c (gate_early_ch): New function.
(pass_early_ch): New pass.
Index: passes.c
===
--- passes.c(revision 191852)
+++ passes.c(working c

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-02 Thread Richard Guenther
On Mon, 1 Oct 2012, Jan Hubicka wrote:

> > > 
> > >  So the unvectorized cost is
> > >  SIC * niters
> > > 
> > >  The vectorized path is
> > >  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> > >  The scalar path of vectorizer loop is
> > >  SIC * niters + SOC
> > 
> > Note that 'th' is used for the runtime profitability check which is
> > done at the time the setup cost has already been taken (yes, we
> 
> Yes, I understand that.
> > probably should make it more conservative but then guard the whole
> > set of loops by the check, not only the vectorized path).
> > See PR53355 for the general issue.
> 
> Yep, we may reduce the cost of SOC by outputting early guard for 
> non-vectorized
> path better than we do now. However...
> > >Of course this is very simple benchmark, in reality the vectorizatoin 
> > > can be
> > >a lot more harmful by complicating more complex control flows.
> > >
> > >So I guess we have two options
> > > 1) go with the new formula and try to make cost model a bit more 
> > > realistic.
> > > 2) stay with original formula that is quite close to reality, but I 
> > > think
> > >more by an accident.
> > 
> > I think we need to improve it as whole, thus I'd prefer 2).
> 
> ... I do not see why.
> Even if we make the check cheaper we will only distribute part of SOC to 
> vector
> prologues/epilogues.
> 
> Still I think the formula is wrong, I.e. accounting SOC where it should not.
> 
> The cost of scalar path without vectorization is 
>   niters * SIC
> while with vectorization we have scalar path
>   niters * SIC + SOC
> and vector path
>   SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> 
> So SOC cancels out in the runtime check.
> I still think we need two formulas - one determining if vectorization is
> profitable, other specifying the threshold for scalar path at runtime (that
> will generally give lower values).

True, we want two values.  But part of the scalar path right now
is all the computation required for alias and alignment runtime checks
(because the way all the conditions are combined).

I'm not much into the details of what we account for in SOC (I suppose
it's everything we insert in the preheader of the vector loop).

+  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+fprintf (vect_dump, "not vectorized: estimated iteration count 
too small.");
+  if (vect_print_dump_info (REPORT_DETAILS))
+fprintf (vect_dump, "not vectorized: estimated iteration count 
smaller than "
+ "user specified loop bound parameter or minimum "
+ "profitable iterations (whichever is more 
conservative).");

this won't work anymore btw - dumping infrastructure changed.

I suppose your patch is a step in the right direction, but to really
make progress we need to re-organize the loop and predicate structure
produced by the vectorizer.

So, please update your patch, re-test and then it's ok.

> > > 2) Even when loop iterates 2 times, it is estimated to 4 iterations by
> > >estimated_stmt_executions_int with the profile feedback.
> > >The reason is loop_ch pass.  Given a rolled loop with exit probability
> > >30%, proceeds by duplicating the header with original probabilities.
> > >This makes the loop to be executed with 60% probability.  Because the
> > >loop body counts remain the same (and they should), the expected number
> > >of iterations increase by the decrease of entry edge to the header.
> > > 
> > >I wonder what to do about this.  Obviously without path profiling
> > >loop_ch can not really do a good job.  We can artifically make
> > >header to suceed more likely, that is the reality, but that requires
> > >non-trivial loop profile updating.
> > > 
> > >We can also simply record the iteration bound into loop structure 
> > >and ignore that the profile is not realistic
> > 
> > But we don't preserve loop structure from header copying ...
> 
> From what time we keep loop structure? In general I would like to eventualy
> drop value histograms to loop structure specifying number of iterations with
> profile feedback.

We preserve it from the start of the tree loop optimizers (it's easy
to preserve them from earlier points as long as you don't cross inlining,
but to lower the impact of the change I placed it where it was enough
to prevent the excessive unrolling/peeling done by RTL)

> > 
> > >Finally we can duplicate loop headers before profilng.  I implemented
> > >that via early_ch pass executed only with profile generation or 
> > > feedback.
> > >I guess it makes sense to do, even if it breaks the assumption that
> > >we should do strictly -Os generation on paths where
> > 
> > Well, there are CH cases that do not increase code size and I doubt
> > that loop header copying is generally bad for -Os ... we are not
> > good at handling non-copied loop headers.
> 
> There is comment saying 
>   /* Loop

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-01 Thread Jan Hubicka
> > 
> >  So the unvectorized cost is
> >  SIC * niters
> > 
> >  The vectorized path is
> >  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> >  The scalar path of vectorizer loop is
> >  SIC * niters + SOC
> 
> Note that 'th' is used for the runtime profitability check which is
> done at the time the setup cost has already been taken (yes, we

Yes, I understand that.
> probably should make it more conservative but then guard the whole
> set of loops by the check, not only the vectorized path).
> See PR53355 for the general issue.

Yep, we may reduce the cost of SOC by outputting early guard for non-vectorized
path better than we do now. However...
> >Of course this is very simple benchmark, in reality the vectorizatoin 
> > can be
> >a lot more harmful by complicating more complex control flows.
> >
> >So I guess we have two options
> > 1) go with the new formula and try to make cost model a bit more 
> > realistic.
> > 2) stay with original formula that is quite close to reality, but I 
> > think
> >more by an accident.
> 
> I think we need to improve it as whole, thus I'd prefer 2).

... I do not see why.
Even if we make the check cheaper we will only distribute part of SOC to vector
prologues/epilogues.

Still I think the formula is wrong, I.e. accounting SOC where it should not.

The cost of scalar path without vectorization is 
  niters * SIC
while with vectorization we have scalar path
  niters * SIC + SOC
and vector path
  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC

So SOC cancels out in the runtime check.
I still think we need two formulas - one determining if vectorization is
profitable, other specifying the threshold for scalar path at runtime (that
will generally give lower values).
> > 2) Even when loop iterates 2 times, it is estimated to 4 iterations by
> >estimated_stmt_executions_int with the profile feedback.
> >The reason is loop_ch pass.  Given a rolled loop with exit probability
> >30%, proceeds by duplicating the header with original probabilities.
> >This makes the loop to be executed with 60% probability.  Because the
> >loop body counts remain the same (and they should), the expected number
> >of iterations increase by the decrease of entry edge to the header.
> > 
> >I wonder what to do about this.  Obviously without path profiling
> >loop_ch can not really do a good job.  We can artifically make
> >header to suceed more likely, that is the reality, but that requires
> >non-trivial loop profile updating.
> > 
> >We can also simply record the iteration bound into loop structure 
> >and ignore that the profile is not realistic
> 
> But we don't preserve loop structure from header copying ...

>From what time we keep loop structure? In general I would like to eventualy
drop value histograms to loop structure specifying number of iterations with
profile feedback.
> 
> >Finally we can duplicate loop headers before profilng.  I implemented
> >that via early_ch pass executed only with profile generation or feedback.
> >I guess it makes sense to do, even if it breaks the assumption that
> >we should do strictly -Os generation on paths where
> 
> Well, there are CH cases that do not increase code size and I doubt
> that loop header copying is generally bad for -Os ... we are not
> good at handling non-copied loop headers.

There is comment saying 
  /* Loop header copying usually increases size of the code.  This used not to
 be true, since quite often it is possible to verify that the condition is
 satisfied in the first iteration and therefore to eliminate it.  Jump
 threading handles these cases now.  */
  if (optimize_loop_for_size_p (loop))
return false;

I am not sure how much backing it has. Schedule loop_ch as part of early passes
just after profile pass makes optimize_loop_for_size_p to return true 
even for functions that are later found cold by profile feedback.  I do not see
that being big issue.

I tested enabling loop_ch in early passes with -fprofile-feedback and it is SPEC
neutral.  Given that it improves loop count estimates, I would still like 
mainline
doing that.  I do not like these quite important estimates to be wrong most of 
time.

> 
> Btw, I added a "similar" check in vect_analyze_loop_operations:
> 
>   if ((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>&& (LOOP_VINFO_INT_NITERS (loop_vinfo) < vectorization_factor))
>   || ((max_niter = max_stmt_executions_int (loop)) != -1
>   && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor))
> {
>   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "not vectorized: iteration count too small.");
>   if (dump_kind_p (MSG_MISSED_OPTIMIZATION))
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "not vectorized: iteration cou

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-01 Thread Richard Guenther
On Sun, 30 Sep 2012, Jan Hubicka wrote:

> Hi,
> the point of the following patch is to make vectorizer to not vectorize the
> following testcase with profile feedback:
> 
> int a[1];
> int i=5;
> int k=2;
> int val;
> __attribute__ ((noinline,noclone))
> test()
> {
>   int j;
>   for(j=0;j a[j]=val;
> }
> main()
> {
>   while (i)
> {
>   test ();
>   i--;
> }
> }
> 
> Here the compiler should work out that the second loop iterates 2 times at the
> average and thus it is not good candidate for vectorizing.
> 
> In my first attempt I added the following:
> @@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i
>return false;
>  }
>  
> +  if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
> +  && (unsigned HOST_WIDE_INT) estimated_niter <= th)
> +{
> +  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +fprintf (vect_dump, "not vectorized: estimated iteration count too 
> small.");
> +  if (vect_print_dump_info (REPORT_DETAILS))
> +fprintf (vect_dump, "not vectorized: estimated iteration count 
> smaller than "
> + "user specified loop bound parameter or minimum "
> + "profitable iterations (whichever is more conservative).");
> +  return false;
> +}
> +
> 
> But to my surprise it does not help.  There are two things:
> 
> 1) the value of TH is bit low.  In a way the cost model works is that
>it finds minimal niters where vectorized loop with all the setup costs
>is cheaper than the vector loop with all the setup costs.  I.e.
> 
>   /* Calculate number of iterations required to make the vector version
>  profitable, relative to the loop bodies only.  The following condition
>  must hold true:
>  SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC(A)
>  where
>  SIC = scalar iteration cost, VIC = vector iteration cost,
>  VOC = vector outside cost, VF = vectorization factor,
>  PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>  SOC = scalar outside cost for run time cost model check.  */
> 
> This value is used for both
> 1) decision if number of iterations is too low (max iterations is known)
> 2) decision on runtime whether we want to take the vectorized path
> or the scalar path.
> 
> The vectoried loop looks like:
>   k.1_10 = k;
>   if (k.1_10 > 0)
>   {
> pretmp_2 = val;
> niters.8_4 = (unsigned int) k.1_10;
> bnd.9_13 = niters.8_4 >> 2;
> ratio_mult_vf.10_1 = bnd.9_13 << 2;
> _18 = niters.8_4 <= 3;
> _19 = ratio_mult_vf.10_1 == 0;
> _20 = _19 | _18;
> if (_20 != 0)
>   scalar loop
> else
>   vector prologue
>   }
> 
>  So the unvectorized cost is
>  SIC * niters
> 
>  The vectorized path is
>  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>  The scalar path of vectorizer loop is
>  SIC * niters + SOC

Note that 'th' is used for the runtime profitability check which is
done at the time the setup cost has already been taken (yes, we
probably should make it more conservative but then guard the whole
set of loops by the check, not only the vectorized path).
See PR53355 for the general issue.

>It makes sense to vectorize if
>SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (B)
>That is in the optimal cse where we actually vectorize the overall
>speed of vectorized loop including the runtime check is better.
> 
>It makes sense to take the vector loop if
>SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (C)
>Because the scalar loop is taken.
> 
>The attached patch implements the formula (C) and uses it to deterine the
>decision based on number of iterations estimate (that is usually provided 
> by
>the feedback)
> 
>As a reality check, I tried my testcase.
> 
>9: Cost model analysis:
>  Vector inside of loop cost: 1
>  Vector prologue cost: 7
>  Vector epilogue cost: 2
>  Scalar iteration cost: 1
>  Scalar outside cost: 6
>  Vector outside cost: 9
>  prologue iterations: 0
>  epilogue iterations: 2
>  Calculated minimum iters for profitability: 4
> 
>9:   Profitability threshold = 3
> 
>9:   Profitability estimated iterations threshold = 20
> 
>This is overrated. The loop starts to be benefical at about 4 iterations in
>reality.  I guess the values are kind of wrong.
> 
>Vector inside of loop cost and Scalar iteration cost seems to ignore the
>fact that the loops do contain some control flow that should account at 
> least
>one extra cycle.
> 
>Vector prologue cost seems bit overrated for one pack operation.
> 
>Of course this is very simple benchmark, in reality the vectorizatoin can 
> be
>a lot more harmful by complicating more complex control flows.
>
>So I guess we have two 

[RFC] Make vectorizer to skip loops with small iteration estimate

2012-09-30 Thread Jan Hubicka
Hi,
the point of the following patch is to make vectorizer to not vectorize the
following testcase with profile feedback:

int a[1];
int i=5;
int k=2;
int val;
__attribute__ ((noinline,noclone))
test()
{
  int j;
  for(j=0;j VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC(A)
 where
 SIC = scalar iteration cost, VIC = vector iteration cost,
 VOC = vector outside cost, VF = vectorization factor,
 PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
 SOC = scalar outside cost for run time cost model check.  */

This value is used for both
1) decision if number of iterations is too low (max iterations is known)
2) decision on runtime whether we want to take the vectorized path
or the scalar path.

The vectoried loop looks like:
  k.1_10 = k;
  if (k.1_10 > 0)
{
  pretmp_2 = val;
  niters.8_4 = (unsigned int) k.1_10;
  bnd.9_13 = niters.8_4 >> 2;
  ratio_mult_vf.10_1 = bnd.9_13 << 2;
  _18 = niters.8_4 <= 3;
  _19 = ratio_mult_vf.10_1 == 0;
  _20 = _19 | _18;
  if (_20 != 0)
scalar loop
  else
vector prologue
}

 So the unvectorized cost is
 SIC * niters

 The vectorized path is
 SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
 The scalar path of vectorizer loop is
 SIC * niters + SOC

   It makes sense to vectorize if
   SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (B)
   That is in the optimal cse where we actually vectorize the overall
   speed of vectorized loop including the runtime check is better.

   It makes sense to take the vector loop if
   SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (C)
   Because the scalar loop is taken.

   The attached patch implements the formula (C) and uses it to deterine the
   decision based on number of iterations estimate (that is usually provided by
   the feedback)

   As a reality check, I tried my testcase.

   9: Cost model analysis:
 Vector inside of loop cost: 1
 Vector prologue cost: 7
 Vector epilogue cost: 2
 Scalar iteration cost: 1
 Scalar outside cost: 6
 Vector outside cost: 9
 prologue iterations: 0
 epilogue iterations: 2
 Calculated minimum iters for profitability: 4

   9:   Profitability threshold = 3

   9:   Profitability estimated iterations threshold = 20

   This is overrated. The loop starts to be benefical at about 4 iterations in
   reality.  I guess the values are kind of wrong.

   Vector inside of loop cost and Scalar iteration cost seems to ignore the
   fact that the loops do contain some control flow that should account at least
   one extra cycle.

   Vector prologue cost seems bit overrated for one pack operation.

   Of course this is very simple benchmark, in reality the vectorizatoin can be
   a lot more harmful by complicating more complex control flows.

   So I guess we have two options
1) go with the new formula and try to make cost model a bit more realistic.
2) stay with original formula that is quite close to reality, but I think
   more by an accident.

2) Even when loop iterates 2 times, it is estimated to 4 iterations by
   estimated_stmt_executions_int with the profile feedback.
   The reason is loop_ch pass.  Given a rolled loop with exit probability
   30%, proceeds by duplicating the header with original probabilities.
   This makes the loop to be executed with 60% probability.  Because the
   loop body counts remain the same (and they should), the expected number
   of iterations increase by the decrease of entry edge to the header.

   I wonder what to do about this.  Obviously without path profiling
   loop_ch can not really do a good job.  We can artifically make
   header to suceed more likely, that is the reality, but that requires
   non-trivial loop profile updating.

   We can also simply record the iteration bound into loop structure 
   and ignore that the profile is not realistic

   Finally we can duplicate loop headers before profilng.  I implemented
   that via early_ch pass executed only with profile generation or feedback.
   I guess it makes sense to do, even if it breaks the assumption that
   we should do strictly -Os generation on paths where


Index: tree-vect-loop.c
===
--- tree-vect-loop.c(revision 191852)
+++ tree-vect-loop.c(working copy)
@@ -1243,6 +1243,8 @@ vect_analyze_loop_operations (loop_vec_i
   unsigned int th;
   bool only_slp_in_loop = true, ok;
   HOST_WIDE_INT max_niter;
+  HOST_WIDE_INT estimated_niter;
+  int min_profitable_estimate;
 
   if (vect_print_dump_info (REPORT_DETAILS))
 fprintf (vect_dump, "=== vect_analyze_loop_operations ===");
@@ -1436,7 +1438,8 @@ vect_analyze_loop_operations (loop_vec_i
  vector stmts depends on VF.  */
   vect_update_slp_costs_according_to_vf (loop_vinfo);
 
-  min_profita