[PATCH, rs6000] Remove XFAIL from default_format_denormal_2.f90 for PowerPC on Linux
Hi, The testcase gfortran.dg/default_format_denormal_2.f90 has been reporting XPASS since 4.8 on the powerpc*-unknown-linux-gnu platforms. This patch removes the XFAIL for powerpc*-*-linux-* from the test. I believe this pattern doesn't match any other platforms, but please let me know if I should replace it with a more specific pattern instead. Verified on powerpc64-unknown-linux-gnu (-m32 and -m64) and powerpc64le-unknown-linux-gnu (-m64). Is this ok for trunk, 4.9, and 4.8? Thanks, Bill 2014-06-17 Bill Schmidt wschm...@linux.vnet.ibm.com * gfortran.dg/default_format_denormal_2.f90: Remove xfail for powerpc*-*-linux*. Index: gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 === --- gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (revision 211741) +++ gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (working copy) @@ -1,5 +1,5 @@ ! { dg-require-effective-target fortran_large_real } -! { dg-do run { xfail powerpc*-apple-darwin* powerpc*-*-linux* } } +! { dg-do run { xfail powerpc*-apple-darwin* } } ! Test XFAILed on these platforms because the system's printf() lacks ! proper support for denormalized long doubles. See PR24685 !
Re: [PATCH] Fix PR54674
On Tue, 2012-09-25 at 09:14 +0200, Richard Guenther wrote: On Mon, 24 Sep 2012, William J. Schmidt wrote: In cases where pointers and ints are cast back and forth, SLSR can be tricked into introducing a multiply where one of the operands is of pointer type. Don't do that! Verified that the reduced test case in the PR is fixed with a cross-compile to sh4-unknown-linux-gnu with -Os, which is the only known situation where the replacement looks profitable. (It appears multiply costs are underestimated.) Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Ok. Btw, a multiply by/of a pointer in GIMPLE is done by casting to an appropriate unsigned type, doing the multiply, and then casting back to the pointer type. Just in case it _is_ profitable to do the transform (the patch seems to try to avoid the situation only?) Ok, that's good to know, thanks. There's a general to-do in that area to make the whole casting part better than it is right now, and that should be addressed when I can get back to GCC and work on some of these things. I'll add a comment to that effect. Appreciate the information! Thanks, Bill Thanks, Richard. Thanks, Bill 2012-09-24 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (analyze_increments): Don't introduce a multiplication with a pointer operand. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191665) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -2028,6 +2028,15 @@ analyze_increments (slsr_cand_t first_dep, enum ma incr_vec[i].cost = COST_INFINITE; + /* If we need to add an initializer, make sure we don't introduce +a multiply by a pointer type, which can happen in certain cast +scenarios. 
*/ + else if (!incr_vec[i].initializer && TREE_CODE (first_dep->stride) != INTEGER_CST && POINTER_TYPE_P (TREE_TYPE (first_dep->stride))) + + incr_vec[i].cost = COST_INFINITE; + /* For any other increment, if this is a multiply candidate, we must introduce a temporary T and initialize it with T_0 = stride * increment. When optimizing for speed, walk the
[PATCH] Fix PR54492
Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand;
Re: [PATCH] Fix PR54492
On Mon, 2012-09-10 at 16:45 +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, William J. Schmidt wrote: Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Hm, rather than stopping the search, can we stop adding new candidates instead so the list never grows that long? If that's not easy the patch is ok as-is. I think this way is probably better. Right now the potential bases are organized as a stack, with new ones added to the front and considered first. Stopping additions instead would require adding state to keep a count, and then we would only be looking at the most distant ones. This way the 50 most recently added potential bases (most likely to be local) are considered. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand;
Re: [PATCH] Fix PR54492
On Mon, 2012-09-10 at 16:56 +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, Jakub Jelinek wrote: On Mon, Sep 10, 2012 at 04:45:24PM +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, William J. Schmidt wrote: Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Hm, rather than stopping the search, can we stop adding new candidates instead so the list never grows that long? If that's not easy the patch is ok as-is. Don't we want a param for that, or is a hardcoded magic constant fine here? I suppose a param for it would be nice. OK, I'll get a param in place and get back to you. Thanks... Bill Richard. 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand; Jakub
Re: [PATCH] Fix PR54492
Here's the revised patch with a param. Bootstrapped and tested in the same manner. Ok for trunk? Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * doc/invoke.texi (max-slsr-cand-scan): New description. * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. * params.def (PARAM_MAX_SLSR_CANDIDATE_SCAN): New param. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 191135) +++ gcc/doc/invoke.texi (working copy) @@ -9407,6 +9407,11 @@ having a regular register file and accurate regist See @file{haifa-sched.c} in the GCC sources for more details. The default choice depends on the target. + +@item max-slsr-cand-scan +Set the maximum number of existing candidates that will be considered when +seeking a basis for a new straight-line strength reduction candidate. + @end table @end table Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see #include "domwalk.h" #include "pointer-set.h" #include "expmed.h" +#include "params.h" /* Information about a strength reduction candidate. Each statement in the candidate table represents an expression of one of the @@ -353,10 +354,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + int max_iters = PARAM_VALUE (PARAM_MAX_SLSR_CANDIDATE_SCAN); + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < max_iters; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand; Index: gcc/params.def === --- gcc/params.def (revision 191135) +++ gcc/params.def (working copy) @@ -973,6 +973,13 @@ DEFPARAM (PARAM_SCHED_PRESSURE_ALGORITHM, "Which -fsched-pressure algorithm to apply", 1, 1, 2) +/* Maximum length of candidate scans in straight-line strength reduction. */ +DEFPARAM (PARAM_MAX_SLSR_CANDIDATE_SCAN, + "max-slsr-cand-scan", + "Maximum length of candidate scans for straight-line " + "strength reduction", + 50, 1, 99) + /* Local variables: mode:c
Re: [patch] rs6000: plug a leak
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote: Hello Bill, This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to free the array that get_loop_body returns. Noticed while going over all uses of get_loop_body (it's a common mistake to leak the return array). Patch is completely untested because I don't know when/how this function is used. You've added this function: 2012-07-31 Bill Schmidt ... * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. --(rs6000_density_test): New function so I suppose you know what it's for and how to test this patch :-) Could you test this for me and commit it if nothing strange happens? Hi Steven, Regstrapped with no additional failures on powerpc64-unknown-linux-gnu. Built CPU2006 without error. Committed as obvious. Thanks again, Bill Thanks, Ciao! Steven Index: config/rs6000/rs6000.c === --- config/rs6000/rs6000.c (revision 190601) +++ config/rs6000/rs6000.c (working copy) @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d not_vec_cost++; } } + free (bbs); density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
Re: [patch] rs6000: plug a leak
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote: Hello Bill, This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to free the array that get_loop_body returns. Noticed while going over all uses of get_loop_body (it's a common mistake to leak the return array). Patch is completely untested because I don't know when/how this function is used. You've added this function: 2012-07-31 Bill Schmidt ... * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. --(rs6000_density_test): New function so I suppose you know what it's for and how to test this patch :-) Could you test this for me and commit it if nothing strange happens? Sure thing! Thanks for catching this. Bill Thanks, Ciao! Steven Index: config/rs6000/rs6000.c === --- config/rs6000/rs6000.c (revision 190601) +++ config/rs6000/rs6000.c (working copy) @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d not_vec_cost++; } } + free (bbs); density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
[PATCH] Fix PR54240
Replace the once vacuously true, and now vacuously false, test for existence of a conditional move instruction for a given mode with one that actually checks what it's supposed to. Add a test case so we don't miss such things in future. The test is powerpc-specific. It would be good to have an i386 version of the test as well, if someone can help with that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-13 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54240 * tree-ssa-phiopt.c (hoist_adjacent_loads): Correct test for existence of conditional move with given mode. gcc/testsuite: 2012-08-13 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54240 * gcc.target/powerpc/pr54240.c: New test. Index: gcc/testsuite/gcc.target/powerpc/pr54240.c === --- gcc/testsuite/gcc.target/powerpc/pr54240.c (revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr54240.c (revision 0) @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -misel -fdump-tree-phiopt-details" } */ + +typedef struct s { + int v; + int b; + struct s *l; + struct s *r; +} S; + + +int foo(S *s) +{ + S *this; + S *next; + + this = s; + if (this->b) +next = this->l; + else +next = this->r; + + return next->v; +} + +/* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */ +/* { dg-final { cleanup-tree-dump "phiopt1" } } */ Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 190305) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -1843,7 +1843,8 @@ hoist_adjacent_loads (basic_block bb0, basic_block /* Check the mode of the arguments to be sure a conditional move can be generated for it. */ - if (!optab_handler (cmov_optab, TYPE_MODE (TREE_TYPE (arg1)))) + if (optab_handler (movcc_optab, TYPE_MODE (TREE_TYPE (arg1))) + == CODE_FOR_nothing) continue; /* Both statements must be assignments whose RHS is a COMPONENT_REF. */
Re: [PATCH] Fix PR54240
Thanks, Andrew! Bill On Tue, 2012-08-14 at 14:17 -0700, Andrew Pinski wrote: On Tue, Aug 14, 2012 at 2:15 PM, Andrew Pinski pins...@gmail.com wrote: On Tue, Aug 14, 2012 at 2:11 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Replace the once vacuously true, and now vacuously false, test for existence of a conditional move instruction for a given mode, with one that actually checks what it's supposed to. Add a test case so we don't miss such things in future. The test is powerpc-specific. It would be good to have an i386 version of the test as well, if someone can help with that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Here is one which can go into gcc.target/mips : /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-phiopt-details" } */ Sorry the dg-options should be: /* { dg-options "-O2 -fdump-tree-phiopt-details isa>=4" } */ Thanks, Andrew typedef struct s { int v; int b; struct s *l; struct s *r; } S; int foo(S *s) { S *this; S *next; this = s; if (this->b) next = this->l; else next = this->r; return next->v; } /* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */ /* { dg-final { cleanup-tree-dump "phiopt1" } } */
[PATCH] Fix PR54245
Currently we can insert an initializer that performs a multiply in too small of a type for correctness. For now, detect the problem and avoid the optimization when this would happen. Eventually I will fix this up to cause the multiply to be performed in a sufficiently wide type. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54245 * gimple-ssa-strength-reduction.c (legal_cast_p_1): New function. (legal_cast_p): Split out logic to legal_cast_p_1. (analyze_increments): Avoid introducing multiplies in smaller types. gcc/testsuite: 2012-08-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54245 * gcc.dg/tree-ssa/pr54245.c: New test. Index: gcc/testsuite/gcc.dg/tree-ssa/pr54245.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0) @@ -0,0 +1,49 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-slsr-details" } */ + +#include <stdio.h> + +#define W1 22725 +#define W2 21407 +#define W3 19266 +#define W6 8867 + +void idct_row(short *row, int *dst) +{ +int a0, a1, b0, b1; + +a0 = W1 * row[0]; +a1 = a0; + +a0 += W2 * row[2]; +a1 += W6 * row[2]; + +b0 = W1 * row[1]; +b1 = W3 * row[1]; + +dst[0] = a0 + b0; +dst[1] = a0 - b0; +dst[2] = a1 + b1; +dst[3] = a1 - b1; +} + +static short block[8] = { 1, 2, 3, 4 }; + +int main(void) +{ +int out[4]; +int i; + +idct_row(block, out); + +for (i = 0; i < 4; i++) +printf("%d\n", out[i]); + +return !(out[2] == 87858 && out[3] == 10794); +} + +/* For now, disable inserting an initializer when the multiplication will + take place in a smaller type than originally. This test may be deleted + in future when this case is handled more precisely. 
*/ +/* { dg-final { scan-tree-dump-times "Inserting initializer" 0 "slsr" } } */ +/* { dg-final { cleanup-tree-dump "slsr" } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190305) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -1089,6 +1089,32 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed add_cand_for_stmt (gs, c); } +/* Help function for legal_cast_p, operating on two trees. Checks + whether it's allowable to cast from RHS to LHS. See legal_cast_p + for more details. */ + +static bool +legal_cast_p_1 (tree lhs, tree rhs) +{ + tree lhs_type, rhs_type; + unsigned lhs_size, rhs_size; + bool lhs_wraps, rhs_wraps; + + lhs_type = TREE_TYPE (lhs); + rhs_type = TREE_TYPE (rhs); + lhs_size = TYPE_PRECISION (lhs_type); + rhs_size = TYPE_PRECISION (rhs_type); + lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type); + rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type); + + if (lhs_size < rhs_size + || (rhs_wraps && !lhs_wraps) + || (rhs_wraps && lhs_wraps && rhs_size != lhs_size)) +return false; + + return true; +} + /* Return TRUE if GS is a statement that defines an SSA name from a conversion and is legal for us to combine with an add and multiply in the candidate table. 
For example, suppose we have: @@ -1129,28 +1155,11 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed static bool legal_cast_p (gimple gs, tree rhs) { - tree lhs, lhs_type, rhs_type; - unsigned lhs_size, rhs_size; - bool lhs_wraps, rhs_wraps; - if (!is_gimple_assign (gs) || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (gs))) return false; - lhs = gimple_assign_lhs (gs); - lhs_type = TREE_TYPE (lhs); - rhs_type = TREE_TYPE (rhs); - lhs_size = TYPE_PRECISION (lhs_type); - rhs_size = TYPE_PRECISION (rhs_type); - lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type); - rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type); - - if (lhs_size < rhs_size - || (rhs_wraps && !lhs_wraps) - || (rhs_wraps && lhs_wraps && rhs_size != lhs_size)) -return false; - - return true; + return legal_cast_p_1 (gimple_assign_lhs (gs), rhs); } /* Given GS which is a cast to a scalar integer type, determine whether @@ -1996,6 +2005,31 @@ analyze_increments (slsr_cand_t first_dep, enum ma != POINTER_PLUS_EXPR))) incr_vec[i].cost = COST_NEUTRAL; + /* FORNOW: If we need to add an initializer, give up if a cast from +the candidate's type to its stride's type can lose precision. +This could eventually be handled better by expressly retaining the +result of a cast to a wider type in the stride. Example: + + short int _1; + _2 = (int) _1; + _3 = _2 * 10; + _4 = x + _3; ADD: x + (10 * _1) : int + _5 = _2 * 15; + _6 = x + _5; ADD: x + (15 * _1) : int + + Right now replacing _6
Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides
On Wed, 2012-08-08 at 19:22 -0700, Janis Johnson wrote: On 08/08/2012 06:41 PM, William J. Schmidt wrote: On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote: On 08/08/2012 03:27 PM, Andrew Pinski wrote: On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu hjl.to...@gmail.com wrote: On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: +/* { dg-do compile } */ +/* { dg-options -O3 -fdump-tree-dom2 -fwrapv } */ +/* { dg-skip-if { ilp32 } { -m32 } { } } */ + This doesn't work on x32 nor Linux/ia32 since -m32 may not be needed for ILP32. This patch works for me. OK to install? This also does not work for mips64 where the options are either -mabi=32 or -mabi=n32 for ILP32. HJL's patch looks correct. Thanks, Andrew There are GCC targets with 16-bit integers. What's the actual set of targets on which this test is meant to run? There's a list of effective-target names based on data type sizes in http://gcc.gnu.org/onlinedocs/gccint/Effective_002dTarget-Keywords.html#Effective_002dTarget-Keywords. Yes, sorry. The test really is only valid when int and long have different sizes. So according to that link we should skip ilp32 and llp64 at a minimum. It isn't clear what we should do for int16 since the size of long isn't specified, so I suppose we should skip that as well. So, perhaps modify HJ's patch to have /* { dg-do compile { target { ! { ilp32 llp64 int16 } } } } */ ? Thanks, Bill That's confusing. Perhaps what you really need is a new effective target for sizeof(int) != sizeof(long). Good idea. I'll work up a patch when I get a moment. Thanks, Bill Janis
[PATCH] Fix PR54211
Fix a thinko in strength reduction. I was checking the type of the wrong operand to determine whether address arithmetic should be used in replacing expressions. This produced a spurious POINTER_PLUS_EXPR when an address was converted to an unsigned long and back again. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com PR middle-end/54211 * gimple-ssa-strength-reduction.c (analyze_candidates_and_replace): Use cand_type to determine whether pointer arithmetic will be generated. gcc/testsuite: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com PR middle-end/54211 * gcc.dg/tree-ssa/pr54211.c: New test. Index: gcc/testsuite/gcc.dg/tree-ssa/pr54211.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +int a, b; +unsigned char e; +void fn1 () +{ +unsigned char *c=0; +for (;; a++) +{ +unsigned char d = *(c + b); +for (; e < d; c++) +goto Found_Top; +} +Found_Top: +if (0) +goto Empty_Bitmap; +for (;; a++) +{ +unsigned char *e = c + b; +for (; c < e; c++) +goto Found_Bottom; +c -= b; +} +Found_Bottom: +Empty_Bitmap: +; +} Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190260) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -2534,7 +2534,7 @@ analyze_candidates_and_replace (void) /* Determine whether we'll be generating pointer arithmetic when replacing candidates. */ address_arithmetic_p = (c->kind == CAND_ADD - && POINTER_TYPE_P (TREE_TYPE (c->base_expr))); + && POINTER_TYPE_P (c->cand_type)); /* If all candidates have already been replaced under other interpretations, nothing remains to be done. */
[PATCH, testsuite] New effective target long_neq_int
As suggested by Janis regarding testsuite/gcc.dg/tree-ssa/slsr-30.c, this patch adds a new effective target for machines having long and int of differing sizes. Tested on powerpc64-unknown-linux-gnu, where the test passes for -m64 and is skipped for -m32. Ok for trunk? Thanks, Bill doc: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com * sourcebuild.texi: Document long_neq_int effective target. testsuite: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com * lib/target-supports.exp (check_effective_target_long_neq_int): New. * gcc.dg/tree-ssa/slsr-30.c: Check for long_neq_int effective target. Index: gcc/doc/sourcebuild.texi === --- gcc/doc/sourcebuild.texi(revision 190260) +++ gcc/doc/sourcebuild.texi(working copy) @@ -1303,6 +1303,9 @@ Target has @code{int} that is at 32 bits or longer @item int16 Target has @code{int} that is 16 bits or shorter. +@item long_neq_int +Target has @code{int} and @code{long} with different sizes. + @item large_double Target supports @code{double} that is longer than @code{float}. Index: gcc/testsuite/lib/target-supports.exp === --- gcc/testsuite/lib/target-supports.exp (revision 190260) +++ gcc/testsuite/lib/target-supports.exp (working copy) @@ -1689,6 +1689,15 @@ proc check_effective_target_llp64 { } { }] } +# Return 1 if long and int have different sizes, +# 0 otherwise. + +proc check_effective_target_long_neq_int { } { +return [check_no_compiler_messages long_ne_int object { + int dummy[sizeof (int) != sizeof (long) ? 1 : -1]; +}] +} + # Return 1 if the target supports long double larger than double, # 0 otherwise. Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 190260) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (working copy) @@ -1,7 +1,7 @@ /* Verify straight-line strength reduction fails for simple integer addition with casts thrown in when -fwrapv is used. */ -/* { dg-do compile { target { ! 
{ ilp32 } } } } */ +/* { dg-do compile { target { long_neq_int } } } */ /* { dg-options "-O3 -fdump-tree-dom2 -fwrapv" } */ long
Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides
On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote: On 08/08/2012 03:27 PM, Andrew Pinski wrote: On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu hjl.to...@gmail.com wrote: On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Greetings, Thanks for the review of part 2! Here's another chunk of the SLSR code (I feel I owe you a few beers at this point). This performs analysis and replacement on groups of related candidates having an SSA name (rather than a constant) for a stride. This leaves only the conditional increment (CAND_PHI) case, which will be handled in the last patch of the series. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gimple-ssa-strength-reduction.c (struct incr_info_d): New struct. (incr_vec): New static var. (incr_vec_len): Likewise. (address_arithmetic_p): Likewise. (stmt_cost): Remove dead assignment. (dump_incr_vec): New function. (cand_abs_increment): Likewise. (lazy_create_slsr_reg): Likewise. (incr_vec_index): Likewise. (count_candidates): Likewise. (record_increment): Likewise. (record_increments): Likewise. (unreplaced_cand_in_tree): Likewise. (optimize_cands_for_speed_p): Likewise. (lowest_cost_path): Likewise. (total_savings): Likewise. (analyze_increments): Likewise. (ncd_for_two_cands): Likewise. (nearest_common_dominator_for_cands): Likewise. (profitable_increment_p): Likewise. (insert_initializers): Likewise. (introduce_cast_before_cand): Likewise. (replace_rhs_if_not_dup): Likewise. (replace_one_candidate): Likewise. (replace_profitable_candidates): Likewise. (analyze_candidates_and_replace): Handle candidates with SSA-name strides. gcc/testsuite: 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gcc.dg/tree-ssa/slsr-5.c: New. * gcc.dg/tree-ssa/slsr-6.c: New. * gcc.dg/tree-ssa/slsr-7.c: New. * gcc.dg/tree-ssa/slsr-8.c: New. * gcc.dg/tree-ssa/slsr-9.c: New. * gcc.dg/tree-ssa/slsr-10.c: New. 
* gcc.dg/tree-ssa/slsr-11.c: New. * gcc.dg/tree-ssa/slsr-12.c: New. * gcc.dg/tree-ssa/slsr-13.c: New. * gcc.dg/tree-ssa/slsr-14.c: New. * gcc.dg/tree-ssa/slsr-15.c: New. * gcc.dg/tree-ssa/slsr-16.c: New. * gcc.dg/tree-ssa/slsr-17.c: New. * gcc.dg/tree-ssa/slsr-18.c: New. * gcc.dg/tree-ssa/slsr-19.c: New. * gcc.dg/tree-ssa/slsr-20.c: New. * gcc.dg/tree-ssa/slsr-21.c: New. * gcc.dg/tree-ssa/slsr-22.c: New. * gcc.dg/tree-ssa/slsr-23.c: New. * gcc.dg/tree-ssa/slsr-24.c: New. * gcc.dg/tree-ssa/slsr-25.c: New. * gcc.dg/tree-ssa/slsr-26.c: New. * gcc.dg/tree-ssa/slsr-30.c: New. * gcc.dg/tree-ssa/slsr-31.c: New. == --- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0) @@ -0,0 +1,25 @@ +/* Verify straight-line strength reduction fails for simple integer addition + with casts thrown in when -fwrapv is used. */ + +/* { dg-do compile } */ +/* { dg-options -O3 -fdump-tree-dom2 -fwrapv } */ +/* { dg-skip-if { ilp32 } { -m32 } { } } */ + This doesn't work on x32 nor Linux/ia32 since -m32 may not be needed for ILP32. This patch works for me. OK to install? This also does not work for mips64 where the options are either -mabi=32 or -mabi=n32 for ILP32. HJL's patch looks correct. Thanks, Andrew There are GCC targets with 16-bit integers. What's the actual set of targets on which this test is meant to run? There's a list of effective-target names based on data type sizes in http://gcc.gnu.org/onlinedocs/gccint/Effective_002dTarget-Keywords.html#Effective_002dTarget-Keywords. Yes, sorry. The test really is only valid when int and long have different sizes. So according to that link we should skip ilp32 and llp64 at a minimum. It isn't clear what we should do for int16 since the size of long isn't specified, so I suppose we should skip that as well. So, perhaps modify HJ's patch to have /* { dg-do compile { target { ! { ilp32 llp64 int16 } } } } */ ? Thanks, Bill Janis Thanks. -- H.J. 
--- * gcc.dg/tree-ssa/slsr-30.c: Require non-ilp32. Remove dg-skip-if. diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c index fbd6897..7921f43 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-30
[PATCH, committed] Fix PR53773
Change this test case to use the optimized dump so that the unreliable vect-details dump can't cause different behavior on different targets. Verified on powerpc64-unknown-linux-gnu, committed as obvious. Thanks, Bill 2012-08-03 Bill Schmidt wschm...@linux.ibm.com * testsuite/gcc.dg/vect/pr53773.c: Change to use optimized dump. Index: gcc/testsuite/gcc.dg/vect/pr53773.c === --- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 190018) +++ gcc/testsuite/gcc.dg/vect/pr53773.c (working copy) @@ -1,4 +1,5 @@ /* { dg-do compile } */ +/* { dg-options "-fdump-tree-optimized" } */ int foo (int integral, int decimal, int power_ten) @@ -13,7 +14,7 @@ foo (int integral, int decimal, int power_ten) return integral+decimal; } -/* Two occurrences in annotations, two in code. */ -/* { dg-final { scan-tree-dump-times "\\* 10" 4 "vect" } } */ +/* { dg-final { scan-tree-dump-times "\\* 10" 2 "optimized" } } */ /* { dg-final { cleanup-tree-dump "vect" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */
[PATCH, committed] Strength reduction clean-up (base name = base expr)
This cleans up terminology in strength reduction. What used to be a base SSA name is now sometimes other tree expressions, so the term base name is replaced by base expression throughout. Bootstrapped and tested with no new regressions on powerpc64-unknown-linux-gnu; committed as obvious. Thanks, Bill 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gimple-ssa-strength-reduction.c (struct slsr_cand_d): Change base_name to base_expr. (struct cand_chain_d): Likewise. (base_cand_hash): Likewise. (base_cand_eq): Likewise. (record_potential_basis): Likewise. (alloc_cand_and_find_basis): Likewise. (create_mul_ssa_cand): Likewise. (create_mul_imm_cand): Likewise. (create_add_ssa_cand): Likewise. (create_add_imm_cand): Likewise. (slsr_process_cast): Likewise. (slsr_process_copy): Likewise. (dump_candidate): Likewise. (base_cand_dump_callback): Likewise. (unconditional_cands_with_known_stride_p): Likewise. (cand_increment): Likewise. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190037) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -166,8 +166,8 @@ struct slsr_cand_d /* The candidate statement S1. */ gimple cand_stmt; - /* The base SSA name B. */ - tree base_name; + /* The base expression B: often an SSA name, but not always. */ + tree base_expr; /* The stride S. */ tree stride; @@ -175,7 +175,7 @@ struct slsr_cand_d /* The index constant i. */ double_int index; - /* The type of the candidate. This is normally the type of base_name, + /* The type of the candidate. This is normally the type of base_expr, but casts may have occurred when combining feeding instructions. A candidate can only be a basis for candidates of the same final type. 
(For CAND_REFs, this is the type to be used for operand 1 of the @@ -216,12 +216,13 @@ typedef struct slsr_cand_d slsr_cand, *slsr_cand_t typedef const struct slsr_cand_d *const_slsr_cand_t; /* Pointers to candidates are chained together as part of a mapping - from SSA names to the candidates that use them as a base name. */ + from base expressions to the candidates that use them. */ struct cand_chain_d { - /* SSA name that serves as a base name for the chain of candidates. */ - tree base_name; + /* Base expression for the chain of candidates: often, but not + always, an SSA name. */ + tree base_expr; /* Pointer to a candidate. */ slsr_cand_t cand; @@ -253,7 +254,7 @@ static struct pointer_map_t *stmt_cand_map; /* Obstack for candidates. */ static struct obstack cand_obstack; -/* Hash table embodying a mapping from base names to chains of candidates. */ +/* Hash table embodying a mapping from base exprs to chains of candidates. */ static htab_t base_cand_map; /* Obstack for candidate chains. */ @@ -272,7 +273,7 @@ lookup_cand (cand_idx idx) static hashval_t base_cand_hash (const void *p) { - tree base_expr = ((const_cand_chain_t) p)-base_name; + tree base_expr = ((const_cand_chain_t) p)-base_expr; return iterative_hash_expr (base_expr, 0); } @@ -291,10 +292,10 @@ base_cand_eq (const void *p1, const void *p2) { const_cand_chain_t const chain1 = (const_cand_chain_t) p1; const_cand_chain_t const chain2 = (const_cand_chain_t) p2; - return operand_equal_p (chain1-base_name, chain2-base_name, 0); + return operand_equal_p (chain1-base_expr, chain2-base_expr, 0); } -/* Use the base name from candidate C to look for possible candidates +/* Use the base expr from candidate C to look for possible candidates that can serve as a basis for C. Each potential basis must also appear in a block that dominates the candidate statement and have the same stride and type. 
If more than one possible basis exists, @@ -308,7 +309,7 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; - mapping_key.base_name = c-base_name; + mapping_key.base_expr = c-base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); for (; chain; chain = chain-next) @@ -337,8 +338,8 @@ find_basis_for_candidate (slsr_cand_t c) return 0; } -/* Record a mapping from the base name of C to C itself, indicating that - C may potentially serve as a basis using that base name. */ +/* Record a mapping from the base expression of C to C itself, indicating that + C may potentially serve as a basis using that base expression. */ static void record_potential_basis (slsr_cand_t c) @@ -347,7 +348,7 @@ record_potential_basis (slsr_cand_t c) void **slot; node = (cand_chain_t) obstack_alloc (chain_obstack, sizeof (cand_chain)); - node-base_name = c-base_name; + node-base_expr = c-base_expr; node-cand = c; node-next = NULL; slot = htab_find_slot
[PATCH, rs6000] Vectorizer heuristic
Now that the vectorizer cost model is set up to facilitate per-target heuristics, I'm revisiting the density heuristic I submitted previously. This allows the vec_permute and vec_promote_demote costs to be set to their natural values, but inhibits vectorization in cases like sphinx3 where vectorizing a loop leads to issue stalls from overcommitted resources. Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions. Measured performance on cpu2000 and cpu2006 with no significant changes in performance. Ok for trunk? Thanks, Bill 2012-07-31 Bill Schmidt wschm...@linux.ibm.com * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. (rs6000_density_test): New function. (rs6000_init_cost): Change to use rs6000_cost_data. (rs6000_add_stmt_cost): Likewise. (rs6000_finish_cost): Perform density test when vectorizing a loop. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 189845) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -60,6 +60,7 @@ #include params.h #include tm-constrs.h #include opts.h +#include tree-vectorizer.h #if TARGET_XCOFF #include xcoffout.h /* get declarations of xcoff_*_section_name */ #endif @@ -3378,13 +3379,13 @@ rs6000_builtin_vectorization_cost (enum vect_cost_ case vec_perm: if (TARGET_VSX) - return 4; + return 3; else return 1; case vec_promote_demote: if (TARGET_VSX) - return 5; + return 4; else return 1; @@ -3520,14 +3521,71 @@ rs6000_preferred_simd_mode (enum machine_mode mode return word_mode; } +typedef struct _rs6000_cost_data +{ + struct loop *loop_info; + unsigned cost[3]; +} rs6000_cost_data; + +/* Test for likely overcommitment of vector hardware resources. If a + loop iteration is relatively large, and too large a percentage of + instructions in the loop are vectorized, the cost model may not + adequately reflect delays from unavailable vector resources. 
+   Penalize the loop body cost for this case.  */
+
+static void
+rs6000_density_test (rs6000_cost_data *data)
+{
+  const int DENSITY_PCT_THRESHOLD = 85;
+  const int DENSITY_SIZE_THRESHOLD = 70;
+  const int DENSITY_PENALTY = 10;
+  struct loop *loop = data->loop_info;
+  basic_block *bbs = get_loop_body (loop);
+  int nbbs = loop->num_nodes;
+  int vec_cost = data->cost[vect_body], not_vec_cost = 0;
+  int i, density_pct;
+
+  for (i = 0; i < nbbs; i++)
+    {
+      basic_block bb = bbs[i];
+      gimple_stmt_iterator gsi;
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+          gimple stmt = gsi_stmt (gsi);
+          stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+          if (!STMT_VINFO_RELEVANT_P (stmt_info)
+              && !STMT_VINFO_IN_PATTERN_P (stmt_info))
+            not_vec_cost++;
+        }
+    }
+
+  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
+
+  if (density_pct > DENSITY_PCT_THRESHOLD
+      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+    {
+      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+      if (vect_print_dump_info (REPORT_DETAILS))
+        fprintf (vect_dump,
+                 "density %d%%, cost %d exceeds threshold, penalizing "
+                 "loop body cost by %d%%", density_pct,
+                 vec_cost + not_vec_cost, DENSITY_PENALTY);
+    }
+}
+
 /* Implement targetm.vectorize.init_cost.  */
 
 static void *
-rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
+rs6000_init_cost (struct loop *loop_info)
 {
-  unsigned *cost = XNEWVEC (unsigned, 3);
-  cost[vect_prologue] = cost[vect_body] = cost[vect_epilogue] = 0;
-  return cost;
+  rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
+  data->loop_info = loop_info;
+  data->cost[vect_prologue] = 0;
+  data->cost[vect_body] = 0;
+  data->cost[vect_epilogue] = 0;
+  return data;
 }
 
 /* Implement targetm.vectorize.add_stmt_cost.
   */
@@ -3537,7 +3595,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
                      struct _stmt_vec_info *stmt_info, int misalign,
                      enum vect_cost_model_location where)
 {
-  unsigned *cost = (unsigned *) data;
+  rs6000_cost_data *cost_data = (rs6000_cost_data *) data;
   unsigned retval = 0;
 
   if (flag_vect_cost_model)
@@ -3552,7 +3610,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
        count *= 50;  /* FIXME.  */
 
       retval = (unsigned) (count * stmt_cost);
-      cost[where] += retval;
+      cost_data->cost[where] += retval;
     }
 
   return retval;
@@ -3564,10 +3622,14 @@ static void
 rs6000_finish_cost (void *data, unsigned *prologue_cost,
                    unsigned *body_cost, unsigned *epilogue_cost)
 {
-  unsigned *cost =
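The density heuristic in the patch above boils down to a small arithmetic rule. Here is a hedged distillation of it as a standalone function — the constants come straight from the patch, but the function name and standalone form are illustrative, not GCC's actual interface:

```c
#include <assert.h>

/* Thresholds as in the rs6000 patch: penalize only loops that are both
   large (total cost above DENSITY_SIZE_THRESHOLD) and dominated by
   vectorized statements (above DENSITY_PCT_THRESHOLD percent).  */
enum { DENSITY_PCT_THRESHOLD = 85,
       DENSITY_SIZE_THRESHOLD = 70,
       DENSITY_PENALTY = 10 };

/* Return the (possibly penalized) vector body cost, given the cost of
   vectorized statements and the count of non-vectorized statements.  */
static unsigned
penalized_body_cost (unsigned vec_cost, unsigned not_vec_cost)
{
  unsigned density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

  if (density_pct > DENSITY_PCT_THRESHOLD
      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
    return vec_cost * (100 + DENSITY_PENALTY) / 100;   /* +10% penalty */

  return vec_cost;
}
```

For example, a loop with vector cost 100 and only 10 scalar statements is 90% dense and large enough, so its body cost is inflated to 110; a balanced 50/50 loop is left alone.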
[PATCH] Fix PR53773
This fixes the de-canonicalization of commutative GIMPLE operations in the vectorizer that occurs when processing reductions. A loop_vec_info is flagged for cleanup when a de-canonicalization has occurred in that loop, and the cleanup is done when the loop_vec_info is destroyed. Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-07-30 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53773 * tree-vectorizer.h (struct _loop_vec_info): Add operands_swapped. (LOOP_VINFO_OPERANDS_SWAPPED): New macro. * tree-vect-loop.c (new_loop_vec_info): Initialize LOOP_VINFO_OPERANDS_SWAPPED field. (destroy_loop_vec_info): Restore canonical form. (vect_is_slp_reduction): Set LOOP_VINFO_OPERANDS_SWAPPED field. (vect_is_simple_reduction_1): Likewise. gcc/testsuite: 2012-07-30 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53773 * testsuite/gcc.dg/vect/pr53773.c: New test. Index: gcc/testsuite/gcc.dg/vect/pr53773.c === --- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0) +++ gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0) @@ -0,0 +1,19 @@ +/* { dg-do compile } */ + +int +foo (int integral, int decimal, int power_ten) +{ + while (power_ten 0) +{ + integral *= 10; + decimal *= 10; + power_ten--; +} + + return integral+decimal; +} + +/* Two occurrences in annotations, two in code. */ +/* { dg-final { scan-tree-dump-times \\* 10 4 vect } } */ +/* { dg-final { cleanup-tree-dump vect } } */ + Index: gcc/tree-vectorizer.h === --- gcc/tree-vectorizer.h (revision 189938) +++ gcc/tree-vectorizer.h (working copy) @@ -296,6 +296,12 @@ typedef struct _loop_vec_info { this. */ bool peeling_for_gaps; + /* Reductions are canonicalized so that the last operand is the reduction + operand. If this places a constant into RHS1, this decanonicalizes + GIMPLE for other phases, so we must track when this has occurred and + fix it up. */ + bool operands_swapped; + } *loop_vec_info; /* Access Functions. 
   */
@@ -326,6 +332,7 @@ typedef struct _loop_vec_info {
 #define LOOP_VINFO_PEELING_HTAB(L)      (L)->peeling_htab
 #define LOOP_VINFO_TARGET_COST_DATA(L)  (L)->target_cost_data
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)  (L)->peeling_for_gaps
+#define LOOP_VINFO_OPERANDS_SWAPPED(L)  (L)->operands_swapped
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
   VEC_length (gimple, (L)->may_misalign_stmts) > 0
 
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	(revision 189938)
+++ gcc/tree-vect-loop.c	(working copy)
@@ -853,6 +853,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_HTAB (res) = NULL;
   LOOP_VINFO_TARGET_COST_DATA (res) = init_cost (loop);
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
+  LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
 
   return res;
 }
@@ -873,6 +874,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   int j;
   VEC (slp_instance, heap) *slp_instances;
   slp_instance instance;
+  bool swapped;
 
   if (!loop_vinfo)
     return;
@@ -881,6 +883,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   bbs = LOOP_VINFO_BBS (loop_vinfo);
   nbbs = loop->num_nodes;
+  swapped = LOOP_VINFO_OPERANDS_SWAPPED (loop_vinfo);
 
   if (!clean_stmts)
     {
@@ -905,6 +908,22 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
       for (si = gsi_start_bb (bb); !gsi_end_p (si); )
         {
          gimple stmt = gsi_stmt (si);
+
+          /* We may have broken canonical form by moving a constant
+             into RHS1 of a commutative op.  Fix such occurrences.  */
+          if (swapped && is_gimple_assign (stmt))
+            {
+              enum tree_code code = gimple_assign_rhs_code (stmt);
+
+              if ((code == PLUS_EXPR
+                   || code == POINTER_PLUS_EXPR
+                   || code == MULT_EXPR)
+                  && CONSTANT_CLASS_P (gimple_assign_rhs1 (stmt)))
+                swap_tree_operands (stmt,
+                                    gimple_assign_rhs1_ptr (stmt),
+                                    gimple_assign_rhs2_ptr (stmt));
+            }
+
          /* Free stmt_vec_info.
             */
          free_stmt_vec_info (stmt);
          gsi_next (&si);
@@ -1920,6 +1939,9 @@ vect_is_slp_reduction (loop_vec_info loop_info, gi
                             gimple_assign_rhs1_ptr (next_stmt),
                             gimple_assign_rhs2_ptr (next_stmt));
          update_stmt (next_stmt);
+
+          if (CONSTANT_CLASS_P (gimple_assign_rhs1 (next_stmt)))
+            LOOP_VINFO_OPERANDS_SWAPPED (loop_info) = true;
        }
       else
        return false;
@@ -2324,6 +2346,9 @@ vect_is_simple_reduction_1
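The fix-up above rests on GIMPLE's canonical form for commutative operations: a constant operand always goes in the second slot. The following is an illustrative model of the swap-back step with invented types — real GIMPLE statements are not flat structs like this:

```c
#include <assert.h>

/* Stand-ins for the relevant tree codes.  */
enum tree_code_sketch { PLUS_EXPR, MULT_EXPR, MINUS_EXPR };

struct binop
{
  enum tree_code_sketch code;
  long op1, op2;
  int op1_is_const;   /* stand-in for CONSTANT_CLASS_P (rhs1) */
};

/* If a constant was swapped into the first operand of a commutative
   operation, swap it back so canonical form is restored.  Non-commutative
   codes (e.g. MINUS_EXPR) are left untouched.  */
static struct binop
restore_canonical_form (struct binop b)
{
  if ((b.code == PLUS_EXPR || b.code == MULT_EXPR) && b.op1_is_const)
    {
      long tmp = b.op1;
      b.op1 = b.op2;
      b.op2 = tmp;
      b.op1_is_const = 0;   /* the constant now sits in op2 */
    }
  return b;
}
```

So `10 + x` becomes `x + 10` again, while `10 - x` is left alone because subtraction is not commutative and never had its operands swapped by the reduction code.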
Re: [patch] Add explanations to sbitmap, bitmap, and sparseset
On Fri, 2012-07-27 at 15:40 +0200, Richard Guenther wrote: On Thu, Jul 26, 2012 at 11:57 AM, Steven Bosscher stevenb@gmail.com wrote: On Thu, Jul 26, 2012 at 11:23 AM, Richard Guenther richard.guent...@gmail.com wrote: Ok! Thanks for adding this exhaustive documentation. There's more to come! I want to add some explanations to ebitmap, pointer-set, fibheap, and splay-tree as sets, and add a chapter in the gccint manual too. Now if only you'd document those loop changes... ;-) Eh ... Btw, ebitmap is unused since it was added - maybe we should simply remove it ...? I wouldn't remove it just yet. I'm going to make sure that bitmap.[ch] and ebitmap.[ch] provide the same interface and see if there are places where ebitmap is a better choice than bitmap or sbitmap (cprop and gcse.c come to mind). Btw, just looking over sparseset.h what needs to be documented is that iterating over the set is faster than for an sbitmap but element ordering is random! Also it looks less efficient than sbitmap in the case when your main operation is adding to the set and querying the set randomly. It's space overhead is really huge - for smaller universes a smaller SPARSESET_ELT_TYPE would be nice, templates to the rescue! I wonder in which cases a unsigned HOST_WIDEST_FAST_INT sized universe is even useful (but a short instead of an int is probably too small ...) Another option for sparse sets would be a templatized version of Pugh's skip lists. Iteration is the same as a linked list and random access is logarithmic in the size of the set (not the universe). Space overhead is also logarithmic. The potential downside is that it involves pointers. Bill Richard. Ciao! Steven
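The sparseset properties discussed above (O(1) add and membership test, fast iteration, unordered traversal, large space overhead proportional to the universe) follow directly from the classic two-array sparse-set scheme. A minimal sketch of that scheme — names invented, not GCC's sparseset.h API — makes the trade-offs concrete:

```c
#include <assert.h>
#include <stdlib.h>

/* A minimal Briggs/Torczon-style sparse set.  Two arrays sized to the
   universe give O(1) add/test and O(n) iteration over the dense array,
   but iteration order is insertion order, not element order -- the
   "random ordering" caveat noted in the discussion.  */
typedef struct
{
  unsigned *dense;    /* the members, packed at the front */
  unsigned *sparse;   /* element -> index into dense */
  unsigned n;         /* current cardinality */
} sparse_set;

static sparse_set *
sparse_set_alloc (unsigned universe)
{
  sparse_set *s = malloc (sizeof *s);
  s->dense = malloc (universe * sizeof *s->dense);
  /* calloc so membership tests never read indeterminate memory.  */
  s->sparse = calloc (universe, sizeof *s->sparse);
  s->n = 0;
  return s;
}

/* E is a member iff its sparse slot points into the live prefix of
   dense, and that dense slot points back at E.  */
static int
sparse_set_member_p (const sparse_set *s, unsigned e)
{
  return s->sparse[e] < s->n && s->dense[s->sparse[e]] == e;
}

static void
sparse_set_add (sparse_set *s, unsigned e)
{
  if (!sparse_set_member_p (s, e))
    {
      s->sparse[e] = s->n;
      s->dense[s->n++] = e;
    }
}
```

Note the space cost: two arrays of the universe size regardless of how few elements are present, which is exactly the overhead complaint above, and why a smaller element type (or a skip list, whose space is proportional to the set) can be attractive for small universes.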
[PATCH] Change IVOPTS and strength reduction to use expmed cost model
Per Richard Henderson's suggestion (http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch changes the IVOPTS and straight-line strength reduction passes to make use of data computed by init_expmed. This required adding a new convert_cost array in expmed to store the costs of converting between various scalar integer modes, and exposing expmed's multiplication hash table for external use (new function mult_by_coeff_cost). Richard H, I'd appreciate it if you could look at what I did there and make sure it's correct. Thanks! I decided it wasn't worth distinguishing between reg-reg add costs and reg-constant add costs, so I simplified the strength reduction calculations rather than adding another array to expmed for this purpose. But I can make this distinction if that's preferable. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-25 Bill Schmidt wschm...@linux.ibm.com * tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove. (mbc_entry_eq): Likewise. (mult_costs): Likewise. (cost_tables_exist): Likewise. (initialize_costs): Likewise. (finalize_costs): Likewise. (tree_ssa_iv_optimize_init): Remove call to initialize_costs. (add_regs_cost): Remove. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (struct mbc_entry): Likewise. (multiply_by_const_cost): Likewise. (get_address_cost): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (force_expr_to_var_cost): Likewise. (difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost. (get_computation_cost_at): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (determine_iv_cost): Change add_regs_cost calls to add_cost lookups. (tree_ssa_iv_optimize_finalize): Remove call to finalize_costs. * tree-ssa-address.c (expmed.h): New #include. 
(most_expensive_mult_to_index): Change multiply_by_const_cost to mult_by_coeff_cost. * gimple-ssa-strength-reduction.c (expmed.h): New #include. (stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost, neg_cost, and convert_cost instead of IVOPTS interfaces. (execute_strength_reduction): Remove calls to initialize_costs and finalize_costs. * expmed.c (struct init_expmed_rtl): Add convert rtx_def. (init_expmed_one_mode): Initialize convert rtx_def; initialize convert_cost for related modes. (mult_by_coeff_cost): New function. * expmed.h (struct target_expmed): Add x_convert_cost matrix. (convert_cost): New #define. (mult_by_coeff_cost): New extern decl. * tree-flow.h (initialize_costs): Remove decl. (finalize_costs): Likewise. (multiply_by_const_cost): Likewise. (add_regs_cost): Likewise. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. Index: gcc/tree-ssa-loop-ivopts.c === --- gcc/tree-ssa-loop-ivopts.c (revision 189845) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -88,9 +88,6 @@ along with GCC; see the file COPYING3. If not see #include tree-ssa-propagate.h #include expmed.h -static hashval_t mbc_entry_hash (const void *); -static int mbc_entry_eq (const void*, const void *); - /* FIXME: Expressions are expanded to RTL in this pass to determine the cost of different addressing modes. This should be moved to a TBD interface between the GIMPLE and RTL worlds. */ @@ -381,11 +378,6 @@ struct iv_ca_delta static VEC(tree,heap) *decl_rtl_to_reset; -/* Cached costs for multiplies by constants, and a flag to indicate - when they're valid. */ -static htab_t mult_costs[2]; -static bool cost_tables_exist = false; - static comp_cost force_expr_to_var_cost (tree, bool); /* Number of uses recorded in DATA. */ @@ -851,26 +843,6 @@ htab_inv_expr_hash (const void *ent) return expr-hash; } -/* Allocate data structures for the cost model. 
*/ - -void -initialize_costs (void) -{ - mult_costs[0] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free); - mult_costs[1] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free); - cost_tables_exist = true; -} - -/* Release data structures for the cost model. */ - -void -finalize_costs (void) -{ - cost_tables_exist = false; - htab_delete (mult_costs[0]); - htab_delete (mult_costs[1]); -} - /* Initializes data structures used by the iv optimization pass, stored in DATA. */ @@ -889,8 +861,6 @@ tree_ssa_iv_optimize_init (struct ivopts_data *dat
Re: [PATCH] Change IVOPTS and strength reduction to use expmed cost model
On Wed, 2012-07-25 at 09:59 -0700, Richard Henderson wrote: On 07/25/2012 09:13 AM, William J. Schmidt wrote: Per Richard Henderson's suggestion (http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch changes the IVOPTS and straight-line strength reduction passes to make use of data computed by init_expmed. This required adding a new convert_cost array in expmed to store the costs of converting between various scalar integer modes, and exposing expmed's multiplication hash table for external use (new function mult_by_coeff_cost). Richard H, I'd appreciate it if you could look at what I did there and make sure it's correct. Thanks! Correctness looks good. I decided it wasn't worth distinguishing between reg-reg add costs and reg-constant add costs, so I simplified the strength reduction calculations rather than adding another array to expmed for this purpose. But I can make this distinction if that's preferable. I don't think this is worth thinking about at this level. This is something that some rtl-level optimization ought to be able to fix up trivially, e.g. cse. Index: gcc/expmed.h === --- gcc/expmed.h(revision 189845) +++ gcc/expmed.h(working copy) @@ -155,6 +155,11 @@ struct target_expmed { int x_udiv_cost[2][NUM_MACHINE_MODES]; int x_mul_widen_cost[2][NUM_MACHINE_MODES]; int x_mul_highpart_cost[2][NUM_MACHINE_MODES]; + + /* Conversion costs are only defined between two scalar integer modes + of different sizes. The first machine mode is the destination mode, + and the second is the source mode. */ + int x_convert_cost[2][NUM_MACHINE_MODES][NUM_MACHINE_MODES]; }; 2 * NUM_MACHINE_MODES is quite large... I think we could do better with #define NUM_MODE_INT (MAX_MODE_INT - MIN_MODE_INT + 1) x_convert_cost[2][NUM_MODE_INT][NUM_MODE_INT]; though really that could be done with all of these fields all at once. That does suggest it would be better to leave at least inline functions to access these elements, rather than open code the array access. 
r~ Thanks for the quick review! Excellent point about the array size. The attached revised patch follows your suggestion to limit the size. I only did this for the new field, as changing all the existing accessors to inline functions is more effort than I have time for right now. This is left as an exercise for the reader. ;) Bootstrapped and tested on powepc64-unknown-linux-gnu with no new failures. Is this ok? Thanks, Bill 2012-07-25 Bill Schmidt wschm...@linux.ibm.com * tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove. (mbc_entry_eq): Likewise. (mult_costs): Likewise. (cost_tables_exist): Likewise. (initialize_costs): Likewise. (finalize_costs): Likewise. (tree_ssa_iv_optimize_init): Remove call to initialize_costs. (add_regs_cost): Remove. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (struct mbc_entry): Likewise. (multiply_by_const_cost): Likewise. (get_address_cost): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (force_expr_to_var_cost): Likewise. (difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost. (get_computation_cost_at): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (determine_iv_cost): Change add_regs_cost calls to add_cost lookups. (tree_ssa_iv_optimize_finalize): Remove call to finalize_costs. * tree-ssa-address.c (expmed.h): New #include. (most_expensive_mult_to_index): Change multiply_by_const_cost to mult_by_coeff_cost. * gimple-ssa-strength-reduction.c (expmed.h): New #include. (stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost, neg_cost, and convert_cost instead of IVOPTS interfaces. (execute_strength_reduction): Remove calls to initialize_costs and finalize_costs. * expmed.c (struct init_expmed_rtl): Add convert rtx_def. (init_expmed_one_mode): Initialize convert rtx_def; initialize x_convert_cost for related modes. 
(mult_by_coeff_cost): New function. * expmed.h (NUM_MODE_INT): New #define. (struct target_expmed): Add x_convert_cost matrix. (set_convert_cost): New inline function. (convert_cost): Likewise. (mult_by_coeff_cost): New extern decl. * tree-flow.h (initialize_costs): Remove decl. (finalize_costs): Likewise. (multiply_by_const_cost): Likewise. (add_regs_cost): Likewise. (multiply_regs_cost): Likewise. (add_const_cost): Likewise
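The revised patch's space fix can be illustrated compactly. Conversion costs only exist between scalar integer modes, so the table is indexed by an offset from MIN_MODE_INT rather than by the full machine-mode space, and inline accessors hide that offset as Richard Henderson suggested. The mode enum below is a stand-in, not GCC's machmode.def:

```c
#include <assert.h>

/* Illustrative scalar integer modes; only the MIN/MAX bounds matter.  */
enum mode_sketch { QImode, HImode, SImode, DImode,
                   MIN_MODE_INT = QImode, MAX_MODE_INT = DImode };

#define NUM_MODE_INT (MAX_MODE_INT - MIN_MODE_INT + 1)

/* [speed][to][from]: NUM_MODE_INT^2 entries per speed setting instead
   of NUM_MACHINE_MODES^2 -- the size reduction from the review.  */
static int x_convert_cost[2][NUM_MODE_INT][NUM_MODE_INT];

/* Inline accessors keep the offset-by-MIN_MODE_INT indexing in one
   place rather than open-coding the array access at every use.  */
static inline void
set_convert_cost (enum mode_sketch to, enum mode_sketch from,
                  int speed, int cost)
{
  x_convert_cost[speed][to - MIN_MODE_INT][from - MIN_MODE_INT] = cost;
}

static inline int
convert_cost (enum mode_sketch to, enum mode_sketch from, int speed)
{
  return x_convert_cost[speed][to - MIN_MODE_INT][from - MIN_MODE_INT];
}
```

The asymmetry matters: the cost of DImode-from-SImode is stored separately from SImode-from-DImode, matching the comment in the patch that the first mode is the destination and the second the source.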
Re: [PING] Re: [RFC, ivopts] fix bugs in ivopts address cost computation
On Wed, 2012-07-25 at 13:39 -0600, Sandra Loosemore wrote: On 07/17/2012 05:22 AM, Richard Guenther wrote: On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore san...@codesourcery.com wrote: Ping? Original post with patch is here: http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html Can you update the patch and numbers based on what Bill did for straight-line strength reduction which re-uses this analysis/caching part? I will try to take another look at this once Bill has finished his work that touches on this; it's been hard for me to track a moving target. I was wondering if it might be more consistent with Bill's work to defer some of the address cost computation to new target hooks, after all. -Sandra Hi Sandra, I apologize for the mess. I should be done causing distress to this part of the code as soon as the patch I submitted today is committed. Sorry! Bill
Re: [PATCH] Vectorizer cost model outside-cost changes
On Tue, 2012-07-24 at 10:57 +0200, Richard Guenther wrote: On Mon, 23 Jul 2012, William J. Schmidt wrote: This patch completes the conversion of the vectorizer cost model to use target hooks for recording vectorization information and calculating costs. Previous work handled the costs inside the loop body or basic block being vectorized. This patch similarly converts the prologue and epilogue costs. As before, I first verified that the new model provides the same results as the old model on the regression testsuite and on SPEC CPU2006. I then removed the old model, rather than submitting an intermediate patch with both present. I have a patch that shows both if it's needed for reference. Also as before, I found an error in the old cost model wherein prologue costs of phi reduction statements were not being considered during the final vectorization decision. I have fixed this in the new model; thus, this version of the cost model will be slightly more conservative than the original. I am currently running SPEC tests to ensure there aren't any resulting degradations. One thing that could be done in future for further cleanup would be to handle the scalar iteration cost in a similar manner. Right now this is dealt with by recording N scalar_stmts, where N is the length of the scalar iteration; as with the old model, there is no attempt to differentiate between different scalar statements. This results in some hackish stuff in, e.g., tree-vect-stmts.c:record_stmt_cost (), where we have to deal with the fact that we may not have a stmt_info for the statement being recorded. This is only true for these aggregated scalar_stmt costs. Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Assuming the SPEC performance tests come out ok, is this ok for trunk? So all costs we query from the backend even for the prologue/epilogue are costs for vector stmts (like inits of invariant vectors or outer-loop parts in outer loop vectorization)? 
Yes, with the exception of copies of scalar iterations introduced by loop peeling (the N * scalar_stmt business). There are comments in several places indicating opportunities for improvement in the modeling, including for the outer-loop case, but for now your statement holds otherwise. Thanks, Bill Ok in that case. Thanks, Richard. Thanks! Bill
Ping: [PATCH] Fix PR46556 (straight-line strength reduction, part 2)
Ping... On Thu, 2012-06-28 at 16:45 -0500, William J. Schmidt wrote: Here's a relatively small piece of strength reduction that solves that pesky addressing bug that got me looking at this in the first place... The main part of the code is the stuff that was reviewed last year, but which needed to find a good home. So hopefully that's in pretty good shape. I recast base_cand_map as an htab again since I now need to look up trees other than SSA names. I plan to put together a follow-up patch to change code and commentary references so that base_name becomes base_expr. Doing that now would clutter up the patch too much. Bootstrapped and tested on powerpc64-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: PR tree-optimization/46556 * gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF. (base_cand_map): Change to hash table. (base_cand_hash): New function. (base_cand_free): Likewise. (base_cand_eq): Likewise. (lookup_cand): Change base_cand_map to hash table. (find_basis_for_candidate): Likewise. (base_cand_from_table): Exclude CAND_REF. (restructure_reference): New function. (slsr_process_ref): Likewise. (find_candidates_in_block): Call slsr_process_ref. (dump_candidate): Handle CAND_REF. (base_cand_dump_callback): New function. (dump_cand_chains): Change base_cand_map to hash table. (replace_ref): New function. (replace_refs): Likewise. (analyze_candidates_and_replace): Call replace_refs. (execute_strength_reduction): Change base_cand_map to hash table. gcc/testsuite: PR tree-optimization/46556 * testsuite/gcc.dg/tree-ssa/slsr-27.c: New. * testsuite/gcc.dg/tree-ssa/slsr-28.c: New. * testsuite/gcc.dg/tree-ssa/slsr-29.c: New. 
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); +} + +/* { dg-final { scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 3 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 12) +foo (p-a[n], p-c[n], p-b[n]); + else if (n 3) +foo (p-b[n], p-a[n], p-c[n]); +} + +/* { dg-final { scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 9 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 3) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 12) + foo (p-b[n], p-a[n], p-c[n]); +} +} + +/* { dg-final { 
scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 9 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189025) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -32,7 +32,7 @@ along with GCC; see the file COPYING3. If not see 2) Explicit multiplies, unknown constant multipliers, no conditional increments. (data gathering complete, replacements pending) - 3) Implicit multiplies in addressing expressions. (pending) + 3
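The test cases above all check that exactly one multiply survives. The reason is visible in byte-offset arithmetic: for the struct in the tests, the accesses `p->a[n]`, `p->b[n]`, and `p->c[n]` share the variable part `n * sizeof(int)` and differ only by a constant field offset, so one CAND_REF candidate can serve as the basis for the others. A small sketch (illustrative helper, not pass internals):

```c
#include <assert.h>

/* Field offsets for
     struct x { int a[16]; int b[16]; int c[16]; };
   expressed as integer constants.  */
enum { A_OFF = 0,
       B_OFF = 16 * sizeof (int),
       C_OFF = 32 * sizeof (int) };

/* Byte offset of element n of a field: constant part + shared
   variable part.  Only the variable part needs a multiply, and it is
   identical for all three fields.  */
static unsigned long
elt_offset (unsigned long field_off, unsigned long n)
{
  return field_off + n * sizeof (int);
}
```

Because the difference between any two of these offsets is a compile-time constant, the restructured references can reuse a single computed address plus constant displacements, which is what the `MEM[(struct x *)...]` patterns in the scans verify.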
Re: [PATCH] Add flag to control straight-line strength reduction
On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote: On Wed, 18 Jul 2012, Steven Bosscher wrote: On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther rguent...@suse.de wrote: On Tue, 17 Jul 2012, William J. Schmidt wrote: I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? The switch needs documentation in doc/invoke.texi. Other than that it's fine to stick with -ftree-..., even that exposes details to our users that are not necessary (RTL passes didn't have -frtl-... either). So in the end, why not re-use -fstrength-reduce that is already available (but stubbed out)? In the past, -fstrength-reduce applied to loop strength reduction in loop.c. I don't think it should be re-used for a completely different code transformation. Ok. I suppose -ftree-slsr is ok then. It turns out I was looking at a very old copy of the manual, and the -ftree... stuff is not as prevalent now as it once was. I'll just go with -fslsr to be consistent with -fgcse, -fipa-sra, etc. Thanks for the pointer to doc/invoke.texi -- it appears I also failed to document -fhoist-adjacent-loads, so I will go ahead and do that as well. Thanks! Bill Thanks, Richard.
Re: [PATCH] Add flag to control straight-line strength reduction
On Wed, 2012-07-18 at 08:24 -0500, William J. Schmidt wrote: On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote: On Wed, 18 Jul 2012, Steven Bosscher wrote: On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther rguent...@suse.de wrote: On Tue, 17 Jul 2012, William J. Schmidt wrote: I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? The switch needs documentation in doc/invoke.texi. Other than that it's fine to stick with -ftree-..., even that exposes details to our users that are not necessary (RTL passes didn't have -frtl-... either). So in the end, why not re-use -fstrength-reduce that is already available (but stubbed out)? In the past, -fstrength-reduce applied to loop strength reduction in loop.c. I don't think it should be re-used for a completely different code transformation. Ok. I suppose -ftree-slsr is ok then. It turns out I was looking at a very old copy of the manual, and the -ftree... stuff is not as prevalent now as it once was. I'll just go with -fslsr to be consistent with -fgcse, -fipa-sra, etc. Well, posted too fast. Paging down I see that isn't true, sorry. I'll use the tree- for consistency even though it is useless information. Thanks, Bill Thanks for the pointer to doc/invoke.texi -- it appears I also failed to document -fhoist-adjacent-loads, so I will go ahead and do that as well. Thanks! Bill Thanks, Richard.
Re: [PATCH] Add flag to control straight-line strength reduction
Here's the patch with documentation changes included. I also cleaned up missing work from a couple of my previous patches, so -fhoist-adjacent-loads is documented now, and -fvect-cost-model is added to the list of options on by default at -O3. Ok for trunk? Thanks, Bill 2012-07-18 Bill Schmidt wschm...@linux.ibm.com * doc/invoke.texi: Add -fhoist-adjacent-loads and -ftree-slsr to list of flags controlling optimization; add -ftree-slsr to list of flags enabled by default at -O; add -fhoist-adjacent-loads to list of flags enabled by default at -O2; add -fvect-cost-model to list of flags enabled by default at -O3; document -fhoist-adjacent-loads and -ftree-slsr. * opts.c (default_option): Make -ftree-slsr default at -O1 and above. * gimple-ssa-strength-reduction.c (gate_strength_reduction): Use flag_tree_slsr. * common.opt: Add -ftree-slsr with flag_tree_slsr. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 189574) +++ gcc/doc/invoke.texi (working copy) @@ -364,7 +364,8 @@ Objective-C and Objective-C++ Dialects}. -ffast-math -ffinite-math-only -ffloat-store -fexcess-precision=@var{style} @gol -fforward-propagate -ffp-contract=@var{style} -ffunction-sections @gol -fgcse -fgcse-after-reload -fgcse-las -fgcse-lm -fgraphite-identity @gol --fgcse-sm -fif-conversion -fif-conversion2 -findirect-inlining @gol +-fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol +-fif-conversion2 -findirect-inlining @gol -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-matrix-reorg @gol -fipa-pta -fipa-profile -fipa-pure-const -fipa-reference @gol @@ -413,8 +414,8 @@ Objective-C and Objective-C++ Dialects}. 
-ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol --ftree-reassoc @gol --ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol +-ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol +-ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol -funit-at-a-time -funroll-all-loops -funroll-loops @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol @@ -6259,6 +6260,7 @@ compilation time. -ftree-forwprop @gol -ftree-fre @gol -ftree-phiprop @gol +-ftree-slsr @gol -ftree-sra @gol -ftree-pta @gol -ftree-ter @gol @@ -6286,6 +6288,7 @@ also turns on the following optimization flags: -fdevirtualize @gol -fexpensive-optimizations @gol -fgcse -fgcse-lm @gol +-fhoist-adjacent-loads @gol -finline-small-functions @gol -findirect-inlining @gol -fipa-sra @gol @@ -6311,6 +6314,7 @@ Optimize yet more. @option{-O3} turns on all opti by @option{-O2} and also turns on the @option{-finline-functions}, @option{-funswitch-loops}, @option{-fpredictive-commoning}, @option{-fgcse-after-reload}, @option{-ftree-vectorize}, +@option{-fvect-cost-model}, @option{-ftree-partial-pre} and @option{-fipa-cp-clone} options. @item -O0 @@ -7129,6 +7133,13 @@ This flag is enabled by default at @option{-O} and Perform hoisting of loads from conditional pointers on trees. This pass is enabled by default at @option{-O} and higher. +@item -fhoist-adjacent-loads +@opindex hoist-adjacent-loads +Speculatively hoist loads from both branches of an if-then-else if the +loads are from adjacent locations in the same structure and the target +architecture has a conditional move instruction. This flag is enabled +by default at @option{-O2} and higher. + @item -ftree-copy-prop @opindex ftree-copy-prop Perform copy propagation on trees. 
This pass eliminates unnecessary @@ -7529,6 +7540,13 @@ defining expression. This results in non-GIMPLE c much more complex trees to work on resulting in better RTL generation. This is enabled by default at @option{-O} and higher. +@item -ftree-slsr +@opindex ftree-slsr +Perform straight-line strength reduction on trees. This recognizes related +expressions involving multiplications and replaces them by less expensive +calculations when possible. This is enabled by default at @option{-O} and +higher. + @item -ftree-vectorize @opindex ftree-vectorize Perform loop vectorization on trees. This flag is enabled by default at @@ -7550,7 +7568,8 @@ except at level @option{-Os} where it is disabled. @item -fvect-cost-model @opindex fvect-cost-model -Enable cost model for vectorization. +Enable cost model for vectorization. This option is enabled by default at +@option{-O3}. @item -ftree-vrp @opindex ftree-vrp Index: gcc/opts.c === --- gcc/opts.c (revision 189574) +++ gcc/opts.c (working copy) @@ -452,6 +452,7 @@
[PATCH] Add flag to control straight-line strength reduction
I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-17 Bill Schmidt wschm...@linux.ibm.com * opts.c (default_option): Make -ftree-slsr default at -O1 and above. * gimple-ssa-strength-reduction.c (gate_strength_reduction): Use flag_tree_slsr. * common.opt: Add -ftree-slsr with flag_tree_slsr. Index: gcc/opts.c === --- gcc/opts.c (revision 189574) +++ gcc/opts.c (working copy) @@ -452,6 +452,7 @@ static const struct default_options default_option { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 }, { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 }, { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 }, +{ OPT_LEVELS_1_PLUS, OPT_ftree_slsr, NULL, 1 }, /* -O2 optimizations. */ { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 }, Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189574) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -1501,7 +1501,7 @@ execute_strength_reduction (void) static bool gate_strength_reduction (void) { - return optimize > 0; + return flag_tree_slsr; } struct gimple_opt_pass pass_strength_reduction = Index: gcc/common.opt === --- gcc/common.opt (revision 189574) +++ gcc/common.opt (working copy) @@ -2080,6 +2080,10 @@ ftree-sink Common Report Var(flag_tree_sink) Optimization Enable SSA code sinking on trees +ftree-slsr +Common Report Var(flag_tree_slsr) Optimization +Perform straight-line strength reduction + ftree-sra Common Report Var(flag_tree_sra) Optimization Perform scalar replacement of aggregates
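The transformation the new flag gates can be sketched at the source level. The example below is illustrative C, not GCC code: it only models how SLSR treats a chain of multiplies that share a basis, rewriting the later ones as additions to the first.

```c
#include <assert.h>

/* Illustrative sketch (not GCC code) of what -ftree-slsr does: the
   multiplies below share the basis i * 4, so strength reduction can
   rewrite the later ones as additions to the first.  */

/* Before: three related multiplies with a common stride.  */
int addr_before (int i)
{
  int a = i * 4;
  int b = (i + 1) * 4;
  int c = (i + 2) * 4;
  return a + b + c;
}

/* After: the form SLSR aims for, one multiply plus cheap additions.  */
int addr_after (int i)
{
  int a = i * 4;
  int b = a + 4;
  int c = a + 8;
  return a + b + c;
}
```

Both forms compute the same values; the second avoids two multiplies, which is the cost the pass is weighing.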
[PATCH] Enable vectorizer cost model by default at -O3
The auto-vectorizer is overly aggressive when not constrained by the vectorizer cost model. Although the cost model is by no means perfect, it does a reasonable job of avoiding many poor vectorization decisions. Since the auto-vectorizer is enabled by default at -O3 and above, we should also enable the vectorizer cost model by default at -O3 and above. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-15 Bill Schmidt wschm...@linux.ibm.com * opts.c (default_option): Add -fvect-cost-model to default options at -O3 and above. Index: gcc/opts.c === --- gcc/opts.c (revision 189481) +++ gcc/opts.c (working copy) @@ -501,6 +501,7 @@ static const struct default_options default_option { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_ftree_vectorize, NULL, 1 }, +{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
[PATCH, committed] Fix PR53955
Configure with --disable-build-poststage1-with-cxx exposed functions that should have been marked static. Bootstrapped on powerpc-unknown-linux-gnu, committed as obvious. Thanks, Bill 2012-07-13 Bill Schmidt wschm...@linux.ibm.com PR bootstrap/53955 * config/spu/spu.c (spu_init_cost): Mark static. (spu_add_stmt_cost): Likewise. (spu_finish_cost): Likewise. (spu_destroy_cost_data): Likewise. * config/i386/i386.c (ix86_init_cost): Mark static. (ix86_add_stmt_cost): Likewise. (ix86_finish_cost): Likewise. (ix86_destroy_cost_data): Likewise. * config/rs6000/rs6000.c (rs6000_init_cost): Mark static. (rs6000_add_stmt_cost): Likewise. (rs6000_finish_cost): Likewise. (rs6000_destroy_cost_data): Likewise. Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision 189460) +++ gcc/config/spu/spu.c(working copy) @@ -6919,7 +6919,7 @@ spu_builtin_vectorization_cost (enum vect_cost_for /* Implement targetm.vectorize.init_cost. */ -void * +static void * spu_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -6929,7 +6929,7 @@ spu_init_cost (struct loop *loop_info ATTRIBUTE_UN /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned spu_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -6956,7 +6956,7 @@ spu_add_stmt_cost (void *data, int count, enum vec /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned spu_finish_cost (void *data) { return *((unsigned *) data); @@ -6964,7 +6964,7 @@ spu_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void spu_destroy_cost_data (void *data) { free (data); Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 189460) +++ gcc/config/i386/i386.c (working copy) @@ -40066,7 +40066,7 @@ ix86_autovectorize_vector_sizes (void) /* Implement targetm.vectorize.init_cost. 
*/ -void * +static void * ix86_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -40076,7 +40076,7 @@ ix86_init_cost (struct loop *loop_info ATTRIBUTE_U /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -40103,7 +40103,7 @@ ix86_add_stmt_cost (void *data, int count, enum ve /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned ix86_finish_cost (void *data) { return *((unsigned *) data); @@ -40111,7 +40111,7 @@ ix86_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void ix86_destroy_cost_data (void *data) { free (data); Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 189460) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -3522,7 +3522,7 @@ rs6000_preferred_simd_mode (enum machine_mode mode /* Implement targetm.vectorize.init_cost. */ -void * +static void * rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -3532,7 +3532,7 @@ rs6000_init_cost (struct loop *loop_info ATTRIBUTE /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned rs6000_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -3559,7 +3559,7 @@ rs6000_add_stmt_cost (void *data, int count, enum /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned rs6000_finish_cost (void *data) { return *((unsigned *) data); @@ -3567,7 +3567,7 @@ rs6000_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void rs6000_destroy_cost_data (void *data) { free (data);
[PATCH] Fix PR46556 (straight-line strength reduction, part 2)
Here's a relatively small piece of strength reduction that solves that pesky addressing bug that got me looking at this in the first place... The main part of the code is the stuff that was reviewed last year, but which needed to find a good home. So hopefully that's in pretty good shape. I recast base_cand_map as an htab again since I now need to look up trees other than SSA names. I plan to put together a follow-up patch to change code and commentary references so that base_name becomes base_expr. Doing that now would clutter up the patch too much. Bootstrapped and tested on powerpc64-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: PR tree-optimization/46556 * gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF. (base_cand_map): Change to hash table. (base_cand_hash): New function. (base_cand_free): Likewise. (base_cand_eq): Likewise. (lookup_cand): Change base_cand_map to hash table. (find_basis_for_candidate): Likewise. (base_cand_from_table): Exclude CAND_REF. (restructure_reference): New function. (slsr_process_ref): Likewise. (find_candidates_in_block): Call slsr_process_ref. (dump_candidate): Handle CAND_REF. (base_cand_dump_callback): New function. (dump_cand_chains): Change base_cand_map to hash table. (replace_ref): New function. (replace_refs): Likewise. (analyze_candidates_and_replace): Call replace_refs. (execute_strength_reduction): Change base_cand_map to hash table. gcc/testsuite: PR tree-optimization/46556 * testsuite/gcc.dg/tree-ssa/slsr-27.c: New. * testsuite/gcc.dg/tree-ssa/slsr-28.c: New. * testsuite/gcc.dg/tree-ssa/slsr-29.c: New. 
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); +} + +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 3 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 12) +foo (p->a[n], p->c[n], p->b[n]); + else if (n > 3) +foo (p->b[n], p->a[n], p->c[n]); +} + +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 3) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 12) + foo (p->b[n], p->a[n], p->c[n]); +} +} + +/* { dg-final {
scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189025) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -32,7 +32,7 @@ along with GCC; see the file COPYING3. If not see 2) Explicit multiplies, unknown constant multipliers, no conditional increments. (data gathering complete, replacements pending) - 3) Implicit multiplies in addressing expressions. (pending) + 3) Implicit multiplies in addressing expressions. (complete) 4) Explicit multiplies, conditional increments. (pending) It would also be possible to apply strength
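The "implicit multiplies in addressing expressions" case that these tests exercise can be modeled at the source level. The sketch below is illustrative C, not the pass itself: the pass works on GIMPLE address computations, and the manual pointer arithmetic here only models the effect of computing the scaled index once and reusing it across the three member accesses.

```c
#include <assert.h>

/* Source-level model (illustrative, not pass code) of CAND_REF:
   p->a[n], p->b[n] and p->c[n] each hide an n * sizeof (int)
   multiply plus a distinct constant offset, so one scaled index
   can be shared among them.  */

struct x { int a[16]; int b[16]; int c[16]; };

/* Naive form: three implicit multiplies by sizeof (int).  */
int sum_naive (struct x *p, unsigned int n)
{
  return p->a[n] + p->b[n] + p->c[n];
}

/* Strength-reduced form: the scaled index is computed once and
   reused with constant element offsets 0, 16 and 32.  */
int sum_reduced (struct x *p, unsigned int n)
{
  int *base = (int *) p + n;
  return base[0] + base[16] + base[32];
}
```

The two functions return the same value for any in-range n; the second form is what the restructured MEM_REFs in the dump amount to.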
[PATCH] Strength reduction
Here's a new version of the main strength reduction patch, addressing previous comments. A couple of quick notes: * I opened PR53773 and PR53774 for the cases where commutative operations were encountered with a constant in rhs1. This version of the patch still has the gcc_asserts in place to catch those cases, but I'll plan to remove those once the patch is approved. * You previously asked: +static slsr_cand_t +base_cand_from_table (tree base_in) +{ + slsr_cand mapping_key; + + gimple def = SSA_NAME_DEF_STMT (base_in); + if (!def) +return (slsr_cand_t) NULL; + + mapping_key.cand_stmt = def; + return (slsr_cand_t) htab_find (stmt_cand_map, &mapping_key); isn't that reachable via the base-name -> chain mapping for base_in? I had to review this a bit, but the answer is no. If you look at one of the algebraic manipulations in create_mul_ssa_cand as an example, base_in corresponds to Y. base_cand_from_table is looking for a candidate that has Y for its LHS. The base-name -> chain mapping is used to find all candidates that have B as the base_name. * I added a detailed explanation of what's going on with legal_cast_p. Hopefully this will be easier to understand now. I've bootstrapped this on powerpc64-unknown-linux-gnu with three new regressions (for which I opened the two bug reports). Ok for trunk after removing the asserts? Thanks, Bill gcc: 2012-06-25 Bill Schmidt wschm...@linux.ibm.com * tree-pass.h (pass_strength_reduction): New decl. * tree-ssa-loop-ivopts.c (initialize_costs): Make non-static. (finalize_costs): Likewise. * timevar.def (TV_TREE_SLSR): New timevar. * gimple-ssa-strength-reduction.c: New. * tree-flow.h (initialize_costs): New decl. (finalize_costs): Likewise. * Makefile.in (tree-ssa-strength-reduction.o): New dependencies. * passes.c (init_optimization_passes): Add pass_strength_reduction. gcc/testsuite: 2012-06-25 Bill Schmidt wschm...@linux.ibm.com * gcc.dg/tree-ssa/slsr-1.c: New test. * gcc.dg/tree-ssa/slsr-2.c: Likewise.
* gcc.dg/tree-ssa/slsr-3.c: Likewise. * gcc.dg/tree-ssa/slsr-4.c: Likewise. Index: gcc/tree-pass.h === --- gcc/tree-pass.h (revision 188890) +++ gcc/tree-pass.h (working copy) @@ -452,6 +452,7 @@ extern struct gimple_opt_pass pass_tm_memopt; extern struct gimple_opt_pass pass_tm_edges; extern struct gimple_opt_pass pass_split_functions; extern struct gimple_opt_pass pass_feedback_split_functions; +extern struct gimple_opt_pass pass_strength_reduction; /* IPA Passes */ extern struct simple_ipa_opt_pass pass_ipa_lower_emutls; Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c (revision 0) @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -fdump-tree-optimized" } */ + +extern void foo (int); + +void +f (int *p, unsigned int n) +{ + foo (*(p + n * 4)); + foo (*(p + 32 + n * 4)); + if (n > 3) +foo (*(p + 16 + n * 4)); + else +foo (*(p + 48 + n * 4)); +} + +/* { dg-final { scan-tree-dump-times "\\+ 128" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 64" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 192" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c (revision 0) @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -fdump-tree-optimized" } */ + +extern void foo (int); + +void +f (int *p, int n) +{ + foo (*(p + n++ * 4)); + foo (*(p + 32 + n++ * 4)); + foo (*(p + 16 + n * 4)); +} + +/* { dg-final { scan-tree-dump-times "\\+ 144" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 96" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/*
{ dg-options "-O3 -fdump-tree-optimized" } */ + +int +foo (int a[], int b[], int i) +{ + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + return i; +} + +/* { dg-final { scan-tree-dump-times "\\* 4" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 4" 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 8" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 12" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index:
Re: [PATCH] Strength reduction preliminaries
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote: On Thu, 21 Jun 2012, William J. Schmidt wrote: As promised, this breaks out the changes to the IVOPTS cost model and the added function in double-int.c. Please let me know if you would rather see me attempt to consolidate the IVOPTS logic into expmed.c per Richard H's suggestion. If we start to use it from multiple places that definitely makes sense, but you can move the stuff as a followup. OK, I'll put it on my list. I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. Huh. That's weird and should not happen. Still it makes sense to move this to a per-function cache given that its size is basically unbound. Can you introduce a initialize_costs () / finalize_costs () function pair that allocates / frees the tables and sets a global flag that you can then assert in the functions using those tables? Ok. + if (speed) +speed = 1; I suppose this is because bool is not bool when building with a C compiler? It really looks weird and if such is necessary I'd prefer something like +add_regs_cost (enum machine_mode mode, bool speed) { + static unsigned costs[NUM_MACHINE_MODES][2]; rtx seq; unsigned cost; unsigned sidx = speed ? 0 : 1; + if (costs[mode][sidx]) +return costs[mode][sidx]; + instead. I'm always paranoid about misuse of bools in C, but I suppose this is overkill. I'll just remove the code. Thanks, Bill Otherwise the patch is ok. Thanks, Richard. This reduces the scope of the hash table from a compilation unit to each individual function. 
If it's preferred to maintain compilation unit scope, then the initialization/finalization of the htabs can be pushed out to do_compile. But I doubt it's worth that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-06-21 Bill Schmidt wschm...@linux.ibm.com * double-int.c (double_int_multiple_of): New function. * double-int.h (double_int_multiple_of): New decl. * tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs. (mbc_entry_hash): New forward decl. (mbc_entry_eq): Likewise. (zero_cost): Change to no_cost. (mult_costs): New static var. (tree_ssa_iv_optimize_init): Initialize mult_costs. (add_cost): Change to add_regs_cost; distinguish costs by speed. (multiply_regs_cost): New function. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (multiply_by_cost): Change to multiply_by_const_cost; distinguish costs by speed. (get_address_cost): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (force_expr_to_var_cost): Change zero_cost to no_cost; change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (split_cost): Change zero_cost to no_cost. (ptr_difference_cost): Likewise. (difference_cost): Change zero_cost to no_cost; change multiply_by_cost to multiply_by_const_cost. (get_computation_cost_at): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (determine_use_iv_cost_generic): Change zero_cost to no_cost. (determine_iv_cost): Change add_cost to add_regs_cost. (iv_ca_new): Change zero_cost to no_cost. (tree_ssa_iv_optimize_finalize): Release storage for mult_costs. * tree-ssa-address.c (most_expensive_mult_to_index): Change multiply_by_cost to multiply_by_const_cost. * tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost. (add_regs_cost): New decl. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. 
(negate_reg_cost): Likewise. Index: gcc/double-int.c === --- gcc/double-int.c(revision 188839) +++ gcc/double-int.c(working copy) @@ -865,6 +865,26 @@ double_int_umod (double_int a, double_int b, unsig return double_int_mod (a, b, true, code); } +/* Return TRUE iff PRODUCT is an integral multiple of FACTOR, and return + the multiple in *MULTIPLE. Otherwise return FALSE and leave *MULTIPLE + unchanged. */ + +bool +double_int_multiple_of (double_int product, double_int factor, + bool unsigned_p, double_int *multiple) +{ + double_int
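The initialize_costs ()/finalize_costs () pairing Richard asks for can be sketched as follows. The table shape and names here are illustrative assumptions, not the IVOPTS code; the point is only the protocol of allocate/set flag, assert the flag on use, then free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of an initialize/finalize pair guarding a per-function cost
   cache.  Table size and names are illustrative, not GCC's.  */

static bool costs_initialized = false;
static unsigned *mult_cost_table;

void initialize_costs (void)
{
  mult_cost_table = (unsigned *) calloc (64, sizeof (unsigned));
  costs_initialized = true;
}

unsigned lookup_mult_cost (unsigned idx)
{
  /* Catch uses of the cache outside the initialize/finalize window.  */
  assert (costs_initialized);
  return mult_cost_table[idx];
}

void finalize_costs (void)
{
  free (mult_cost_table);
  mult_cost_table = NULL;
  costs_initialized = false;
}
```

This gives the cache per-function lifetime while still catching stray callers via the assertion.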
Re: [PATCH] Strength reduction preliminaries
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote: On Thu, 21 Jun 2012, William J. Schmidt wrote: I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. Huh. That's weird and should not happen. Still it makes sense to move this to a per-function cache given that its size is basically unbound. Hm, this appears not to be related to my changes. I ran into the same issue when bootstrapping some other change without any of the IVOPTS changes committed. In both cases the stuck lock occurred when compiling tree-vect-stmts.c. I'll try to debug this when I get some time, unless somebody else figures it out sooner. Bill
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 16:20 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. It was in the Correct cost model for strided loads thread. Ah, right, thanks. I think it will be best to make that a separate patch in the series. Like so: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. Just FYI, this is not quite as straightforward as I thought. There is some code in tree-vect-data-refs.c that computes costs for various peeling options and picks one of them. In most other places we can just pass the instructions to the back end at the same place that the costs are currently calculated, but not here. This will require some more major surgery to save the instructions needed from each peeling option and only pass along the ones that end up being chosen. The upside is the same sort of delayed emit is needed for the SLP ordering problem, so the infrastructure for this will be reusable for that problem. Grumble. Bill (1a) Split up the cost hooks (one for loads/stores with misalign parm, one for vector_stmt with tree_code, etc.). (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. I'll start work on this series of patches as I have time between other projects. Thanks! Richard.
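The "delayed emit" Bill describes amounts to recording the would-be cost entries for each peeling option and committing only the chosen option's entries to the eventual target hook. The structure below is an illustrative sketch under that assumption, not the vectorizer's actual data structures.

```c
#include <assert.h>

/* Illustrative sketch of deferred cost recording: entries are saved
   per peeling option and summed only when that option is chosen.
   Names and layout are assumptions, not tree-vect-data-refs.c code.  */

#define MAX_ENTRIES 16

struct option_costs
{
  int n;
  struct { int count; int cost_per_stmt; } entries[MAX_ENTRIES];
};

/* Record a statement's cost instead of summing it immediately.  */
void record_stmt_cost (struct option_costs *oc, int count, int cost)
{
  oc->entries[oc->n].count = count;
  oc->entries[oc->n].cost_per_stmt = cost;
  oc->n++;
}

/* Once a peeling option is picked, commit only its recorded entries.  */
int commit_costs (const struct option_costs *oc)
{
  int total = 0;
  for (int i = 0; i < oc->n; i++)
    total += oc->entries[i].count * oc->entries[i].cost_per_stmt;
  return total;
}
```

The same record-then-commit shape is what makes the infrastructure reusable for the SLP ordering problem mentioned above.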
[PATCH] Strength reduction preliminaries
As promised, this breaks out the changes to the IVOPTS cost model and the added function in double-int.c. Please let me know if you would rather see me attempt to consolidate the IVOPTS logic into expmed.c per Richard H's suggestion. I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. This reduces the scope of the hash table from a compilation unit to each individual function. If it's preferred to maintain compilation unit scope, then the initialization/finalization of the htabs can be pushed out to do_compile. But I doubt it's worth that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-06-21 Bill Schmidt wschm...@linux.ibm.com * double-int.c (double_int_multiple_of): New function. * double-int.h (double_int_multiple_of): New decl. * tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs. (mbc_entry_hash): New forward decl. (mbc_entry_eq): Likewise. (zero_cost): Change to no_cost. (mult_costs): New static var. (tree_ssa_iv_optimize_init): Initialize mult_costs. (add_cost): Change to add_regs_cost; distinguish costs by speed. (multiply_regs_cost): New function. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (multiply_by_cost): Change to multiply_by_const_cost; distinguish costs by speed. (get_address_cost): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (force_expr_to_var_cost): Change zero_cost to no_cost; change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. 
(split_cost): Change zero_cost to no_cost. (ptr_difference_cost): Likewise. (difference_cost): Change zero_cost to no_cost; change multiply_by_cost to multiply_by_const_cost. (get_computation_cost_at): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (determine_use_iv_cost_generic): Change zero_cost to no_cost. (determine_iv_cost): Change add_cost to add_regs_cost. (iv_ca_new): Change zero_cost to no_cost. (tree_ssa_iv_optimize_finalize): Release storage for mult_costs. * tree-ssa-address.c (most_expensive_mult_to_index): Change multiply_by_cost to multiply_by_const_cost. * tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost. (add_regs_cost): New decl. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. Index: gcc/double-int.c === --- gcc/double-int.c(revision 188839) +++ gcc/double-int.c(working copy) @@ -865,6 +865,26 @@ double_int_umod (double_int a, double_int b, unsig return double_int_mod (a, b, true, code); } +/* Return TRUE iff PRODUCT is an integral multiple of FACTOR, and return + the multiple in *MULTIPLE. Otherwise return FALSE and leave *MULTIPLE + unchanged. */ + +bool +double_int_multiple_of (double_int product, double_int factor, + bool unsigned_p, double_int *multiple) +{ + double_int remainder; + double_int quotient = double_int_divmod (product, factor, unsigned_p, + TRUNC_DIV_EXPR, &remainder); + if (double_int_zero_p (remainder)) +{ + *multiple = quotient; + return true; +} + + return false; +} + /* Set BITPOS bit in A.
*/ double_int double_int_setbit (double_int a, unsigned bitpos) Index: gcc/double-int.h === --- gcc/double-int.h(revision 188839) +++ gcc/double-int.h(working copy) @@ -150,6 +150,8 @@ double_int double_int_divmod (double_int, double_i double_int double_int_sdivmod (double_int, double_int, unsigned, double_int *); double_int double_int_udivmod (double_int, double_int, unsigned, double_int *); +bool double_int_multiple_of (double_int, double_int, bool, double_int *); + double_int double_int_setbit (double_int, unsigned); int double_int_ctz (double_int); Index: gcc/tree-ssa-loop-ivopts.c === --- gcc/tree-ssa-loop-ivopts.c (revision 188839) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -89,13 +89,11 @@ along with GCC; see the file COPYING3. If not see #include target.h #include
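The contract of the new double_int_multiple_of can be mirrored with native integers. This is an illustrative analogue, not the double-int implementation; it keeps the documented behavior of leaving *MULTIPLE unchanged on failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Native-integer analogue of double_int_multiple_of: return true iff
   PRODUCT is an integral multiple of FACTOR, storing the multiple;
   otherwise return false and leave *MULTIPLE unchanged.  */
bool
multiple_of (long long product, long long factor, long long *multiple)
{
  long long remainder = product % factor;
  if (remainder == 0)
    {
      *multiple = product / factor;
      return true;
    }
  return false;
}
```

The double-int version does the same with double_int_divmod and TRUNC_DIV_EXPR in place of the native % and /.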
Re: [Patch ping] Strength reduction
On Wed, 2012-06-20 at 13:11 +0200, Richard Guenther wrote: On Thu, Jun 14, 2012 at 3:21 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Pro forma ping. :) ;) I notice (with all of these functions) +unsigned +negate_cost (enum machine_mode mode, bool speed) +{ + static unsigned costs[NUM_MACHINE_MODES]; + rtx seq; + unsigned cost; + + if (costs[mode]) +return costs[mode]; + + start_sequence (); + force_operand (gen_rtx_fmt_e (NEG, mode, + gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)), + NULL_RTX); + seq = get_insns (); + end_sequence (); + + cost = seq_cost (seq, speed); + if (!cost) +cost = 1; that the cost[] array is independent on the speed argument. Thus whatever comes first determines the cost. Odd, and probably not good. A fix would be appreciated (even for the current code ...) - simply make the array costs[NUM_MACHINE_MODES][2]. As for the renaming - can you name the functions consistently? Thus the above would be negate_reg_cost? And maybe rename the other FIXME function, too? I agree with all this. I'll prepare all the cost model changes as a separate preliminaries patch. Index: gcc/tree-ssa-strength-reduction.c === --- gcc/tree-ssa-strength-reduction.c (revision 0) +++ gcc/tree-ssa-strength-reduction.c (revision 0) @@ -0,0 +1,1611 @@ +/* Straight-line strength reduction. + Copyright (C) 2012 Free Software Foundation, Inc. I know we have these 'tree-ssa-' names, but really this is gimple-ssa now ;) So, please name it gimple-ssa-strength-reduction.c. Will do. Vive la revolution? ;) + /* Access to the statement for subsequent modification. Cached to + save compile time. */ + gimple_stmt_iterator cand_gsi; this is a iterator for cand_stmt? Then caching it is no longer necessary as the iterator is the stmt itself after recent infrastructure changes. Oh yeah, I remember seeing that go by. Nice. Will change. +/* Hash table embodying a mapping from statements to candidates. */ +static htab_t stmt_cand_map; ... 
+static hashval_t +stmt_cand_hash (const void *p) +{ + return htab_hash_pointer (((const_slsr_cand_t) p)->cand_stmt); +} use a pointer-map instead. +/* Callback to produce a hash value for a candidate chain header. */ + +static hashval_t +base_cand_hash (const void *p) +{ + tree ssa_name = ((const_cand_chain_t) p)->base_name; + + if (TREE_CODE (ssa_name) != SSA_NAME) +return (hashval_t) 0; + + return (hashval_t) SSA_NAME_VERSION (ssa_name); +} does it ever happen that ssa_name is not an SSA_NAME? Not in this patch, but when I introduce CAND_REF in a later patch it could happen since the base field of a CAND_REF is a MEM_REF. It's a safety valve in case of misuse. I'll think about this some more. I'm not sure the memory savings over simply using a fixed-size (num_ssa_names) array indexed by SSA_NAME_VERSION pointing to the chain is worth using a hashtable for this? That's reasonable. I'll do that. + node = (cand_chain_t) pool_alloc (chain_pool); + node->base_name = c->base_name; If you never free pool entries it's more efficient to use an obstack. alloc-pool only pays off if you get freed item re-use. OK. I'll change both cand_pool and chain_pool to obstacks. + switch (gimple_assign_rhs_code (gs)) +{ +case MULT_EXPR: + rhs2 = gimple_assign_rhs2 (gs); + + if (TREE_CODE (rhs2) == INTEGER_CST) + return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed); + + if (TREE_CODE (rhs1) == INTEGER_CST) + return multiply_by_cost (TREE_INT_CST_LOW (rhs1), lhs_mode, speed); In theory all commutative statements should have constant operands only at rhs2 ... I'm glad I'm not the only one who thought that was the theory. ;) I wasn't sure, and I've seen violations of this come up in practice. Should I assert when that happens instead, and track down the offending optimizations? Also you do not verify that the constant fits in a host-wide-int - but maybe you do not care? 
Thus, I'd do if (host_integerp (rhs2, 0)) return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed); or make multiply_by[_const?]_cost take a double-int instead. Likewise below for add. Ok. Name change looks good also, I'll include that in the cost model changes. +case MODIFY_EXPR: + /* Be suspicious of assigning costs to copies that may well go away. */ + return 0; MODIFY_EXPR is never a gimple_assign_rhs_code. Simple copies have a code of SSA_NAME for example. But as you assert if you get to an unhandled code I wonder why you needed the above ... I'll remove this, and document that we are deliberately not touching copies (which was my original intent). +static slsr_cand_t +base_cand_from_table (tree base_in) +{ + slsr_cand mapping_key
Re: [Patch ping] Strength reduction
On Wed, 2012-06-20 at 11:52 -0700, Richard Henderson wrote: On 06/20/2012 04:11 AM, Richard Guenther wrote: I notice (with all of these functions) +unsigned +negate_cost (enum machine_mode mode, bool speed) +{ + static unsigned costs[NUM_MACHINE_MODES]; + rtx seq; + unsigned cost; + + if (costs[mode]) +return costs[mode]; + + start_sequence (); + force_operand (gen_rtx_fmt_e (NEG, mode, + gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)), +NULL_RTX); I don't suppose there's any way to share data with what init_expmed computes? Not, strictly speaking, the cleanest thing to include expmed.h here, but surely a tad better than re-computing identical data (and without the clever rtl garbage avoidance tricks). Interesting. I was building on what ivopts already has; not sure of the history there. It looks like there is some overlap in function, but expmed doesn't have everything ivopts uses today (particularly the hash table of costs for multiplies by various constants). The stuff I need for type promotion/demotion is also not present (which I'm computing on demand for whatever mode pairs are encountered). Not sure how great it would be to precompute that for all pairs, and obviously precomputing costs of multiplying by all constants isn't going to work. So if the two functionalities were to be combined, it would seem to require some modification to how expmed works. Thanks, Bill r~
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 12:08 +0200, Richard Guenther wrote: On Mon, 18 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard. I've been looking at this in between other projects. I wanted to be sure I understood the SLP infrastructure and whether it would cause any problems. It looks to me like it will be mostly ok. One issue I noticed is a possible difference in the order in which SLP instructions are analyzed and the order in which the instructions are issued during transformation. For both loop analysis and basic block analysis, SLP trees are constructed and analyzed prior to examining other vectorizable instructions. Their costs are calculated and stored in the SLP trees at this time. 
Later, when transforming statements to their vector equivalents, instructions in the block (or loop body) are processed in order until the first instruction that's part of an SLP tree is encountered. At that point, every instruction that's part of any SLP tree is transformed; then the vectorizer continues with the remaining non-SLP vectorizable statements. So if we do the natural and easy thing of placing calls to add_stmt_cost everywhere that costs are calculated today, the order that those costs are presented to the back end model will possibly be different than the order they are actually emitted. Interesting. But I suppose this is similar to how pattern statements are handled? Thus, the whole pattern sequence is processed when we encounter the main pattern statement? Yes, but the difference is that both vect_analyze_stmt and vect_transform_loop handle the pattern statements in the same order (thankfully -- I would hate to have to deal with the pattern mess). With SLP, all SLP statements are analyzed ahead of time, but they aren't transformed until one of them is encountered in the statement walk. For a first cut at this, I suggest ignoring the problem other than to document it as an opportunity for improvement. Later we could improve it by using an add_stmt_slp_cost () interface (or adding an is_slp flag), and another interface to be called at the time during analysis when the SLP statements will be issued during transformation. This would allow the back end model to queue up the SLP costs in a separate vector and later place them in its internal structures at the appropriate place. It should eventually be possible to remove these fields/accessors: * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST However, I think this should be delayed until we have the basic infrastructure in place for the new model and well-tested. Indeed. 
The other issue is that we should have the model track both the inside and outside costs if we're going to get everything into the target model. For a first pass we can ignore this and keep the existing logic for the outside costs. Later we should add some interfaces analogous to add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so the model can track this stuff as carefully as it wants to. Outside costs are merely added to the niter * inner-cost metric to be compared with the scalar cost niter * scalar-cost, right? Thus they would be tracked completely separate - eventually similar to how we compute the cost of the scalar loop. Yes, that's the way they're used today, and probably nobody will ever want to get fancier than that. But as you say, the idea would be to let them be tracked similarly
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 12:10 +0200, Richard Guenther wrote: On Mon, 18 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); By the way, I don't see much point in passing the void *data around here. Too many levels of interfaces that we'd have to pass it around in the vectorizer, so it would just sit in a static variable. Might as well let the data be wholly private to the target. Ok, so you'd have void init_cost (struct loop *) and unsigned finish_cost (void); then? Static variables are of couse not properly abstracted so we can't ever compute two set of costs at the same time ... but that's true all-over-the-place in GCC ... It's a fair point, and perhaps I'll decide to pass the data pointer around anyway to keep that option open. We'll see which looks uglier. With previous discussion the add_stmt_cost hook would be split up to also allow passing the operation code for example. 
I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. Thanks, Bill Richard.
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. It was in the Correct cost model for strided loads thread. Ah, right, thanks. I think it will be best to make that a separate patch in the series. Like so: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. (1a) Split up the cost hooks (one for loads/stores with misalign parm, one for vector_stmt with tree_code, etc.). (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. I'll start work on this series of patches as I have time between other projects. Thanks, Bill Richard.
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard. I've been looking at this in between other projects. I wanted to be sure I understood the SLP infrastructure and whether it would cause any problems. It looks to me like it will be mostly ok. One issue I noticed is a possible difference in the order in which SLP instructions are analyzed and the order in which the instructions are issued during transformation. For both loop analysis and basic block analysis, SLP trees are constructed and analyzed prior to examining other vectorizable instructions. Their costs are calculated and stored in the SLP trees at this time. 
Later, when transforming statements to their vector equivalents, instructions in the block (or loop body) are processed in order until the first instruction that's part of an SLP tree is encountered. At that point, every instruction that's part of any SLP tree is transformed; then the vectorizer continues with the remaining non-SLP vectorizable statements. So if we do the natural and easy thing of placing calls to add_stmt_cost everywhere that costs are calculated today, the order that those costs are presented to the back end model will possibly be different than the order they are actually emitted. For a first cut at this, I suggest ignoring the problem other than to document it as an opportunity for improvement. Later we could improve it by using an add_stmt_slp_cost () interface (or adding an is_slp flag), and another interface to be called at the time during analysis when the SLP statements will be issued during transformation. This would allow the back end model to queue up the SLP costs in a separate vector and later place them in its internal structures at the appropriate place. It should eventually be possible to remove these fields/accessors: * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST However, I think this should be delayed until we have the basic infrastructure in place for the new model and well-tested. The other issue is that we should have the model track both the inside and outside costs if we're going to get everything into the target model. For a first pass we can ignore this and keep the existing logic for the outside costs. Later we should add some interfaces analogous to add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so the model can track this stuff as carefully as it wants to. 
So, I'd propose going at this in several phases: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. Item (x) can happen anytime after item (1). I don't think this work is terribly difficult, just a bit tedious. The only really time-consuming aspect of it will be in very careful testing to keep from changing existing behavior. All comments welcome -- please let me know what you think. Thanks, Bill
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); By the way, I don't see much point in passing the void *data around here. Too many levels of interfaces that we'd have to pass it around in the vectorizer, so it would just sit in a static variable. Might as well let the data be wholly private to the target. with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard.
[PATCH] Fix PR53703
The test case exposes a bug that occurs only when a diamond control flow pattern has the arguments of the joining phi in a different order from the successor arcs of the entry block. My logic for setting bb_for_def[12] was just brain-dead. This cleans that up and also prevents wasting time examining phis of virtual ops, which I noticed happening while debugging this. Bootstrapped and regtested on powerpc64-unknown-linux-gnu with no new failures. Ok for trunk? Thanks, Bill gcc: 2012-06-17 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53703 * tree-ssa-phiopt.c (hoist_adjacent_loads): Skip virtual phis; correctly set bb_for_def[12]. gcc/testsuite: 2012-06-17 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53703 * gcc.dg/torture/pr53703.c: New test. Index: gcc/testsuite/gcc.dg/torture/pr53703.c === --- gcc/testsuite/gcc.dg/torture/pr53703.c (revision 0) +++ gcc/testsuite/gcc.dg/torture/pr53703.c (revision 0) @@ -0,0 +1,149 @@ +/* Reduced test case from PR53703. Used to ICE. 
*/ + +/* { dg-do compile } */ +/* { dg-options -w } */ + +typedef long unsigned int size_t; +typedef unsigned short int sa_family_t; +struct sockaddr {}; +typedef unsigned char __u8; +typedef unsigned short __u16; +typedef unsigned int __u32; +struct nlmsghdr { + __u32 nlmsg_len; + __u16 nlmsg_type; +}; +struct ifaddrmsg { + __u8 ifa_family; +}; +enum { + IFA_ADDRESS, + IFA_LOCAL, +}; +enum { + RTM_NEWLINK = 16, + RTM_NEWADDR = 20, +}; +struct rtattr { + unsigned short rta_len; + unsigned short rta_type; +}; +struct ifaddrs { + struct ifaddrs *ifa_next; + unsigned short ifa_flags; +}; +typedef unsigned short int uint16_t; +typedef unsigned int uint32_t; +struct nlmsg_list { + struct nlmsg_list *nlm_next; + int size; +}; +struct rtmaddr_ifamap { + void *address; + void *local; + int address_len; + int local_len; +}; +int usagi_getifaddrs (struct ifaddrs **ifap) +{ + struct nlmsg_list *nlmsg_list, *nlmsg_end, *nlm; + size_t dlen, xlen, nlen; + int build; + for (build = 0; build = 1; build++) +{ + struct ifaddrs *ifl = ((void *)0), *ifa = ((void *)0); + struct nlmsghdr *nlh, *nlh0; + uint16_t *ifflist = ((void *)0); + struct rtmaddr_ifamap ifamap; + for (nlm = nlmsg_list; nlm; nlm = nlm-nlm_next) + { + int nlmlen = nlm-size; + for (nlh = nlh0; + ((nlmlen) = (int)sizeof(struct nlmsghdr) +(nlh)-nlmsg_len = sizeof(struct nlmsghdr) +(nlh)-nlmsg_len = (nlmlen)); + nlh = ((nlmlen) -= ( (((nlh)-nlmsg_len)+4U -1) ~(4U -1) ), + (struct nlmsghdr*)(((char*)(nlh)) ++ ( (((nlh)-nlmsg_len)+4U -1) + ~(4U -1) + { + struct ifinfomsg *ifim = ((void *)0); + struct ifaddrmsg *ifam = ((void *)0); + struct rtattr *rta; + sa_family_t nlm_family = 0; + uint32_t nlm_scope = 0, nlm_index = 0; + memset (ifamap, 0, sizeof (ifamap)); + switch (nlh-nlmsg_type) + { + case RTM_NEWLINK: + ifim = (struct ifinfomsg *) + ((void*)(((char*)nlh) ++ ((0)+( int) +( ((sizeof(struct nlmsghdr))+4U -1) + ~(4U -1) )))+4U -1) + ~(4U -1) ; + case RTM_NEWADDR: + ifam = (struct ifaddrmsg *) + ((void*)(((char*)nlh) ++ 
((0)+( int) +( ((sizeof(struct nlmsghdr))+4U -1) + ~(4U -1) )))+4U -1) + ~(4U -1) ; + nlm_family = ifam-ifa_family; + if (build) + ifa-ifa_flags = ifflist[nlm_index]; + break; + default: + continue; + } + if (!build) + { + void *rtadata = ((void*)(((char*)(rta)) + + (( ((sizeof(struct rtattr))+4 -1) +~(4 -1) ) + (0; + size_t rtapayload = ((int)((rta)-rta_len) + - (( ((sizeof(struct rtattr))+4 -1) +~(4 -1) ) + (0))); + switch (nlh-nlmsg_type) + { + case RTM_NEWLINK: + break; + case RTM_NEWADDR: + if (nlm_family == 17) + break; + switch (rta-rta_type) + { +
Re: [PATCH, RFC] First cut at using vec_construct for strided loads
On Wed, 2012-06-13 at 11:26 +0200, Richard Guenther wrote: On Tue, 12 Jun 2012, William J. Schmidt wrote: This patch is a follow-up to the discussion generated by http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html. I've added vec_construct to the cost model for use in vect_model_load_cost, and implemented a cost calculation that makes sense to me for PowerPC. I'm less certain about the default, i386, and spu implementations. I took a guess at i386 from the discussions we had, and used the same calculation for the default and for spu. I'm hoping you or others can fill in the blanks if I guessed badly. The i386 cost for vec_construct is different from all the others, which are parameterized for each processor description. This should probably be parameterized in some way as well, but thought you'd know better than I how that should be. Perhaps instead of elements / 2 + 1 it should be (elements / 2) * X + Y where X and Y are taken from the processor description, and represent the cost of a merge and a permute, respectively. Let me know what you think. Looks good to me with the gcc_asserts removed - TYPE_VECTOR_SUBPARTS might be 1 for V1TImode for example (heh, not that the vectorizer would vectorize to that). But I don't see any possible breakage with elements == 1, do you? No, that was some unnecessary sanity testing I was doing for my own curiosity. I'll pull them out and pop this in today. Thanks for the review! Bill Target maintainers can improve on the cost calculation if they wish, the default looks sensible to me. Thanks, Richard. Thanks, Bill 2012-06-12 Bill Schmidt wschm...@linux.ibm.com * targhooks.c (default_builtin_vectorized_conversion): Handle vec_construct, using vectype to base cost on subparts. * target.h (enum vect_cost_for_stmt): Add vec_construct. * tree-vect-stmts.c (vect_model_load_cost): Use vec_construct instead of scalar_to-vec. * config/spu/spu.c (spu_builtin_vectorization_cost): Handle vec_construct in same way as default for now. 
* config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise. * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Handle vec_construct, including special case for 32-bit loads. Index: gcc/targhooks.c === --- gcc/targhooks.c (revision 188482) +++ gcc/targhooks.c (working copy) @@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in int default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost case cond_branch_taken: return 3; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); + gcc_assert (elements 1); + return elements / 2 + 1; + default: gcc_unreachable (); } Index: gcc/target.h === --- gcc/target.h(revision 188482) +++ gcc/target.h(working copy) @@ -146,7 +146,8 @@ enum vect_cost_for_stmt cond_branch_not_taken, cond_branch_taken, vec_perm, - vec_promote_demote + vec_promote_demote, + vec_construct }; /* The target structure. This holds all the backend hooks. */ Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188482) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. 
*/ + tree vectype = STMT_VINFO_VECTYPE (stmt_info); inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies - * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + * TYPE_VECTOR_SUBPARTS (vectype)); + inside_cost += ncopies + * targetm.vectorize.builtin_vectorization_cost (vec_construct, + vectype, 0); } else vect_get_load_cost (first_dr, ncopies, Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision
[PATCH, committed] Fix PR53647
It turns out we have some old machine descriptions that have no L1 cache, so we must account for a zero line size. Regstrapped on powerpc64-unknown-linux-gnu with no new failures, committed as obvious. Thanks, Bill 2012-06-13 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53647 * tree-ssa-phiopt.c (gate_hoist_loads): Skip transformation for targets with no defined cache line size. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 188482) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -1976,12 +1976,14 @@ hoist_adjacent_loads (basic_block bb0, basic_block /* Determine whether we should attempt to hoist adjacent loads out of diamond patterns in pass_phiopt. Always hoist loads if -fhoist-adjacent-loads is specified and the target machine has - a conditional move instruction. */ + both a conditional move instruction and a defined cache line size. */ static bool gate_hoist_loads (void) { - return (flag_hoist_adjacent_loads == 1 && HAVE_conditional_move); + return (flag_hoist_adjacent_loads == 1 + && PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE) + && HAVE_conditional_move); } /* Always do these optimizations if we have SSA
[PATCH] Some vector cost model cleanup
This is just some general maintenance to the vectorizer's cost model code: * Corrects a typo in a function name; * Eliminates an unnecessary function; * Combines some duplicate inline functions. Bootstrapped and tested on powerpc64-unknown-linux-gnu, no new regressions. Ok for trunk? Thanks, Bill 2012-06-13 Bill Schmidt wschm...@linux.ibm.com * tree-vectorizer.h (vect_get_stmt_cost): Move from tree-vect-stmts.c. (cost_for_stmt): Remove decl. (vect_get_single_scalar_iteration_cost): Correct typo in name. * tree-vect-loop.c (vect_get_cost): Remove. (vect_get_single_scalar_iteration_cost): Correct typo in name; use vect_get_stmt_cost rather than vect_get_cost. (vect_get_known_peeling_cost): Use vect_get_stmt_cost rather than vect_get_cost. (vect_estimate_min_profitable_iters): Correct typo in call to vect_get_single_scalar_iteration_cost; use vect_get_stmt_cost rather than vect_get_cost. (vect_model_reduction_cost): Use vect_get_stmt_cost rather than vect_get_cost. (vect_model_induction_cost): Likewise. * tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Correct typo in call to vect_get_single_scalar_iteration_cost. * tree-vect-stmts.c (vect_get_stmt_cost): Move to tree-vectorizer.h. (cost_for_stmt): Remove unnecessary function. * Makefile.in (TREE_VECTORIZER_H): Update dependencies. Index: gcc/tree-vectorizer.h === --- gcc/tree-vectorizer.h (revision 188507) +++ gcc/tree-vectorizer.h (working copy) @@ -23,6 +23,7 @@ along with GCC; see the file COPYING3. If not see #define GCC_TREE_VECTORIZER_H #include tree-data-ref.h +#include target.h typedef source_location LOC; #define UNKNOWN_LOC UNKNOWN_LOCATION @@ -769,6 +770,18 @@ vect_pow2 (int x) return res; } +/* Get cost by calling cost target builtin. 
*/ + +static inline +int vect_get_stmt_cost (enum vect_cost_for_stmt type_of_cost) +{ + tree dummy_type = NULL; + int dummy = 0; + + return targetm.vectorize.builtin_vectorization_cost (type_of_cost, + dummy_type, dummy); +} + /*-*/ /* Info on data references alignment. */ /*-*/ @@ -843,7 +856,6 @@ extern void vect_model_load_cost (stmt_vec_info, i extern void vect_finish_stmt_generation (gimple, gimple, gimple_stmt_iterator *); extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info); -extern int cost_for_stmt (gimple); extern tree vect_get_vec_def_for_operand (tree, gimple, tree *); extern tree vect_init_vector (gimple, tree, tree, gimple_stmt_iterator *); @@ -919,7 +931,7 @@ extern int vect_estimate_min_profitable_iters (loo extern tree get_initial_def_for_reduction (gimple, tree, tree *); extern int vect_min_worthwhile_factor (enum tree_code); extern int vect_get_known_peeling_cost (loop_vec_info, int, int *, int); -extern int vect_get_single_scalar_iteraion_cost (loop_vec_info); +extern int vect_get_single_scalar_iteration_cost (loop_vec_info); /* In tree-vect-slp.c. */ extern void vect_free_slp_instance (slp_instance); Index: gcc/tree-vect-loop.c === --- gcc/tree-vect-loop.c(revision 188507) +++ gcc/tree-vect-loop.c(working copy) @@ -1201,19 +1201,6 @@ vect_analyze_loop_form (struct loop *loop) } -/* Get cost by calling cost target builtin. */ - -static inline int -vect_get_cost (enum vect_cost_for_stmt type_of_cost) -{ - tree dummy_type = NULL; - int dummy = 0; - - return targetm.vectorize.builtin_vectorization_cost (type_of_cost, - dummy_type, dummy); -} - - /* Function vect_analyze_loop_operations. Scan the loop stmts and make sure they are all vectorizable. */ @@ -2385,7 +2372,7 @@ vect_force_simple_reduction (loop_vec_info loop_in /* Calculate the cost of one scalar iteration of the loop. 
*/ int -vect_get_single_scalar_iteraion_cost (loop_vec_info loop_vinfo) +vect_get_single_scalar_iteration_cost (loop_vec_info loop_vinfo) { struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); @@ -2434,12 +2421,12 @@ int if (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt))) { if (DR_IS_READ (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt - stmt_cost = vect_get_cost (scalar_load); + stmt_cost = vect_get_stmt_cost (scalar_load); else - stmt_cost = vect_get_cost (scalar_store); + stmt_cost = vect_get_stmt_cost (scalar_store); }
Re: [PATCH] Correct cost model for strided loads
On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote: Btw, with PR53533 I now have a case where multiplications of v4si are really expensive on x86 without SSE 4.1. But we only have vect_stmt_cost and no further subdivision ... Thus we'd need a tree_code argument to the cost hook. Though it gets quite overloaded then, so maybe splitting it into one handling loads/stores (and get the misalign parameter) and one handling only vector_stmt but with a tree_code argument. Or splitting it even further, seeing cond_branch_taken ... Yes, I think subdividing the hook for the vector_stmt kind is pretty much inevitable -- more situations like this expensive multiply will arise. I agree with the interface starting to get messy also. Splitting it is probably the way to go -- a little painful but keeping it all in one hook is going to get ugly. Bill Richard.
[PATCH, RFC] First cut at using vec_construct for strided loads
This patch is a follow-up to the discussion generated by http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html. I've added vec_construct to the cost model for use in vect_model_load_cost, and implemented a cost calculation that makes sense to me for PowerPC. I'm less certain about the default, i386, and spu implementations. I took a guess at i386 from the discussions we had, and used the same calculation for the default and for spu. I'm hoping you or others can fill in the blanks if I guessed badly. The i386 cost for vec_construct is different from all the others, which are parameterized for each processor description. This should probably be parameterized in some way as well, but I thought you'd know better than I how that should be done. Perhaps instead of elements / 2 + 1 it should be (elements / 2) * X + Y, where X and Y are taken from the processor description and represent the cost of a merge and a permute, respectively. Let me know what you think. Thanks, Bill 2012-06-12 Bill Schmidt wschm...@linux.ibm.com * targhooks.c (default_builtin_vectorization_cost): Handle vec_construct, using vectype to base cost on subparts. * target.h (enum vect_cost_for_stmt): Add vec_construct. * tree-vect-stmts.c (vect_model_load_cost): Use vec_construct instead of scalar_to_vec. * config/spu/spu.c (spu_builtin_vectorization_cost): Handle vec_construct in same way as default for now. * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise. * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Handle vec_construct, including special case for 32-bit loads. 
Index: gcc/targhooks.c === --- gcc/targhooks.c (revision 188482) +++ gcc/targhooks.c (working copy) @@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in int default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost case cond_branch_taken: return 3; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); + gcc_assert (elements > 1); + return elements / 2 + 1; + default: gcc_unreachable (); } Index: gcc/target.h === --- gcc/target.h(revision 188482) +++ gcc/target.h(working copy) @@ -146,7 +146,8 @@ enum vect_cost_for_stmt cond_branch_not_taken, cond_branch_taken, vec_perm, - vec_promote_demote + vec_promote_demote, + vec_construct }; /* The target structure. This holds all the backend hooks. */ Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188482) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. 
*/ + tree vectype = STMT_VINFO_VECTYPE (stmt_info); inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies - * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + * TYPE_VECTOR_SUBPARTS (vectype)); + inside_cost += ncopies + * targetm.vectorize.builtin_vectorization_cost (vec_construct, + vectype, 0); } else vect_get_load_cost (first_dr, ncopies, Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision 188482) +++ gcc/config/spu/spu.c(working copy) @@ -6908,9 +6908,11 @@ spu_builtin_mask_for_load (void) /* Implement targetm.vectorize.builtin_vectorization_cost. */ static int spu_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -6937,6 +6939,11 @@ spu_builtin_vectorization_cost (enum vect_cost_for case cond_branch_taken: return 6; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); +
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote: On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. 
*/ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking
Re: [PATCH] Correct cost model for strided loads
On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Well, would the real cost not be subparts * scalar_to_vec plus subparts * vec_perm? At least vec_perm isn't the cost for building up a vector from N scalar elements either (it might be close enough if subparts == 2). What's the case with facerec here? Does it have subparts == 2? In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. 
If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. I really wanted to pessimize this case for say AVX and char elements, thus building up a vector from 32 scalars which certainly does not cost a mere vec_perm. So, maybe special-case the subparts == 2 case and assume vec_perm would match the cost only in that case. (I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads are converted to 64-bit values in scalar float registers and they have to be coerced back to 32-bit. Very ugly. The cost model currently doesn't represent this at all, which I'll have to look at fixing at some point in some way that isn't too nasty for the other targets. The cost model for PowerPC seems to need a lot of TLC. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-06-10 Bill Schmidt wschm...@linux.ibm.com * tree-vect-stmts.c (vect_model_load_cost): Change cost model for strided loads. Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188341) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. 
*/ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. */ inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); } else vect_get_load_cost (first_dr, ncopies,
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: This patch adds a heuristic to the vectorizer when estimating the minimum profitable number of iterations. The heuristic is target-dependent, and is currently disabled for all targets except PowerPC. However, the intent is to make it general enough to be useful for other targets that want to opt in. A previous patch addressed some PowerPC SPEC degradations by modifying the vector cost model values for vec_perm and vec_promote_demote. The values were set a little higher than their natural values because the natural values were not sufficient to prevent a poor vectorization choice. However, this is not the right long-term solution, since it can unnecessarily constrain other vectorization choices involving permute instructions. Analysis of the badly vectorized loop (in sphinx3) showed that the problem was overcommitment of vector resources -- too many vector instructions issued without enough non-vector instructions available to cover the delays. The vector cost model assumes that instructions always have a constant cost, and doesn't have a way of judging this kind of density of vector instructions. The present patch adds a heuristic to recognize when a loop is likely to overcommit resources, and adds a small penalty to the inside-loop cost to account for the expected stalls. The heuristic is parameterized with three target-specific values: * Density threshold: The heuristic will apply only when the percentage of inside-loop cost attributable to vectorized instructions exceeds this value. * Size threshold: The heuristic will apply only when the inside-loop cost exceeds this value. * Penalty: The inside-loop cost will be increased by this percentage value when the heuristic applies. Thus only reasonably large loop bodies that are mostly vectorized instructions will be affected. 
By applying only a small percentage bump to the inside-loop cost, the heuristic will only turn off vectorization for loops that were considered barely profitable to begin with (such as the sphinx3 loop). So the heuristic is quite conservative and should not affect the vast majority of vectorization decisions. Together with the new heuristic, this patch reduces the vec_perm and vec_promote_demote costs for PowerPC to their natural values. I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu and verified that no performance regressions occur on SPEC cpu2006. Is this ok for trunk? Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. OK, I'm trying to understand how you would want this built into the present structure. Taking just the loop case for now: Judging by your suggested API, we would have to call add_stmt_cost () everywhere that we now call stmt_vinfo_set_inside_of_loop_cost (). 
For now this would be an additional call, not a replacement, though maybe the other goes away eventually. This allows the target to save more data about the vectorized instructions than just an accumulated cost number (order and quantity of various kinds of instructions can be maintained for better modeling). Presumably the call to finish_cost would be done within vect_estimate_min_profitable_iters () to produce the final value of inside_cost for the loop. The default target hook for add_stmt_cost would duplicate what we currently do for calculating the inside_cost of a statement, and the default target hook for finish_cost would just return the sum. I'll have to go hunting where the similar code would fit for SLP in a basic block. If I read you correctly, you don't object to a density heuristic such as the one I implemented here, but you want
Re: [PATCH] Correct cost model for strided loads
On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Well, would the real cost not be subparts * scalar_to_vec plus subparts * vec_perm? At least vec_perm isn't the cost for building up a vector from N scalar elements either (it might be close enough if subparts == 2). What's the case with facerec here? Does it have subparts == 2? In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. I see. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? It's a vector splat, thus x -> { x, x, x, ... }. You can create { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, as VEC_PERM_EXPR, takes two input vectors). That's by far not the most efficient way to build up such a vector of course (with AVX you could do one splat plus N - 1 inserts for example). The cost is of course dependent on the number of vector elements, so a simple new enum vect_cost_for_stmt kind does not cover it but the target would have to look at the vector type passed and do some reasonable guess. Ah, splat! Yes, that's lingo I understand. I see the intent now. 
On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. Looks similar to x86 SSE then. If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. Heh, usually the intent was to make the changes simple, not to compute a proper cost. I think we simply need a new scalars_to_vec cost kind. That works. Maybe vec_construct gets the point across a little better? I think we need to use the full builtin_vectorization_cost interface instead of vect_get_stmt_cost here, so the targets can parameterize on type. Then we can just do one cost calculation for vec_construct that covers the full costs of getting the vector in order after the loads. I really wanted to pessimize this case for say AVX and char elements, thus building up a vector from 32 scalars which certainly does not cost a mere vec_perm. So, maybe special-case the subparts == 2 case and assume vec_perm would match the cost only in that case. 
(I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) Oh, I somehow read the patch as you were removing the multiplication by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted to reflect it with N scalar loads plus N splats plus N - 1 permutes originally. You could also model it with N scalar loads plus N inserts (but we don't have a vec_insert cost either). I think adding a scalars_to_vec or vec_init or however we want to call it - basically what the cost of a vector CONSTRUCTOR would be - is best. (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 16:58 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: This patch adds a heuristic to the vectorizer when estimating the minimum profitable number of iterations. The heuristic is target-dependent, and is currently disabled for all targets except PowerPC. However, the intent is to make it general enough to be useful for other targets that want to opt in. A previous patch addressed some PowerPC SPEC degradations by modifying the vector cost model values for vec_perm and vec_promote_demote. The values were set a little higher than their natural values because the natural values were not sufficient to prevent a poor vectorization choice. However, this is not the right long-term solution, since it can unnecessarily constrain other vectorization choices involving permute instructions. Analysis of the badly vectorized loop (in sphinx3) showed that the problem was overcommitment of vector resources -- too many vector instructions issued without enough non-vector instructions available to cover the delays. The vector cost model assumes that instructions always have a constant cost, and doesn't have a way of judging this kind of density of vector instructions. The present patch adds a heuristic to recognize when a loop is likely to overcommit resources, and adds a small penalty to the inside-loop cost to account for the expected stalls. The heuristic is parameterized with three target-specific values: * Density threshold: The heuristic will apply only when the percentage of inside-loop cost attributable to vectorized instructions exceeds this value. * Size threshold: The heuristic will apply only when the inside-loop cost exceeds this value. * Penalty: The inside-loop cost will be increased by this percentage value when the heuristic applies. 
Thus only reasonably large loop bodies that are mostly vectorized instructions will be affected. By applying only a small percentage bump to the inside-loop cost, the heuristic will only turn off vectorization for loops that were considered barely profitable to begin with (such as the sphinx3 loop). So the heuristic is quite conservative and should not affect the vast majority of vectorization decisions. Together with the new heuristic, this patch reduces the vec_perm and vec_promote_demote costs for PowerPC to their natural values. I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu and verified that no performance regressions occur on SPEC cpu2006. Is this ok for trunk? Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. OK, I'm trying to understand how you would want this built into the present structure. 
Taking just the loop case for now: Judging by your suggested API, we would have to call add_stmt_cost () everywhere that we now call stmt_vinfo_set_inside_of_loop_cost (). For now this would be an additional call, not a replacement, though maybe the other goes away eventually. This allows the target to save more data about the vectorized instructions than just an accumulated cost number (order and quantity of various kinds of instructions can be maintained for better modeling). Presumably the call to finish_cost would
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 11:09 -0400, David Edelsohn wrote: On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther rguent...@suse.de wrote: Well, they are at least magic numbers and heuristics that apply generally and not only to the single issue in sphinx. And in fact how it works for sphinx _is_ magic. Second, I suggest that you need to rephrase "I can make you" and re-send your reply. Sorry for my bad english. Consider it meaning that I'd rather have you think about a more proper solution. That's what patch review is about after all, no? Sometimes a complete re-write (which gets more difficult with each of the patches enhancing the not ideal current state) is the best thing to do. Richard, The values of the heuristics may be magic, but Bill believes the heuristics are testing the important characteristics. The heuristics themselves are controlled by hooks, so the target can set the correct values for their own requirements. The concern is that a general cost infrastructure is too general. And, based on history, all ports simply will copy the boilerplate from the first implementation. It also may cause more problems because the target has relatively little information to be able to judge heuristics at that point in the middle-end. If the targets start to get too cute or too complicated, it may cause more problems or more confusion about why more complicated heuristics are not effective and not producing the expected results. I worry about creating another machine dependent reorg catch-all pass. Maybe an incremental pre- and/or post- cost hook would be more effective. I will let Bill comment. Thanks David, I can see both sides of this, and it's hard to judge the future from where I stand. My belief is that the number of heuristics targets will implement will be fairly limited, since judgments about cycle-level costs are not accurately predictable during the middle end. All we can do is come up with a few things that seem to make sense. 
Doing too much in the back end seems impractical. The interesting question to me is whether cost model heuristics are general enough to be reusable. What I saw in this case was what I considered to be a somewhat target-neutral problem: overwhelming those assets of the processor that implement vectorization. It seemed reasonable to provide hooks for others to use the idea if they encounter similar issues. If reusing the heuristic is useful, then having to copy the logic from one target to another isn't the best approach. If nobody else will ever use it, then embedding it in the back end is reasonable. Unfortunately my crystal ball has been on the fritz for several decades, so I can't tell you for sure which is right... Richard, my biggest question is whether you think other targets are likely to take advantage of a more general back-end interface, or whether this will end up just being a PowerPC wart. If you know of ways this will be useful for i386, that would be helpful to know. Perhaps this requires your crystal ball as well; not sure how well yours works... If we look at just this one issue in isolation, then changing all the code in the vectorizer that calculates inside/outside loop costs and moving it to targetm seems more invasive than adding the few hooks. But if this will really be a useful feature for the community as a whole I am certainly willing to tackle it. Thanks, Bill Thanks, David
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 14:59 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote: On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. 
Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. 
+ + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 12:11 -0500, William J. Schmidt wrote: I found this parameter that seems to correspond to well-predicted conditional jumps: /* When branch is predicted to be taken with probability lower than this threshold (in percent), then it is considered well predictable. */ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME, "predictable-branch-outcome", "Maximal estimated outcome of branch considered predictable", 2, 0, 50) ...which has an interface predictable_edge_p () in predict.c, so that's what I'll use. Thanks, Bill
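The gating idea above can be sketched as a small helper. This is a hypothetical stand-in for predict.c's predictable_edge_p, not the actual GCC implementation: an edge whose estimated taken probability falls at or below the threshold (or at or above 100 minus it) counts as well predicted, and load hoisting is then skipped because the branch is already cheap.

```c
/* Hypothetical sketch of the predictability test.  Probabilities are
   in percent; a threshold of 2 mirrors the default value of the
   predictable-branch-outcome param quoted above.  */
int edge_predictable_p (int taken_probability_pct, int threshold_pct)
{
  /* Either almost never taken or almost always taken.  */
  return taken_probability_pct <= threshold_pct
         || taken_probability_pct >= 100 - threshold_pct;
}

/* The hoisting transformation would be gated roughly like this:
   do_hoist = !edge_predictable_p (then_prob, 2)
              && !edge_predictable_p (else_prob, 2);  */
```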
Re: [PATCH] Hoist adjacent loads
OK, once more with feeling... :) This patch differs from the previous one in two respects: It disables the optimization when either the then or else edge is well-predicted; and it now uses the existing l1-cache-line-size parameter instead of a new one (with updated commentary). Bootstraps and tests with no new regressions on powerpc64-unknown-linux-gnu. One last performance run is underway, but I don't expect any surprises since both changes are more conservative. The original benchmark issue is still resolved. Is this version ok for trunk? Thanks, Bill 2012-06-11 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist-adjacent-loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. Index: gcc/opts.c === --- gcc/opts.c (revision 188390) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 188390) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); +
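At the source level, the transformation described in the comment above amounts to the following. The struct and function names here are hypothetical, chosen only to illustrate the idiom; the key point is that left and right are adjacent pointer fields, so if one load is safe the other (on the same or an adjacent cache line) is very likely safe too, and the select can become a conditional move (x86 CMOV, PowerPC ISEL).

```c
#include <stddef.h>

/* Hypothetical tree node: "left" and "right" are adjacent pointer
   fields, matching the (expr).field1 / (expr).field2 pattern.  */
struct node {
  struct node *left;
  struct node *right;
};

/* Original idiom: exactly one load executes, forcing a branch.  */
struct node *pick_branchy (struct node *y, int cond)
{
  struct node *x;
  if (cond)
    x = y->left;
  else
    x = y->right;
  return x;
}

/* After adjacent-load hoisting: both loads execute unconditionally,
   leaving a branch-free select that can map to cmove/isel.  */
struct node *pick_hoisted (struct node *y, int cond)
{
  struct node *x1 = y->left;   /* speculative relative to the original */
  struct node *x2 = y->right;
  return cond ? x1 : x2;
}
```

Both versions compute the same value; the hoisted form trades one extra load for the removal of a hard-to-predict branch.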
[PATCH] Correct cost model for strided loads
The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Thanks, Bill 2012-06-10 Bill Schmidt wschm...@linux.ibm.com * tree-vect-stmts.c (vect_model_load_cost): Change cost model for strided loads. Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188341) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. */ inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); } else vect_get_load_cost (first_dr, ncopies,
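The one-line cost change in the patch above can be sketched as follows. The function name and unit costs are illustrative, not GCC's actual vect_get_stmt_cost values; the point is the shape of the formula: N scalar loads per vector copy, plus one permute (rather than a scalar_to_vec) per copy to gather the loaded scalars into a vector.

```c
/* Illustrative model of the revised strided-load cost.
   ncopies          - number of vector copies generated
   nunits           - TYPE_VECTOR_SUBPARTS of the vector type
   scalar_load_cost - assumed per-statement cost of a scalar load
   vec_perm_cost    - assumed per-statement cost of a vector permute  */
int strided_load_cost (int ncopies, int nunits,
                       int scalar_load_cost, int vec_perm_cost)
{
  /* The loads themselves: N scalar loads per copy.  */
  int cost = scalar_load_cost * ncopies * nunits;
  /* Gathering them into a vector, now modeled as a permute per copy
     instead of the previous scalar_to_vec.  */
  cost += ncopies * vec_perm_cost;
  return cost;
}
```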
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-04 at 08:45 -0500, William J. Schmidt wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Final performance tests are complete and show no degradations on SPEC cpu2006 on powerpc64-unknown-linux-gnu. Is the patch ok for trunk? Thanks! Bill Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. 
Index: gcc/opts.c === --- gcc/opts.c(revision 187805) +++ gcc/opts.c(working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. 
+ + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom
Re: [PATCH] Hoist adjacent pointer loads
Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. 
*/ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to
Re: [PATCH] Hoist adjacent pointer loads
On Wed, 2012-05-23 at 13:25 +0200, Richard Guenther wrote: On Tue, 22 May 2012, William J. Schmidt wrote: Here's a revision of the hoist-adjacent-loads patch. Besides hopefully addressing all your comments, I added a gate of at least -O2 for this transformation. Let me know if you prefer a different minimum opt level. I'm still running SPEC tests to make sure there are no regressions when opening this up to non-pointer arguments. The code bootstraps on powerpc64-unknown-linux-gnu with no regressions. Assuming the SPEC numbers come out as expected, is this ok? Thanks, Bill 2012-05-22 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187728) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */ @@ -227,9 +282,11 @@ static tree condstoretemp; /* The core routine of conditional store replacement and normal phi optimizations. Both share much of the infrastructure in how to match applicable basic block
Re: [PATCH] Hoist adjacent pointer loads
Here's a revision of the hoist-adjacent-loads patch. Besides hopefully addressing all your comments, I added a gate of at least -O2 for this transformation. Let me know if you prefer a different minimum opt level. I'm still running SPEC tests to make sure there are no regressions when opening this up to non-pointer arguments. The code bootstraps on powerpc64-unknown-linux-gnu with no regressions. Assuming the SPEC numbers come out as expected, is this ok? Thanks, Bill 2012-05-22 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187728) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */ @@ -227,9 +282,11 @@ static tree condstoretemp; /* The core routine of conditional store replacement and normal phi optimizations. Both share much of the infrastructure in how to match applicable basic block patterns. DO_STORE_ELIM is true - when we want to do conditional store replacement, false otherwise. */ + when we want to do conditional store replacement, false otherwise. + DO_HOIST_LOADS is true when we want to hoist adjacent loads out + of diamond control flow patterns, false otherwise. */ static unsigned int
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-05-21 at 14:17 +0200, Richard Guenther wrote: On Thu, May 3, 2012 at 4:33 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt
[PATCH, rs6000] Fix PR53385
This repairs the bootstrap issue due to unsafe signed overflow assumptions. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-05-18 Bill Schmidt wschm...@linux.vnet.ibm.com * config/rs6000/rs6000.c (print_operand): Revise code that unsafely relied on signed overflow behavior. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 187651) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -14679,7 +14679,6 @@ void print_operand (FILE *file, rtx x, int code) { int i; - HOST_WIDE_INT val; unsigned HOST_WIDE_INT uval; switch (code) @@ -15120,34 +15119,17 @@ print_operand (FILE *file, rtx x, int code) case 'W': /* MB value for a PowerPC64 rldic operand. */ - val = (GET_CODE (x) == CONST_INT -? INTVAL (x) : CONST_DOUBLE_HIGH (x)); + i = clz_hwi (GET_CODE (x) == CONST_INT + ? INTVAL (x) : CONST_DOUBLE_HIGH (x)); - if (val < 0) - i = -1; - else - for (i = 0; i < HOST_BITS_PER_WIDE_INT; i++) - if ((val <<= 1) < 0) - break; - #if HOST_BITS_PER_WIDE_INT == 32 - if (GET_CODE (x) == CONST_INT && i >= 0) + if (GET_CODE (x) == CONST_INT && i > 0) i += 32; /* zero-extend high-part was all 0's */ else if (GET_CODE (x) == CONST_DOUBLE && i == 32) - { - val = CONST_DOUBLE_LOW (x); - - gcc_assert (val); - if (val < 0) - --i; - else - for ( ; i < 64; i++) - if ((val <<= 1) < 0) - break; - } + i = clz_hwi (CONST_DOUBLE_LOW (x)) + 32; #endif - fprintf (file, "%d", i + 1); + fprintf (file, "%d", i); return; case 'x':
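The undefined behavior being removed is the old shift-until-negative loop: repeatedly left-shifting a signed HOST_WIDE_INT until the sign bit sets is signed overflow, which the compiler may assume never happens. clz_hwi computes the same leading-zero count without that hazard. The sketch below is not GCC's code: it rewrites the loop over an unsigned 64-bit value (so the sketch itself is well defined) and checks it against __builtin_clzll, a plausible stand-in for what clz_hwi maps to on a 64-bit host.

```c
/* Leading-zero count by shifting, as the removed loop did, but using
   unsigned arithmetic so the sketch avoids the signed-overflow UB.  */
int clz_loop (unsigned long long v)
{
  int n = 0;
  if (v == 0)
    return 64;                /* degenerate case; clz is host-defined */
  while (!(v >> 63))          /* until the "sign" bit position is set */
    {
      v <<= 1;
      n++;
    }
  return n;
}
```

The equivalence clz_loop (x) == __builtin_clzll (x) for nonzero x is what lets the patch replace the loop (and the i + 1 bookkeeping around it) with a single clz_hwi call.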
[PATCH] Simplify attempt_builtin_powi logic
This patch gives up on using the reassociation rank algorithm to correctly place __builtin_powi calls and their feeding multiplies. In the end this proved to introduce more complexity than it saved, due in part to the poor fit of introducing DAG expressions into the reassociated operand tree. This patch returns to generating explicit multiplies to bind the builtin calls together and to the results of the expression tree rewrite. I feel this version is smaller, easier to understand, and less fragile than the existing code. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-05-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (bip_map): Remove decl. (completely_remove_stmt): Remove function. (remove_def_if_absorbed_call): Remove function. (remove_visited_stmt_chain): Remove __builtin_powi handling. (possibly_move_powi): Remove function. (rewrite_expr_tree): Remove calls to possibly_move_powi. (rewrite_expr_tree_parallel): Likewise. (attempt_builtin_powi): Build multiplies explicitly rather than relying on the ops vector and rank system. (transform_stmt_to_copy): New function. (transform_stmt_to_multiply): Likewise. (reassociate_bb): Handle leftover operations after __builtin_powi optimization; build a final multiply if necessary. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 187626) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -200,10 +200,6 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; -/* Map from inserted __builtin_powi calls to multiply chains that - feed them. */ -static struct pointer_map_t *bip_map; - /* Forward decls. */ static long get_rank (tree); @@ -2184,32 +2180,6 @@ is_phi_for_stmt (gimple stmt, tree operand) return false; } -/* Remove STMT, unlink its virtual defs, and release its SSA defs. 
*/ - -static inline void -completely_remove_stmt (gimple stmt) -{ - gimple_stmt_iterator gsi = gsi_for_stmt (stmt); - gsi_remove (gsi, true); - unlink_stmt_vdef (stmt); - release_defs (stmt); -} - -/* If OP is defined by a builtin call that has been absorbed by - reassociation, remove its defining statement completely. */ - -static inline void -remove_def_if_absorbed_call (tree op) -{ - gimple stmt; - - if (TREE_CODE (op) == SSA_NAME - has_zero_uses (op) - is_gimple_call ((stmt = SSA_NAME_DEF_STMT (op))) - gimple_visited_p (stmt)) -completely_remove_stmt (stmt); -} - /* Remove def stmt of VAR if VAR has zero uses and recurse on rhs1 operand if so. */ @@ -2218,7 +2188,6 @@ remove_visited_stmt_chain (tree var) { gimple stmt; gimple_stmt_iterator gsi; - tree var2; while (1) { @@ -2228,95 +2197,15 @@ remove_visited_stmt_chain (tree var) if (is_gimple_assign (stmt) gimple_visited_p (stmt)) { var = gimple_assign_rhs1 (stmt); - var2 = gimple_assign_rhs2 (stmt); gsi = gsi_for_stmt (stmt); gsi_remove (gsi, true); release_defs (stmt); - /* A multiply whose operands are both fed by builtin pow/powi -calls must check whether to remove rhs2 as well. */ - remove_def_if_absorbed_call (var2); } - else if (is_gimple_call (stmt) gimple_visited_p (stmt)) - { - completely_remove_stmt (stmt); - return; - } else return; } } -/* If OP is an SSA name, find its definition and determine whether it - is a call to __builtin_powi. If so, move the definition prior to - STMT. Only do this during early reassociation. 
*/ - -static void -possibly_move_powi (gimple stmt, tree op) -{ - gimple stmt2, *mpy; - tree fndecl; - gimple_stmt_iterator gsi1, gsi2; - - if (!first_pass_instance - || !flag_unsafe_math_optimizations - || TREE_CODE (op) != SSA_NAME) -return; - - stmt2 = SSA_NAME_DEF_STMT (op); - - if (!is_gimple_call (stmt2) - || !has_single_use (gimple_call_lhs (stmt2))) -return; - - fndecl = gimple_call_fndecl (stmt2); - - if (!fndecl - || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) -return; - - switch (DECL_FUNCTION_CODE (fndecl)) -{ -CASE_FLT_FN (BUILT_IN_POWI): - break; -default: - return; -} - - /* Move the __builtin_powi. */ - gsi1 = gsi_for_stmt (stmt); - gsi2 = gsi_for_stmt (stmt2); - gsi_move_before (gsi2, gsi1); - - /* See if there are multiplies feeding the __builtin_powi base - argument that must also be moved. */ - while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) -{ - /* If we've already moved this statement, we're done. This is - identified by a NULL entry for the statement in bip_map. */ - gimple *next = (gimple *)
Re: PING: [PATCH] Fix PR53217
On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote: On Tue, 15 May 2012, William J. Schmidt wrote: Ping. I don't like it too much - but pondering a bit over it I can't find a nicer solution. So, ok. Thanks, Richard. Agreed. I'm not fond of it either, and I feel it's a bit fragile. An alternative would be to go back to handling the exponentiation expressions outside of the ops list (generating an explicit multiply to hook them up with the results of normal linear/parallel expansion). In hindsight, placing the exponentiation results in the ops list and letting the rank order handle things introduces some complexity as well as saving some. The DAG'd nature of the exponentiation expressions isn't a perfect fit for the pure tree form of the reassociated multiplies. Let me know if you'd like me to pursue that instead. Thanks, Bill Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! 
This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. + +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before
Re: PING: [PATCH] Fix PR53217
On Wed, 2012-05-16 at 14:05 +0200, Richard Guenther wrote: On Wed, 16 May 2012, William J. Schmidt wrote: On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote: On Tue, 15 May 2012, William J. Schmidt wrote: Ping. I don't like it too much - but pondering a bit over it I can't find a nicer solution. So, ok. Thanks, Richard. Agreed. I'm not fond of it either, and I feel it's a bit fragile. An alternative would be to go back to handling the exponentiation expressions outside of the ops list (generating an explicit multiply to hook them up with the results of normal linear/parallel expansion). In hindsight, placing the exponentiation results in the ops list and letting the rank order handle things introduces some complexity as well as saving some. The DAG'd nature of the exponentiation expressions isn't a perfect fit for the pure tree form of the reassociated multiplies. True. Let me know if you'd like me to pursue that instead. You can try - if the result looks better I'm all for it ;) OK. :) I'll commit this for now to deal with the fallout, and work on the alternative version in my spare time. Thanks, Bill Thanks, Richard. Thanks, Bill Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. 
gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. + +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1
Ping: [PATCH] Hoist adjacent pointer loads
Ping. Thanks, Bill On Thu, 2012-05-03 at 09:33 -0500, William J. Schmidt wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.
If not see #include "cfgloop.h" #include "tree-data-ref.h" #include "tree-pretty-print.h" +#include "gimple-pretty-print.h" +#include "insn-config.h" +#include "expr.h" +#include "optabs.h" +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI <x' (bb0), ...>; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + ---------------------- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (<expr>).field1; + goto bb3; + bb2: + x2 = (<expr>).field2; + bb3: + # x = PHI <x1, x2>; + + with + + bb0: + x1 = (<expr>).field1; + x2 = (<expr>).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI <x1, x2>; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL.
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional
Re: [PATCH][1/n] Improve vectorization in PR53355
On Tue, 2012-05-15 at 14:17 +0200, Richard Guenther wrote: This is the first patch to make the generated code for the testcase in PR53355 better. It teaches VRP about LSHIFT_EXPRs (albeit only of a very simple form). Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. This appears to have caused http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53385. Thanks, Bill Richard. 2012-05-15 Richard Guenther rguent...@suse.de PR tree-optimization/53355 * tree-vrp.c (extract_range_from_binary_expr_1): Handle LSHIFT_EXPRs by constants. * gcc.dg/tree-ssa/vrp67.c: New testcase. Index: gcc/tree-vrp.c === *** gcc/tree-vrp.c(revision 187503) --- gcc/tree-vrp.c(working copy) *** extract_range_from_binary_expr_1 (value_ *** 2403,2408 --- 2403,2409 && code != ROUND_DIV_EXPR && code != TRUNC_MOD_EXPR && code != RSHIFT_EXPR + && code != LSHIFT_EXPR && code != MIN_EXPR && code != MAX_EXPR && code != BIT_AND_EXPR *** extract_range_from_binary_expr_1 (value_ *** 2596,2601 --- 2597,2636 extract_range_from_multiplicative_op_1 (vr, code, vr0, vr1); return; } + else if (code == LSHIFT_EXPR) + { + /* If we have a LSHIFT_EXPR with any shift values outside [0..prec-1], + then drop to VR_VARYING. Outside of this range we get undefined + behavior from the shift operation. We cannot even trust + SHIFT_COUNT_TRUNCATED at this stage, because that applies to rtl + shifts, and the operation at the tree level may be widened. */ + if (vr1.type != VR_RANGE + || !value_range_nonnegative_p (vr1) + || TREE_CODE (vr1.max) != INTEGER_CST + || compare_tree_int (vr1.max, TYPE_PRECISION (expr_type) - 1) == 1) + { + set_value_range_to_varying (vr); + return; + } + + /* We can map shifts by constants to MULT_EXPR handling.
*/ + if (range_int_cst_singleton_p (vr1)) + { + value_range_t vr1p = { VR_RANGE, NULL_TREE, NULL_TREE, NULL }; + vr1p.min + = double_int_to_tree (expr_type, + double_int_lshift (double_int_one, + TREE_INT_CST_LOW (vr1.min), + TYPE_PRECISION (expr_type), + false)); + vr1p.max = vr1p.min; + extract_range_from_multiplicative_op_1 (vr, MULT_EXPR, vr0, vr1p); + return; + } + + set_value_range_to_varying (vr); + return; + } else if (code == TRUNC_DIV_EXPR || code == FLOOR_DIV_EXPR || code == CEIL_DIV_EXPR Index: gcc/testsuite/gcc.dg/tree-ssa/vrp67.c === *** gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0) --- gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0) *** *** 0 --- 1,38 + /* { dg-do compile } */ + /* { dg-options "-O2 -fdump-tree-vrp1" } */ + + unsigned foo (unsigned i) + { + if (i == 2) + { + i = i << 2; + if (i != 8) + link_error (); + } + return i; + } + unsigned bar (unsigned i) + { + if (i == 1 << (sizeof (unsigned) * 8 - 1)) + { + i = i << 1; + if (i != 0) + link_error (); + } + return i; + } + unsigned baz (unsigned i) + { + i = i & 15; + if (i == 0) + return 0; + i = 1000 - i; + i <<= 1; + i >>= 1; + if (i == 0) + link_error (); + return i; + } + + /* { dg-final { scan-tree-dump-times "Folding predicate" 3 "vrp1" } } */ + /* { dg-final { cleanup-tree-dump "vrp1" } } */
PING: [PATCH] Fix PR53217
Ping. Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. 
+ +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); + + /* The moved multiply may be DAG'd from multiple calls if it + was the result of a cached multiply. Only move it once. + Rank order ensures we move it to the right place the first + time. */ + if (next) + *next = NULL; + else + { + next = (gimple *) pointer_map_insert (bip_map, *mpy); + *next = NULL; + } +} } /* This function checks three consequtive operands in @@ -3281,6 +3315,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent while (true) { HOST_WIDE_INT power; + gimple last_mul = NULL; /* First look for the largest cached product of factors from preceding iterations. If found, create a builtin_powi for @@ -3318,16 +3353,25 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent } else { + gimple *value; + iter_result = get_reassoc_pow_ssa_name (target, type); pow_stmt = gimple_build_call (powi_fndecl, 2, rf1-repr
[PATCH, 4.7] Backport fix to [un]signed_type_for
Backporting this patch to 4.7 fixes a problem building Fedora 17. Bootstrapped and regression tested on powerpc64-unknown-linux-gnu. Is the backport OK? Thanks, Bill 2012-05-10 Bill Schmidt wschm...@vnet.linux.ibm.com Backport from trunk: 2012-03-12 Richard Guenther rguent...@suse.de * tree.c (signed_or_unsigned_type_for): Use build_nonstandard_integer_type. (signed_type_for): Adjust documentation. (unsigned_type_for): Likewise. * tree-pretty-print.c (dump_generic_node): Use standard names for non-standard integer types if available. Index: gcc/tree-pretty-print.c === --- gcc/tree-pretty-print.c (revision 187368) +++ gcc/tree-pretty-print.c (working copy) @@ -723,11 +723,41 @@ dump_generic_node (pretty_printer *buffer, tree no } else if (TREE_CODE (node) == INTEGER_TYPE) { - pp_string (buffer, (TYPE_UNSIGNED (node) - ? unnamed-unsigned: - : unnamed-signed:)); - pp_decimal_int (buffer, TYPE_PRECISION (node)); - pp_string (buffer, ); + if (TYPE_PRECISION (node) == CHAR_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned char + : signed char)); + else if (TYPE_PRECISION (node) == SHORT_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned short + : signed short)); + else if (TYPE_PRECISION (node) == INT_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned int + : signed int)); + else if (TYPE_PRECISION (node) == LONG_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned long + : signed long)); + else if (TYPE_PRECISION (node) == LONG_LONG_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned long long + : signed long long)); + else if (TYPE_PRECISION (node) = CHAR_TYPE_SIZE + exact_log2 (TYPE_PRECISION (node))) + { + pp_string (buffer, (TYPE_UNSIGNED (node) ? uint : int)); + pp_decimal_int (buffer, TYPE_PRECISION (node)); + pp_string (buffer, _t); + } + else + { + pp_string (buffer, (TYPE_UNSIGNED (node) + ? 
unnamed-unsigned: + : unnamed-signed:)); + pp_decimal_int (buffer, TYPE_PRECISION (node)); + pp_string (buffer, ); + } } else if (TREE_CODE (node) == COMPLEX_TYPE) { Index: gcc/tree.c === --- gcc/tree.c (revision 187368) +++ gcc/tree.c (working copy) @@ -10162,32 +10162,26 @@ widest_int_cst_value (const_tree x) return val; } -/* If TYPE is an integral type, return an equivalent type which is -unsigned iff UNSIGNEDP is true. If TYPE is not an integral type, -return TYPE itself. */ +/* If TYPE is an integral or pointer type, return an integer type with + the same precision which is unsigned iff UNSIGNEDP is true, or itself + if TYPE is already an integer type of signedness UNSIGNEDP. */ tree signed_or_unsigned_type_for (int unsignedp, tree type) { - tree t = type; - if (POINTER_TYPE_P (type)) -{ - /* If the pointer points to the normal address space, use the -size_type_node. Otherwise use an appropriate size for the pointer -based on the named address space it points to. */ - if (!TYPE_ADDR_SPACE (TREE_TYPE (t))) - t = size_type_node; - else - return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp); -} + if (TREE_CODE (type) == INTEGER_TYPE TYPE_UNSIGNED (type) == unsignedp) +return type; - if (!INTEGRAL_TYPE_P (t) || TYPE_UNSIGNED (t) == unsignedp) -return t; + if (!INTEGRAL_TYPE_P (type) + !POINTER_TYPE_P (type)) +return NULL_TREE; - return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp); + return build_nonstandard_integer_type (TYPE_PRECISION (type), unsignedp); } -/* Returns unsigned variant of TYPE. */ +/* If TYPE is an integral or pointer type, return an integer type with + the same precision which is unsigned, or itself if TYPE is already an + unsigned integer type. */ tree unsigned_type_for
Re: [PATCH, 4.7] Backport fix to [un]signed_type_for
On Thu, 2012-05-10 at 18:49 +0200, Jakub Jelinek wrote: On Thu, May 10, 2012 at 11:44:27AM -0500, William J. Schmidt wrote: Backporting this patch to 4.7 fixes a problem building Fedora 17. Bootstrapped and regression tested on powerpc64-unknown-linux-gnu. Is the backport OK? For 4.7 I'd very much prefer a less intrusive change (i.e. change the java langhook) instead, but I'll defer to Richard if he prefers this over that. OK. If that's desired, this is the possible change to the langhook: Index: gcc/java/typeck.c === --- gcc/java/typeck.c (revision 187158) +++ gcc/java/typeck.c (working copy) @@ -189,6 +189,12 @@ java_type_for_size (unsigned bits, int unsignedp) return unsignedp ? unsigned_int_type_node : int_type_node; if (bits <= TYPE_PRECISION (long_type_node)) return unsignedp ? unsigned_long_type_node : long_type_node; + /* A 64-bit target with TImode requires 128-bit type definitions + for bitsizetype. */ + if (int128_integer_type_node + && bits == TYPE_PRECISION (int128_integer_type_node)) +return (unsignedp ? int128_unsigned_type_node + : int128_integer_type_node); return 0; } which also fixed the problem and bootstraps without regressions. Whichever you guys prefer is fine with me. Thanks, Bill 2012-05-10 Bill Schmidt wschm...@vnet.linux.ibm.com Backport from trunk: 2012-03-12 Richard Guenther rguent...@suse.de * tree.c (signed_or_unsigned_type_for): Use build_nonstandard_integer_type. (signed_type_for): Adjust documentation. (unsigned_type_for): Likewise. * tree-pretty-print.c (dump_generic_node): Use standard names for non-standard integer types if available. Jakub
[PATCH] Fix PR53217
This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. 
+ +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 187117) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next && !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); + + /* The moved multiply may be DAG'd from multiple calls if it +was the result of a cached multiply. Only move it once. +Rank order ensures we move it to the right place the first +time. */ + if (next) + *next = NULL; + else + { + next = (gimple *) pointer_map_insert (bip_map, *mpy); + *next = NULL; + } +} } /* This function checks three consequtive operands in @@ -3281,6 +3315,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent while (true) { HOST_WIDE_INT power; + gimple last_mul = NULL; /* First look for the largest cached product of factors from preceding iterations. If found, create a builtin_powi for @@ -3318,16 +3353,25 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent } else { + gimple *value; + iter_result = get_reassoc_pow_ssa_name (target, type); pow_stmt = gimple_build_call (powi_fndecl, 2, rf1->repr, build_int_cst (integer_type_node, power)); gimple_call_set_lhs (pow_stmt,
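A source-level sketch of what is at stake in this patch: when reassociation rewrites repeated factors into `__builtin_powi` calls, any multiply that computes the call's *base* must be emitted before the call that uses it, or verify_ssa fails. The helper below is a hypothetical stand-in for `__builtin_powi`, not GCC's internal API; the two functions show the before/after shape of the PR53217-style expression.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for __builtin_powi: integer exponentiation by
   repeated multiplication.  */
static double powi_sketch(double base, int exp)
{
    double result = 1.0;
    for (int i = 0; i < exp; i++)
        result *= base;
    return result;
}

/* An expression with repeated factors of P and Q, the shape that
   triggered PR53217.  */
static double original(double c1, double c2, double c3, double p, double q)
{
    return c1 + 2.0 * p * c2 + 3.0 * p * p * q * q * q * c3;
}

/* After reassociation: p*q is a multiply feeding the powi base, so its
   definition must dominate the powi call -- exactly the ordering the
   bip_map bookkeeping preserves.  (p*q)^2 * q == p^2 * q^3.  */
static double reassociated(double c1, double c2, double c3, double p, double q)
{
    double pq = p * q;                       /* feeding multiply */
    return c1 + 2.0 * p * c2 + 3.0 * powi_sketch(pq, 2) * q * c3;
}
```

Both functions compute the same value; only the multiply chain differs.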
[PATCH] Hoist adjacent pointer loads
This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /*
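The transformation described in the comment block above can be written out as plain C. Both functions below return the same pointer; the hoisted form performs both loads unconditionally, which is what lets the branch collapse into a conditional move. The struct layout (two strictly adjacent pointer fields) is the exact shape the pass looks for; names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Tree node with strictly adjacent pointer fields.  */
struct node {
    struct node *left;   /* offset 0 */
    struct node *right;  /* offset sizeof(void *) */
};

/* The idiom before the transformation: only one load executes.  */
static struct node *pick_branchy(const struct node *y, int cond)
{
    struct node *x;
    if (cond)
        x = y->left;
    else
        x = y->right;
    return x;
}

/* After hoisting: both loads execute, then a conditional select.  This
   is safe to speculate because left and right fall within one 16-byte
   naturally aligned block on common 64-bit targets, so the extra load
   cannot touch a new page.  */
static struct node *pick_hoisted(const struct node *y, int cond)
{
    struct node *x1 = y->left;
    struct node *x2 = y->right;
    return cond ? x1 : x2;
}
```

A compiler can now expand the ternary in `pick_hoisted` as a branch-free conditional move (ISEL/CMOVE) rather than a conditional jump.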
Re: [PATCH] Hoist adjacent pointer loads
On Thu, 2012-05-03 at 09:40 -0600, Jeff Law wrote: On 05/03/2012 08:33 AM, William J. Schmidt wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Doesn't this need to be conditionalized on the memory model that's currently active? Yes and no. What's important is that you don't want to introduce page faults (or less urgently, cache misses) by speculating the load. So the patch is currently extremely constrained, and likely will always stay that way. Only fields that are pointers and that are strictly adjacent are hoisted, and only if they're in the same 16-byte block. (The number 16 is a parameter that can be adjusted.) Hopefully I didn't miss your point -- let me know if I did and I'll try again. :) Thanks, Bill jeff
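The "same 16-byte block" constraint Bill describes can be sketched as a pure offset check. This is illustrative only: the real pass works on GIMPLE field references and the block size comes from a tunable parameter (PARAM_MIN_CMOVE_STRUCT_ALIGN in the patch), but the arithmetic below captures the safety condition — if both fields fit in one aligned 16-byte block, speculating the second load cannot fault on a new page.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the adjacency/safety test: accept two field accesses of
   WIDTH bytes at offsets OFF1 and OFF2 only if every byte of both
   fields lies in the same aligned 16-byte block.  */
static int same_16_byte_block(size_t off1, size_t off2, size_t width)
{
    return off1 / 16 == (off1 + width - 1) / 16   /* field 1 doesn't straddle */
        && off2 / 16 == (off2 + width - 1) / 16   /* field 2 doesn't straddle */
        && off1 / 16 == off2 / 16;                /* same block */
}
```

For a 64-bit tree node, `left` at offset 0 and `right` at offset 8 pass the check; fields split across a 16-byte boundary do not.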
Re: [PATCH] Hoist adjacent pointer loads
On Thu, 2012-05-03 at 11:44 -0600, Jeff Law wrote: On 05/03/2012 10:47 AM, William J. Schmidt wrote: Yes and no. What's important is that you don't want to introduce page faults (or less urgently, cache misses) by speculating the load. So the patch is currently extremely constrained, and likely will always stay that way. Only fields that are pointers and that are strictly adjacent are hoisted, and only if they're in the same 16-byte block. (The number 16 is a parameter that can be adjusted.) Hopefully I didn't miss your point -- let me know if I did and I'll try again. :) You missed the point :-) Under the C++11 memory model you can't introduce new data races on objects which might be visible to multiple threads. This requirement can restrict speculation in many cases. Furthermore, it sounds like C11 will have similar constraints. I believe there's a wiki page which touches on these kinds of issues. That doesn't mean we can't ever do the optimization, just that we have to be more careful than we have in the past when mucking around with memory optimizations. OK, thanks! Looks like I have some reading to do about the new memory models. However, from the wiki page I see: A speculative load whose results are thrown away is considered not to have changed the semantics of the program, and is therefore allowed. That seems to cover the case here: the load is hoisted, but if the path where it was originally loaded is not executed, its result is discarded. If needed, though, what flags/detection mechanisms are available for determining that the load speculation should be disabled? Thanks, Bill jeff
Re: [PATCH] Improve COND_EXPR expansion
On Mon, 2012-04-30 at 20:22 -0700, Andrew Pinski wrote: Hi, This patch improves the expansion of COND_EXPR into RTL, directly using conditional moves. I had to fix a bug in the x86 backend where emit_conditional_move could cause a crash, as we had a comparison mode of DImode which is not handled by the 32-bit part; can_conditionally_move_p returned true because we had SImode for the other operands. Note that other targets might need a similar fix to the x86 one, but I could not test those targets, and this is really the first time emit_conditional_move is being called with different modes for the comparison and the other operands where the comparison mode is not of the CC class. Hi Andrew, I verified your patch on powerpc64-unknown-linux-gnu. There were no new testcase regressions, and SPEC cpu2006 built ok with your changes. Hope this helps! Bill The main reason to do this conversion early rather than waiting for if-conversion is that the resulting code is slightly better. Also the compiler is slightly faster. OK? Bootstrapped and tested on both mips64-linux-gnu (where it was originally written) and x86_64-linux-gnu. Thanks, Andrew Pinski ChangeLog: * expr.c (convert_tree_comp_to_rtx): New function. (expand_expr_real_2): Try using conditional moves for COND_EXPRs if they exist. * config/i386/i386.c (ix86_expand_int_movcc): Disallow comparison modes of DImode for 32bits and TImode.
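What a conditional-move expansion buys is a branch-free select: `dest = cond ? a : b` computed with no control flow. A portable bit-mask rendition of that operation is shown below — an illustration of what CMOV-class instructions compute in one step, not the RTL the patch emits.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select for 32-bit values: the mask is all ones when COND
   is nonzero and all zeros otherwise, so exactly one operand survives.
   Hardware conditional moves (x86 CMOV, PowerPC ISEL, MIPS MOVN/MOVZ)
   do this in a single instruction.  */
static uint32_t cmove_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)-(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Expanding COND_EXPR this way at RTL generation time avoids relying on the later if-conversion pass to rediscover the pattern.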
[Patch ping] Strength reduction
Thought I'd ping http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01225.html since it's been about six weeks. Any initial feedback would be very much appreciated! Thanks, Bill
[PATCH, powerpc] Fix PR47197
This fixes an error wherein a nontrivial expression passed to an Altivec built-in results in an ICE, following Joseph Myers's suggested approach in the bugzilla. Bootstrapped and tested with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-24 Bill Schmidt wschm...@linux.vnet.ibm.com PR target/47197 * config/rs6000/rs6000-c.c (fully_fold_convert): New function. (altivec_build_resolved_builtin): Call fully_fold_convert. gcc/testsuite: 2012-04-24 Bill Schmidt wschm...@linux.vnet.ibm.com PR target/47197 * gcc.target/powerpc/pr47197.c: New test. Index: gcc/testsuite/gcc.target/powerpc/pr47197.c === --- gcc/testsuite/gcc.target/powerpc/pr47197.c (revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr47197.c (revision 0) @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-maltivec" } */ + +/* Compile-only test to ensure that expressions can be passed to + Altivec builtins without error. */ + +#include <altivec.h> + +void func(unsigned char *buf, unsigned len) +{ +vec_dst(buf, (len >= 256 ? 0 : len) | 512, 2); +} Index: gcc/config/rs6000/rs6000-c.c === --- gcc/config/rs6000/rs6000-c.c(revision 186761) +++ gcc/config/rs6000/rs6000-c.c(working copy) @@ -3421,6 +3421,22 @@ rs6000_builtin_type_compatible (tree t, int id) } +/* In addition to calling fold_convert for EXPR of type TYPE, also + call c_fully_fold to remove any C_MAYBE_CONST_EXPRs that could be + hiding there (PR47197). */ + +static tree +fully_fold_convert (tree type, tree expr) +{ + tree result = fold_convert (type, expr); + bool maybe_const = true; + + if (!c_dialect_cxx ()) +result = c_fully_fold (result, false, &maybe_const); + + return result; +} + /* Build a tree for a function call to an Altivec non-overloaded builtin. The overloaded builtin that matched the types and args is described by DESC. The N arguments are given in ARGS, respectively.
@@ -3470,18 +3486,18 @@ altivec_build_resolved_builtin (tree *args, int n, break; case 1: call = build_call_expr (impl_fndecl, 1, - fold_convert (arg_type[0], args[0])); + fully_fold_convert (arg_type[0], args[0])); break; case 2: call = build_call_expr (impl_fndecl, 2, - fold_convert (arg_type[0], args[0]), - fold_convert (arg_type[1], args[1])); + fully_fold_convert (arg_type[0], args[0]), + fully_fold_convert (arg_type[1], args[1])); break; case 3: call = build_call_expr (impl_fndecl, 3, - fold_convert (arg_type[0], args[0]), - fold_convert (arg_type[1], args[1]), - fold_convert (arg_type[2], args[2])); + fully_fold_convert (arg_type[0], args[0]), + fully_fold_convert (arg_type[1], args[1]), + fully_fold_convert (arg_type[2], args[2])); break; default: gcc_unreachable ();
Re: [PATCH] Fix PR44214
On Mon, 2012-04-23 at 11:09 +0200, Richard Guenther wrote: On Fri, 20 Apr 2012, William J. Schmidt wrote: On Fri, 2012-04-20 at 11:32 -0700, H.J. Lu wrote: On Thu, Apr 19, 2012 at 6:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: It caused: FAIL: gcc.dg/torture/builtin-explog-1.c -O0 (test for excess errors) FAIL: gcc.dg/torture/builtin-power-1.c -O0 (test for excess errors) on x86. Hm, sorry, I don't know how that escaped my testing. This was due to the suggestion to have the optimize test encompass the -freciprocal-math test. Looks like this changes some expected behavior, at least for these two tests. Two options: Revert the move of the optimize test, or change the tests to require -O1 or above. Richard, what's your preference? Change the test to require -O1 or above. Richard. OK, following committed as obvious. Thanks, Bill gcc-testsuite: 2012-04-23 Bill Schmidt wschm...@linux.ibm.com PR regression/53076 * gcc.dg/torture/builtin-explog-1.c: Skip if -O0. * gcc.dg/torture/builtin-power-1.c: Likewise. 
Index: gcc/testsuite/gcc.dg/torture/builtin-explog-1.c === --- gcc/testsuite/gcc.dg/torture/builtin-explog-1.c (revision 186624) +++ gcc/testsuite/gcc.dg/torture/builtin-explog-1.c (working copy) @@ -7,6 +7,7 @@ /* { dg-do link } */ /* { dg-options "-ffast-math" } */ +/* { dg-skip-if "PR44214" { *-*-* } { "-O0" } { "" } } */ /* Define e with as many bits as found in builtins.c:dconste. */ #define M_E 2.7182818284590452353602874713526624977572470936999595749669676277241 Index: gcc/testsuite/gcc.dg/torture/builtin-power-1.c === --- gcc/testsuite/gcc.dg/torture/builtin-power-1.c (revision 186624) +++ gcc/testsuite/gcc.dg/torture/builtin-power-1.c (working copy) @@ -8,6 +8,7 @@ /* { dg-do link } */ /* { dg-options "-ffast-math" } */ /* { dg-add-options c99_runtime } */ +/* { dg-skip-if "PR44214" { *-*-* } { "-O0" } { "" } } */ #include "../builtins-config.h"
Re: [PATCH] Fix PR44214
On Fri, 2012-04-20 at 10:04 +0200, Richard Guenther wrote: On Thu, 19 Apr 2012, William J. Schmidt wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? See below ... Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * gcc.target/powerpc/pr44214-1.c: New test. * gcc.dg/pr44214-2.c: Likewise. * gcc.target/powerpc/pr44214-3.c: Likewise. Index: gcc/fold-const.c === --- gcc/fold-const.c(revision 186573) +++ gcc/fold-const.c(working copy) @@ -9693,6 +9693,48 @@ fold_addr_of_array_ref_difference (location_t loc, return NULL_TREE; } +/* If the real or vector real constant CST of type TYPE has an exact + inverse, return it, else return NULL. 
*/ + +static tree +exact_inverse (tree type, tree cst) +{ + REAL_VALUE_TYPE r; + tree unit_type, *elts; + enum machine_mode mode; + unsigned vec_nelts, i; + + switch (TREE_CODE (cst)) +{ +case REAL_CST: + r = TREE_REAL_CST (cst); + + if (exact_real_inverse (TYPE_MODE (type), r)) + return build_real (type, r); + + return NULL_TREE; + +case VECTOR_CST: + vec_nelts = VECTOR_CST_NELTS (cst); + elts = XALLOCAVEC (tree, vec_nelts); + unit_type = TREE_TYPE (type); + mode = TYPE_MODE (unit_type); + + for (i = 0; i vec_nelts; i++) + { + r = TREE_REAL_CST (VECTOR_CST_ELT (cst, i)); + if (!exact_real_inverse (mode, r)) + return NULL_TREE; + elts[i] = build_real (unit_type, r); + } + + return build_vector (type, elts); + +default: + return NULL_TREE; +} +} + /* Fold a binary expression of code CODE and type TYPE with operands OP0 and OP1. LOC is the location of the resulting expression. Return the folded expression if folding is successful. Otherwise, @@ -11734,23 +11776,25 @@ fold_binary_loc (location_t loc, so only do this if -freciprocal-math. We can actually always safely do it if ARG1 is a power of two, but it's hard to tell if it is or not in a portable manner. */ - if (TREE_CODE (arg1) == REAL_CST) + if (TREE_CODE (arg1) == REAL_CST + || (TREE_CODE (arg1) == COMPLEX_CST + COMPLEX_FLOAT_TYPE_P (TREE_TYPE (arg1))) + || (TREE_CODE (arg1) == VECTOR_CST + VECTOR_FLOAT_TYPE_P (TREE_TYPE (arg1 { if (flag_reciprocal_math - 0 != (tem = const_binop (code, build_real (type, dconst1), + 0 != (tem = fold_binary (code, type, build_one_cst (type), arg1))) Any reason for not using const_binop? As it turns out, no. I (blindly) made this change based on your comment in the PR... The fold code should probably simply use fold_binary to do the constant folding (which already should handle 1/x for x vector and complex. There is a build_one_cst to build the constant 1 for any type). ...but now that I've looked at it that was unnecessary, so I must have misinterpreted this. 
I'll revert to using const_binop. return fold_build2_loc (loc, MULT_EXPR, type, arg0, tem); - /* Find the reciprocal if optimizing and the result is exact. */ - if (optimize) + /* Find the reciprocal if optimizing and the result is exact. +TODO: Complex reciprocal not implemented. */ + if (optimize + TREE_CODE (arg1) != COMPLEX_CST) I know this is all pre-existing, but really the flag_reciprocal_math case should be under if (optimize), too. So, can you move this check to the toplevel covering both cases? Sure. The testcases should apply to generic vectors, too, and should scan the .original dump (where folding first applied). So they should not be target specific (and they should use -freciprocal-math). OK. I was ignorant of the generic vector syntax using __attribute__. If I change pr44214-1.c to use
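The "result is exact" condition being discussed has a simple characterization for binary floating point: `1.0/x` is exact precisely when `x` is a (finite, nonzero) power of two whose reciprocal neither overflows nor underflows. The helper below mirrors the *intent* of `exact_real_inverse`, not its implementation — `frexp` decomposes `x = m * 2^e` with `0.5 <= m < 1`, and `m == 0.5` identifies a pure power of two.

```c
#include <assert.h>
#include <math.h>

/* Sketch: does 1.0/x have an exact binary floating-point representation?  */
static int has_exact_reciprocal(double x)
{
    int e;
    double m;

    if (x == 0.0 || !isfinite(x))
        return 0;
    m = frexp(fabs(x), &e);           /* x = m * 2^e, 0.5 <= m < 1 */
    return m == 0.5                   /* power of two */
        && isnormal(1.0 / x);         /* reciprocal doesn't over/underflow */
}
```

This is why dividing by 2.0 can always be folded to multiplying by 0.5, while dividing by 3.0 can only be folded under -freciprocal-math.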
Re: [PATCH] Fix PR44214
On Fri, 2012-04-20 at 11:32 -0700, H.J. Lu wrote: On Thu, Apr 19, 2012 at 6:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: It caused: FAIL: gcc.dg/torture/builtin-explog-1.c -O0 (test for excess errors) FAIL: gcc.dg/torture/builtin-power-1.c -O0 (test for excess errors) on x86. Hm, sorry, I don't know how that escaped my testing. This was due to the suggestion to have the optimize test encompass the -freciprocal-math test. Looks like this changes some expected behavior, at least for these two tests. Two options: Revert the move of the optimize test, or change the tests to require -O1 or above. Richard, what's your preference? Thanks, Bill
[PATCH] Fix PR44214
This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by reciprocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * gcc.target/powerpc/pr44214-1.c: New test. * gcc.dg/pr44214-2.c: Likewise. * gcc.target/powerpc/pr44214-3.c: Likewise. Index: gcc/fold-const.c === --- gcc/fold-const.c(revision 186573) +++ gcc/fold-const.c(working copy) @@ -9693,6 +9693,48 @@ fold_addr_of_array_ref_difference (location_t loc, return NULL_TREE; } +/* If the real or vector real constant CST of type TYPE has an exact + inverse, return it, else return NULL.
*/ + +static tree +exact_inverse (tree type, tree cst) +{ + REAL_VALUE_TYPE r; + tree unit_type, *elts; + enum machine_mode mode; + unsigned vec_nelts, i; + + switch (TREE_CODE (cst)) +{ +case REAL_CST: + r = TREE_REAL_CST (cst); + + if (exact_real_inverse (TYPE_MODE (type), &r)) + return build_real (type, r); + + return NULL_TREE; + +case VECTOR_CST: + vec_nelts = VECTOR_CST_NELTS (cst); + elts = XALLOCAVEC (tree, vec_nelts); + unit_type = TREE_TYPE (type); + mode = TYPE_MODE (unit_type); + + for (i = 0; i < vec_nelts; i++) + { + r = TREE_REAL_CST (VECTOR_CST_ELT (cst, i)); + if (!exact_real_inverse (mode, &r)) + return NULL_TREE; + elts[i] = build_real (unit_type, r); + } + + return build_vector (type, elts); + +default: + return NULL_TREE; +} +} + /* Fold a binary expression of code CODE and type TYPE with operands OP0 and OP1. LOC is the location of the resulting expression. Return the folded expression if folding is successful. Otherwise, @@ -11734,23 +11776,25 @@ fold_binary_loc (location_t loc, so only do this if -freciprocal-math. We can actually always safely do it if ARG1 is a power of two, but it's hard to tell if it is or not in a portable manner. */ - if (TREE_CODE (arg1) == REAL_CST) + if (TREE_CODE (arg1) == REAL_CST + || (TREE_CODE (arg1) == COMPLEX_CST + && COMPLEX_FLOAT_TYPE_P (TREE_TYPE (arg1))) + || (TREE_CODE (arg1) == VECTOR_CST + && VECTOR_FLOAT_TYPE_P (TREE_TYPE (arg1)))) { if (flag_reciprocal_math - && 0 != (tem = const_binop (code, build_real (type, dconst1), + && 0 != (tem = fold_binary (code, type, build_one_cst (type), arg1))) return fold_build2_loc (loc, MULT_EXPR, type, arg0, tem); - /* Find the reciprocal if optimizing and the result is exact. */ - if (optimize) + /* Find the reciprocal if optimizing and the result is exact. +TODO: Complex reciprocal not implemented.
*/ + if (optimize + && TREE_CODE (arg1) != COMPLEX_CST) { - REAL_VALUE_TYPE r; - r = TREE_REAL_CST (arg1); - if (exact_real_inverse (TYPE_MODE(TREE_TYPE(arg0)), &r)) - { - tem = build_real (type, r); - return fold_build2_loc (loc, MULT_EXPR, type, - fold_convert_loc (loc, type, arg0), tem); - } + tree inverse = exact_inverse (TREE_TYPE (arg0), arg1); + + if (inverse) + return fold_build2_loc (loc, MULT_EXPR, type, arg0, inverse); } } /* Convert A/B/C to A/(B*C). */ Index: gcc/testsuite/gcc.target/powerpc/pr44214-3.c === --- gcc/testsuite/gcc.target/powerpc/pr44214-3.c(revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr44214-3.c(revision 0) @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=power7 -fdump-tree-optimized" } */ + +void do_div (vector double *a, vector double *b) +{ + *a = *b / (vector double) { 2.0, 2.0 }; +} + +/* Since 2.0 has an exact reciprocal, constant folding should multiply *b + by the
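The pr44214-3.c test above relies on 2.0 having an exact reciprocal. At source level the transformation is a strict identity for power-of-two divisors — the two scalar functions below are bit-for-bit equal on every input, which is why the fold is legal even without -freciprocal-math (the vector case applies the same substitution lane by lane).

```c
#include <assert.h>

/* Division by a power of two and multiplication by its reciprocal are
   the same exact scaling of the exponent in IEEE 754 binary floating
   point, so these two functions always agree bit-for-bit.  */
static double div_by_two(double x)  { return x / 2.0; }
static double mul_by_half(double x) { return x * 0.5; }
```

With a non-power-of-two constant (say 3.0), `x * (1.0/3.0)` can differ from `x / 3.0` in the last bit, which is exactly the case gated behind -freciprocal-math.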
[PATCH] Allow un-distribution with repeated factors (PR52976 follow-up)
The emergency reassociation patch for PR52976 disabled un-distribution in the presence of repeated factors to avoid ICEs in zero_one_operation. This patch fixes such cases properly by teaching zero_one_operation about __builtin_pow* calls. Bootstrapped with no new regressions on powerpc64-linux. Also built SPEC cpu2000 and cpu2006 successfully. Ok for trunk? Thanks, Bill gcc: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (stmt_is_power_of_op): New function. (decrement_power): Likewise. (propagate_op_to_single_use): Likewise. (zero_one_operation): Handle __builtin_pow* calls in linearized expression trees; factor logic into propagate_op_to_single_use. (undistribute_ops_list): Allow operands with repeat counts 1. gcc/testsuite: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com gfortran.dg/reassoc_7.f: New test. gfortran.dg/reassoc_8.f: Likewise. gfortran.dg/reassoc_9.f: Likewise. gfortran.dg/reassoc_10.f: Likewise. Index: gcc/testsuite/gfortran.dg/reassoc_10.f === --- gcc/testsuite/gfortran.dg/reassoc_10.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_10.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Q,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Q,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P*c2 + 3.*P**2*Q**3*c3 + END + +! There should be five multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 5 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_7.f === --- gcc/testsuite/gfortran.dg/reassoc_7.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_7.f (revision 0) @@ -0,0 +1,16 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P*c2 + 3.*P**2*c3 + END + +! 
There should be two multiplies following un-distribution. + +! { dg-final { scan-tree-dump-times \\\* 2 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_8.f === --- gcc/testsuite/gfortran.dg/reassoc_8.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_8.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P**2*c2 + 3.*P**3*c3 + END + +! There should be three multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 3 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_9.f === --- gcc/testsuite/gfortran.dg/reassoc_9.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_9.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P**2*c2 + 3.*P**4*c3 + END + +! There should be three multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 3 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186495) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -1020,6 +1020,98 @@ oecount_cmp (const void *p1, const void *p2) return c1-id - c2-id; } +/* Return TRUE iff STMT represents a builtin call that raises OP + to some exponent. 
*/ + +static bool +stmt_is_power_of_op (gimple stmt, tree op) +{ + tree fndecl; + + if (!is_gimple_call (stmt)) +return false; + + fndecl = gimple_call_fndecl (stmt); + + if (!fndecl + || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) +return false; + + switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt))) +{ +CASE_FLT_FN (BUILT_IN_POW): +CASE_FLT_FN (BUILT_IN_POWI): + return (operand_equal_p (gimple_call_arg (stmt, 0), op, 0)); + +default: + return false; +} +} + +/* Given STMT which is a __builtin_pow* call, decrement its exponent + in place and return the result. Assumes that stmt_is_power_of_op + was previously called
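The un-distribution this patch re-enables can be shown at source level. For the reassoc_7.f shape — `c1 + 2*P*c2 + 3*P**2*c3` — factoring out the repeated operand `P` reduces the number of multiplies; the two C functions below are mathematically equal (up to rounding, hence -ffast-math in the tests). Names and constants are illustrative, mirroring the Fortran testcase.

```c
#include <assert.h>
#include <math.h>

/* Before un-distribution: P appears in two products.  */
static double distributed(double c1, double c2, double c3, double p)
{
    return c1 + 2.0 * p * c2 + 3.0 * p * p * c3;
}

/* After un-distribution: the common factor P is pulled out, so fewer
   multiplies remain once constants are folded.  */
static double undistributed(double c1, double c2, double c3, double p)
{
    return c1 + p * (2.0 * c2 + 3.0 * p * c3);
}
```

The repeated-factor case (`P**2`, `P**3`, ...) is what previously tripped zero_one_operation, since the factor to strip was hidden inside a `__builtin_powi` call rather than an explicit multiply chain.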
[PATCH] Fix __builtin_powi placement (PR52976 follow-up)
The emergency patch for PR52976 manipulated the operand rank system to force inserted __builtin_powi calls to occur before uses of the call results. However, this is generally the wrong approach, as it forces other computations to move unnecessarily, and extends the lifetimes of other operands. This patch fixes the problem in the proper way, by letting the rank system determine where the __builtin_powi call belongs, and moving the call to that location during the expression rewrite. Bootstrapped with no new regressions on powerpc64-linux. SPEC cpu2000 and cpu2006 also build cleanly. Ok for trunk? Thanks, Bill gcc: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): Delete. (possibly_move_powi): New function. (rewrite_expr_tree): Call possibly_move_powi. (rewrite_expr_tree_parallel): Likewise. (attempt_builtin_powi): Change call of add_to_ops_vec_max_rank to call add_to_ops_vec instead. gcc/testsuite: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com gfortran.dg/reassoc_11.f: New test. Index: gcc/testsuite/gfortran.dg/reassoc_11.f === --- gcc/testsuite/gfortran.dg/reassoc_11.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_11.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when a __builtin_powi was introduced by reassociation in a bad place. + + SUBROUTINE GRDURBAN(URBWSTR, ZIURB, GRIDHT) + + IMPLICIT NONE + INTEGER :: I + REAL :: SW2, URBWSTR, ZIURB, GRIDHT(87) + + SAVE + + SW2 = 1.6*(GRIDHT(I)/ZIURB)**0.667*URBWSTR**2 + + END Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186495) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,28 +544,6 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } -/* Add an operand entry to *OPS for the tree operand OP, giving the - new entry a larger rank than any other operand already in *OPS. 
*/ - -static void -add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) -{ - operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); - operand_entry_t oe1; - unsigned i; - unsigned max_rank = 0; - - FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) -if (oe1->rank > max_rank) - max_rank = oe1->rank; - - oe->op = op; - oe->rank = max_rank + 1; - oe->id = next_operand_entry_id++; - oe->count = 1; - VEC_safe_push (operand_entry_t, heap, *ops, oe); -} - /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP. */ @@ -2162,6 +2242,47 @@ remove_visited_stmt_chain (tree var) } } +/* If OP is an SSA name, find its definition and determine whether it + is a call to __builtin_powi. If so, move the definition prior to + STMT. Only do this during early reassociation. */ + +static void +possibly_move_powi (gimple stmt, tree op) +{ + gimple stmt2; + tree fndecl; + gimple_stmt_iterator gsi1, gsi2; + + if (!first_pass_instance + || !flag_unsafe_math_optimizations + || TREE_CODE (op) != SSA_NAME) +return; + + stmt2 = SSA_NAME_DEF_STMT (op); + + if (!is_gimple_call (stmt2) + || !has_single_use (gimple_call_lhs (stmt2))) +return; + + fndecl = gimple_call_fndecl (stmt2); + + if (!fndecl + || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) +return; + + switch (DECL_FUNCTION_CODE (fndecl)) +{ +CASE_FLT_FN (BUILT_IN_POWI): + break; +default: + return; +} + + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); +} + /* This function checks three consequtive operands in passed operands vector OPS starting from OPINDEX and swaps two operands if it is profitable for binary operation @@ -2267,6 +2388,8 @@ rewrite_expr_tree (gimple stmt, unsigned int opind print_gimple_stmt (dump_file, stmt, 0, 0); } + possibly_move_powi (stmt, oe1->op); + possibly_move_powi (stmt, oe2->op); } return; } @@ -2312,6 +2435,8 @@ rewrite_expr_tree (gimple stmt, unsigned int opind fprintf
(dump_file, " into "); print_gimple_stmt (dump_file, stmt, 0, 0); } + + possibly_move_powi (stmt, oe->op); } /* Recurse on the LHS of the binary operator, which is guaranteed to be the non-leaf side. */ @@ -2485,6 +2610,9 @@ rewrite_expr_tree_parallel (gimple stmt, int width fprintf (dump_file, " into "); print_gimple_stmt (dump_file, stmts[i], 0, 0); } + + possibly_move_powi (stmts[i], op1); + possibly_move_powi (stmts[i], op2); } remove_visited_stmt_chain (last_rhs1);
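For readers following the thread, the call that possibly_move_powi repositions computes its first argument raised to an integer power. The sketch below models that semantic with binary (repeated-squaring) exponentiation; it is an illustration only, not GCC's implementation, and the name `powi` is ours:

```c
#include <assert.h>

/* Hypothetical model of what a __builtin_powi call computes: x raised
   to an integer power via binary (repeated-squaring) exponentiation.
   Illustrative only -- not GCC source.  */
static double powi (double x, unsigned int n)
{
  double result = 1.0;
  while (n)
    {
      if (n & 1)
        result *= x;   /* fold in the factor for this set bit */
      x *= x;          /* square the base for the next bit */
      n >>= 1;
    }
  return result;
}
```

Because each squaring feeds the next step, the call's result must be defined before any use, which is exactly the ordering constraint the gsi_move_before call above enforces.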
Re: [PATCH] Fix PR52976
On Mon, 2012-04-16 at 11:01 +0200, Richard Guenther wrote: On Sat, Apr 14, 2012 at 7:05 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This patch corrects two errors in reassociating expressions with repeated factors. First, undistribution needs to recognize repeated factors. For now, repeated factors will be ineligible for this optimization. In the future, this can be improved. Second, when a __builtin_powi call is introduced, its target SSA name must be given a rank higher than other operands in the operand list. Otherwise, uses of the call result may be introduced prior to the call. Bootstrapped and regression tested on powerpc64-linux. Confirmed that cpu2000 and cpu2006 SPEC tests build cleanly. OK for trunk? Ok, given it fixes quite some fallout. OK, thanks. But I wonder why the rank computation does not properly work automagically in the powi case. The reassociator generally tries to replace expressions in place unless the rank system tells it otherwise. At the moment, __builtin_powi calls are added right before the root of the reassociation chain (the last multiply). In the cases that failed, the natural rank of the call was one greater than the rank of the repeated factors, and there were other factors with higher rank than that. So the call was in the middle of the ranks but placement required it to have the highest rank. Because the call can't be further reassociated, it sort of ruins the flexibility of the rank system's placement algorithm. It would probably be better to insert the calls as early as necessary, but no earlier, to properly order things while letting the rank system do its job normally. That would help reduce lifetimes of reassociated values. I didn't see an obvious way to do that with a quick fix; I'm planning to think about it some more. Also for undistribution it looks like this might introduce missed optimizations? Thus, how hard would it be to teach it to properly handle ->count != 1? ISTR it does some counting itself.
I'm planning to work on that as well. I looked at it enough over the weekend to know it wasn't completely trivial, so I wanted to get the problem papered over for now. It shouldn't be too hard to get right. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-04-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/52976 * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): New function. (undistribute_ops_list): Ops with repeat counts aren't eligible for undistribution. (attempt_builtin_powi): Call add_to_ops_vec_max_rank. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186393) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,6 +544,28 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } +/* Add an operand entry to *OPS for the tree operand OP, giving the + new entry a larger rank than any other operand already in *OPS. */ + +static void +add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) +{ + operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); + operand_entry_t oe1; + unsigned i; + unsigned max_rank = 0; + + FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) +if (oe1->rank > max_rank) + max_rank = oe1->rank; + + oe->op = op; + oe->rank = max_rank + 1; + oe->id = next_operand_entry_id++; + oe->count = 1; + VEC_safe_push (operand_entry_t, heap, *ops, oe); +} + /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP.
*/ @@ -1200,6 +1222,7 @@ undistribute_ops_list (enum tree_code opcode, dcode = gimple_assign_rhs_code (oe1def); if ((dcode != MULT_EXPR && dcode != RDIV_EXPR) + || oe1->count != 1 || !is_reassociable_op (oe1def, dcode, loop)) continue; @@ -1243,6 +1266,8 @@ undistribute_ops_list (enum tree_code opcode, oecount c; void **slot; size_t idx; + if (oe1->count != 1) + continue; c.oecode = oecode; c.cnt = 1; c.id = next_oecount_id++; @@ -1311,7 +1336,7 @@ undistribute_ops_list (enum tree_code opcode, FOR_EACH_VEC_ELT (operand_entry_t, subops[i], j, oe1) { - if (oe1->op == c->op) + if (oe1->op == c->op && oe1->count == 1) { SET_BIT (candidates2, i); ++nr_candidates2; @@ -3275,8 +3300,10 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent gsi_insert_before (gsi, pow_stmt, GSI_SAME_STMT); } - /* Append the result of this iteration to the ops vector. */ - add_to_ops_vec (ops, iter_result); + /* Append the result
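The max-rank placement discussed in this thread can be pictured with a small stand-alone model. The struct and names below are stand-ins for illustration, not GCC's operand_entry: a new entry gets a rank one higher than every existing entry, so a rank-ordered rewrite places all uses of the new value after its definition.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the operand-entry rank bookkeeping; illustrative only.  */
struct op_entry
{
  int rank;
};

/* Return a rank strictly greater than every rank already in OPS, the
   rank the patch assigns to a freshly created __builtin_powi result.  */
static int
max_rank_plus_one (const struct op_entry *ops, size_t n)
{
  int max_rank = 0;
  for (size_t i = 0; i < n; i++)
    if (ops[i].rank > max_rank)
      max_rank = ops[i].rank;
  return max_rank + 1;
}
```

As Bill notes above, forcing the maximum rank is a blunt instrument; the follow-up patch instead lets the natural rank stand and moves the call during the rewrite.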
[PATCH] Fix PR52976
This patch corrects two errors in reassociating expressions with repeated factors. First, undistribution needs to recognize repeated factors. For now, repeated factors will be ineligible for this optimization. In the future, this can be improved. Second, when a __builtin_powi call is introduced, its target SSA name must be given a rank higher than other operands in the operand list. Otherwise, uses of the call result may be introduced prior to the call. Bootstrapped and regression tested on powerpc64-linux. Confirmed that cpu2000 and cpu2006 SPEC tests build cleanly. OK for trunk? Thanks, Bill 2012-04-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/52976 * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): New function. (undistribute_ops_list): Ops with repeat counts aren't eligible for undistribution. (attempt_builtin_powi): Call add_to_ops_vec_max_rank. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186393) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,6 +544,28 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } +/* Add an operand entry to *OPS for the tree operand OP, giving the + new entry a larger rank than any other operand already in *OPS. */ + +static void +add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) +{ + operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); + operand_entry_t oe1; + unsigned i; + unsigned max_rank = 0; + + FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) +if (oe1->rank > max_rank) + max_rank = oe1->rank; + + oe->op = op; + oe->rank = max_rank + 1; + oe->id = next_operand_entry_id++; + oe->count = 1; + VEC_safe_push (operand_entry_t, heap, *ops, oe); +} + /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP.
*/ @@ -1200,6 +1222,7 @@ undistribute_ops_list (enum tree_code opcode, dcode = gimple_assign_rhs_code (oe1def); if ((dcode != MULT_EXPR && dcode != RDIV_EXPR) + || oe1->count != 1 || !is_reassociable_op (oe1def, dcode, loop)) continue; @@ -1243,6 +1266,8 @@ undistribute_ops_list (enum tree_code opcode, oecount c; void **slot; size_t idx; + if (oe1->count != 1) + continue; c.oecode = oecode; c.cnt = 1; c.id = next_oecount_id++; @@ -1311,7 +1336,7 @@ undistribute_ops_list (enum tree_code opcode, FOR_EACH_VEC_ELT (operand_entry_t, subops[i], j, oe1) { - if (oe1->op == c->op) + if (oe1->op == c->op && oe1->count == 1) { SET_BIT (candidates2, i); ++nr_candidates2; @@ -3275,8 +3300,10 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent gsi_insert_before (gsi, pow_stmt, GSI_SAME_STMT); } - /* Append the result of this iteration to the ops vector. */ - add_to_ops_vec (ops, iter_result); + /* Append the result of this iteration to the ops vector. + Give it a rank higher than all other ranks in the ops vector + so that all uses of it will be forced to come after it. */ + add_to_ops_vec_max_rank (ops, iter_result); /* Decrement the occurrence count of each element in the product by the count found above, and remove this many copies of each
Re: [PATCH] Fix PR18589
On Thu, 2012-04-12 at 09:50 -0700, H.J. Lu wrote: On Thu, Apr 5, 2012 at 6:49 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. 
It breaks bootstrap on Linux/ia32: ../../src-trunk/gcc/tree-ssa-reassoc.c: In function 'void attempt_builtin_powi(gimple, VEC_operand_entry_t_heap**, tree_node**)': ../../src-trunk/gcc/tree-ssa-reassoc.c:3189:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ ../../src-trunk/gcc/tree-ssa-reassoc.c:3222:44: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ cc1plus: all warnings being treated as errors H.J. Whoops. Looks like I need to use HOST_WIDE_INT_PRINT_DEC instead of %ld in those spots. I'll get a fix prepared.
Re: [PATCH] Fix PR18589
On Thu, 2012-04-12 at 09:50 -0700, H.J. Lu wrote: On Thu, Apr 5, 2012 at 6:49 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. 
It breaks bootstrap on Linux/ia32: ../../src-trunk/gcc/tree-ssa-reassoc.c: In function 'void attempt_builtin_powi(gimple, VEC_operand_entry_t_heap**, tree_node**)': ../../src-trunk/gcc/tree-ssa-reassoc.c:3189:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ ../../src-trunk/gcc/tree-ssa-reassoc.c:3222:44: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ cc1plus: all warnings being treated as errors H.J. Thanks, H.J. Sorry for the problem! Fixing as follows. I'll plan to commit as obvious shortly. 2012-04-12 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (attempt_builtin_powi_stats): Change %ld to HOST_WIDE_INT_PRINT_DEC in format strings. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186384) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -3186,7 +3186,8 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent if (elt < vec_len - 1) fputs (" * ", dump_file); } - fprintf (dump_file, ")^%ld\n", power); + fprintf (dump_file, ")^" HOST_WIDE_INT_PRINT_DEC "\n", + power); } } } @@ -3219,7 +3220,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent if (elt < vec_len - 1) fputs (" * ", dump_file); } - fprintf (dump_file, ")^%ld\n", power); + fprintf (dump_file, ")^" HOST_WIDE_INT_PRINT_DEC "\n", power); } reassociate_stats.pows_created++;
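The fix works because HOST_WIDE_INT_PRINT_DEC is a string literal spliced into the format by adjacent-literal concatenation, so the right conversion is chosen per host. Standard C offers the same idiom through `<inttypes.h>`; the sketch below is an analogy, not GCC code, and note one difference: PRId64 supplies only the conversion letters, so the `%` must be written explicitly, whereas HOST_WIDE_INT_PRINT_DEC includes it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Portable formatting of a 64-bit value, analogous to the
   HOST_WIDE_INT_PRINT_DEC splice in the patch.  Illustrative only.  */
static void
format_power (char *buf, size_t len, int64_t power)
{
  /* ")^%" PRId64 "\n" concatenates to e.g. ")^%lld\n" on ILP32 hosts,
     which is what the ia32 bootstrap needed.  */
  snprintf (buf, len, ")^%" PRId64 "\n", power);
}
```

Hard-coding `%ld` instead would reproduce exactly the -Werror=format failure quoted above on hosts where the 64-bit type is `long long`.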
Re: [PATCH] Fix PR52614
On Thu, 2012-04-05 at 11:30 +0200, Richard Guenther wrote: On Thu, Apr 5, 2012 at 6:22 AM, Mike Stump mikest...@comcast.net wrote: On Apr 4, 2012, at 7:56 PM, William J. Schmidt wrote: There seems to be tacit agreement that the vector tests should use -fno-common on all targets to avoid the recent spate of failures (see discussion in 52571 and 52603). OK for trunk? Ok. Any other solution I think will be real work and we shouldn't lose the testing between now and then by not having the test cases working. Ian, you are the source of all of these problems. While I did not notice any degradations in SPEC (on x86) with handling commons correctly now, the fact that our testsuite needs -fno-common to make things vectorizable shows that users might be impacted negatively by this, which is only a real problem in corner cases. Why can the link editor not promote the definition's alignment when merging with a common with bigger alignment? Richard. Follow-up question: Should -ftree-vectorize imply -fno-common in the short term? Thanks, Bill
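For context on why -fno-common matters to these tests, here is an illustrative sketch (names ours, not from the testsuite): under -fcommon a tentative definition becomes a "common" symbol whose final alignment is resolved at link time, so the compiler cannot rely on the wide alignment the vectorizer wants; with -fno-common, or with an explicit initializer, the definition lands in .bss/.data with a compile-time-known alignment.

```c
#include <assert.h>

float common_buf[256];            /* tentative definition: a common symbol under -fcommon */
float known_buf[256] = { 1.0f };  /* initialized definition: never common */

/* The kind of loop the vect tests check; whether it is vectorized with
   aligned accesses can depend on which definition form is used.  */
float
sum (const float *p, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += p[i];
  return s;
}
```

This is why adding -fno-common to DEFAULT_VECTCFLAGS restores the expected "vectorized with aligned access" dump matches without editing each test.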
Re: [PATCH] Fix PR18589
On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. gcc/testsuite: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * gcc.dg/tree-ssa/pr18589-1.c: New test. * gcc.dg/tree-ssa/pr18589-2.c: Likewise. 
* gcc.dg/tree-ssa/pr18589-3.c: Likewise. * gcc.dg/tree-ssa/pr18589-4.c: Likewise. * gcc.dg/tree-ssa/pr18589-5.c: Likewise. * gcc.dg/tree-ssa/pr18589-6.c: Likewise. * gcc.dg/tree-ssa/pr18589-7.c: Likewise. * gcc.dg/tree-ssa/pr18589-8.c: Likewise. * gcc.dg/tree-ssa/pr18589-9.c: Likewise. * gcc.dg/tree-ssa/pr18589-10.c: Likewise. Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y, double z, double u) +{ + return x * x * y * y * y * z * z * z * z * u; +} + +/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y, double z, double u) +{ + return x * x * x * y * y * y * z * z * z * z * u * u * u * u; +} + +/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y) +{ + return __builtin_pow (x, 3.0) * __builtin_pow (y, 4.0); +} + +/* { dg-final { scan-tree-dump-times " \\* " 4 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c (revision 0) @@
-0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +float baz (float x, float y) +{ + return x * x * x * x * y * y * y * y; +} + +/* { dg-final { scan-tree-dump-times " \\* " 3 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-8.c
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote: On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi, This is a re-post of the patch I posted for comments in January to address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589. The patch modifies reassociation to expose repeated factors from __builtin_pow* calls, optimally reassociate repeated factors, and possibly reconstitute __builtin_powi calls from the results of reassociation. Bootstrapped and passes regression tests for powerpc64-linux-gnu. I expect there may need to be some small changes, but I am targeting this for trunk approval. Thanks very much for the review, Hmm. How much work would it be to extend the reassoc 'IL' to allow a repeat factor per op? I realize what you do is all within what reassoc already does though ideally we would not require any GIMPLE IL changes for building up / optimizing the reassoc IL but only do so when we commit changes. Thanks, Richard. Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. Thus, I'd have prefered if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. 
That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? Multiple powi results are already handled, but yes, what you're suggesting would simplify things by eliminating the need to create explicit multiplies to join them and the cached-multiply results together. Sounds reasonable on the surface; it just hadn't occurred to me to do it this way. I'll have a look. Any other major concerns while I'm reworking this? Thanks, Bill Thanks, Richard.
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 15:08 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 2:35 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote: On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi, This is a re-post of the patch I posted for comments in January to address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589. The patch modifies reassociation to expose repeated factors from __builtin_pow* calls, optimally reassociate repeated factors, and possibly reconstitute __builtin_powi calls from the results of reassociation. Bootstrapped and passes regression tests for powerpc64-linux-gnu. I expect there may need to be some small changes, but I am targeting this for trunk approval. Thanks very much for the review, Hmm. How much work would it be to extend the reassoc 'IL' to allow a repeat factor per op? I realize what you do is all within what reassoc already does though ideally we would not require any GIMPLE IL changes for building up / optimizing the reassoc IL but only do so when we commit changes. Thanks, Richard. Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. 
Thus, I'd have preferred if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? Multiple powi results are already handled, but yes, what you're suggesting would simplify things by eliminating the need to create explicit multiplies to join them and the cached-multiply results together. Sounds reasonable on the surface; it just hadn't occurred to me to do it this way. I'll have a look. Any other major concerns while I'm reworking this? No, the rest looks fine (you should not need to replace -fdump-tree-reassoc-details with -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details in the first testcase). Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Frankly I was surprised and relieved that there weren't more tests that used the generic -fdump-tree-reassoc. Thanks, Bill Thanks, Richard. Thanks, Bill Thanks, Richard.
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. Thus, I'd have prefered if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? An excellent suggestion. I've implemented it below and it is indeed much cleaner this way. Bootstrapped/regression tested with no new failures on powerpc64-linux. Is this incarnation OK for trunk? Thanks, Bill Thanks, Richard. Thanks, Bill gcc: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-pass.h: Replace pass_reassoc with pass_early_reassoc and pass_late_reassoc. * passes.c (init_optimization_passes): Change pass_reassoc calls to pass_early_reassoc and pass_late_reassoc. * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (early_reassoc): New static var. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. 
(linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. (execute_early_reassoc): New function. (execute_late_reassoc): Likewise. (pass_early_reassoc): Rename from pass_reassoc, call execute_early_reassoc. (pass_late_reassoc): New gimple_opt_pass that calls execute_late_reassoc. gcc/testsuite: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * gcc.dg/pr46309.c: Change -fdump-tree-reassoc-details to -fdump-tree-reassoc1-details and -fdump-tree-reassoc2-details. * gcc.dg/tree-ssa/pr18589-1.c: New test. * gcc.dg/tree-ssa/pr18589-2.c: Likewise. * gcc.dg/tree-ssa/pr18589-3.c: Likewise. * gcc.dg/tree-ssa/pr18589-4.c: Likewise. * gcc.dg/tree-ssa/pr18589-5.c: Likewise. * gcc.dg/tree-ssa/pr18589-6.c: Likewise. * gcc.dg/tree-ssa/pr18589-7.c: Likewise. * gcc.dg/tree-ssa/pr18589-8.c: Likewise. * gcc.dg/tree-ssa/pr18589-9.c: Likewise. * gcc.dg/tree-ssa/pr18589-10.c: Likewise. 
Index: gcc/tree-pass.h === --- gcc/tree-pass.h (revision 186108) +++ gcc/tree-pass.h (working copy) @@ -441,7 +441,8 @@ extern struct gimple_opt_pass pass_copy_prop; extern struct gimple_opt_pass pass_vrp; extern struct gimple_opt_pass pass_uncprop; extern struct gimple_opt_pass pass_return_slot; -extern struct gimple_opt_pass pass_reassoc; +extern struct gimple_opt_pass pass_early_reassoc; +extern struct gimple_opt_pass pass_late_reassoc; extern struct gimple_opt_pass pass_rebuild_cgraph_edges; extern struct gimple_opt_pass pass_remove_cgraph_callee_edges; extern struct gimple_opt_pass pass_build_cgraph_edges; Index: gcc/testsuite/gcc.dg/pr46309.c === --- gcc/testsuite/gcc.dg/pr46309.c (revision 186108) +++ gcc/testsuite/gcc.dg/pr46309.c (working copy) @@ -1,6 +1,6 @@ /* PR tree-optimization/46309 */ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-reassoc-details" } */ +/* { dg-options "-O2 -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details" } */ /* The transformation depends on BRANCH_COST being greater than 1 (see the notes in the PR), so try to force that. */ /* { dg-additional-options "-mtune=octeon2" { target mips*-*-* } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c
[PATCH] Fix PR52614
There seems to be tacit agreement that the vector tests should use -fno-common on all targets to avoid the recent spate of failures (see discussion in 52571 and 52603). This patch (proposed by Dominique D'Humieures) does just that. I agreed to shepherd the patch through. I've verified that it removes the failures for powerpc64-linux. Various others have verified for arm, sparc, and darwin. OK for trunk? Thanks, Bill gcc/testsuite: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com Dominique D'Humieures domi...@lps.ens.fr PR testsuite/52614 * gcc.dg/vect/vect.exp: Use -fno-common on all targets. * gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp: Likewise. Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp === --- gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp (revision 186108) +++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp (working copy) @@ -34,7 +34,7 @@ if ![is-effective-target powerpc_altivec_ok] { set DEFAULT_VECTCFLAGS "" # These flags are used for all targets. -lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fvect-cost-model" +lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fvect-cost-model" "-fno-common" # If the target system supports vector instructions, the default action # for a test is 'run', otherwise it's 'compile'. Save current default. Index: gcc/testsuite/gcc.dg/vect/vect.exp === --- gcc/testsuite/gcc.dg/vect/vect.exp (revision 186108) +++ gcc/testsuite/gcc.dg/vect/vect.exp (working copy) @@ -40,7 +40,7 @@ if ![check_vect_support_and_set_flags] { } # These flags are used for all targets. -lappend DEFAULT_VECTCFLAGS "-ftree-vectorize" "-fno-vect-cost-model" +lappend DEFAULT_VECTCFLAGS "-ftree-vectorize" "-fno-vect-cost-model" "-fno-common" # Initialize `dg'. dg-init
Re: [PATCH] Fix PR18589
On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote:
> On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > This is a re-post of the patch I posted for comments in January to
> > address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589.  The patch
> > modifies reassociation to expose repeated factors from __builtin_pow*
> > calls, optimally reassociate repeated factors, and possibly
> > reconstitute __builtin_powi calls from the results of reassociation.
> >
> > Bootstrapped and passes regression tests for powerpc64-linux-gnu.  I
> > expect there may need to be some small changes, but I am targeting
> > this for trunk approval.
> >
> > Thanks very much for the review,
>
> Hmm.  How much work would it be to extend the reassoc 'IL' to allow a
> repeat factor per op?  I realize what you do is all within what reassoc
> already does, though ideally we would not require any GIMPLE IL changes
> for building up / optimizing the reassoc IL, but only do so when we
> commit changes.
>
> Thanks,
> Richard.

Hi Richard,

I've revised my patch along these lines; see the new version below.  While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases.

Regstrapped successfully on powerpc64-linux.  Hope this looks better!

Thanks,
Bill

gcc:

2012-04-03  Bill Schmidt  <wschm...@linux.vnet.ibm.com>

	PR tree-optimization/18589
	* tree-pass.h: Replace pass_reassoc with pass_early_reassoc and
	pass_late_reassoc.
	* passes.c (init_optimization_passes): Change pass_reassoc calls
	to pass_early_reassoc and pass_late_reassoc.
	* tree-ssa-reassoc.c (reassociate_stats): Add two fields.
	(operand_entry): Add count field.
	(early_reassoc): New static var.
	(add_repeat_to_ops_vec): New function.
	(completely_remove_stmt): Likewise.
	(remove_def_if_absorbed_call): Likewise.
	(remove_visited_stmt_chain): Remove feeding builtin pow/powi calls.
	(acceptable_pow_call): New function.
	(linearize_expr_tree): Look for builtin pow/powi calls and add
	operand entries with repeat counts when found.
	(repeat_factor_d): New struct and associated typedefs.
	(repeat_factor_vec): New static vector variable.
	(compare_repeat_factors): New function.
	(get_reassoc_pow_ssa_name): Likewise.
	(attempt_builtin_powi): Likewise.
	(reassociate_bb): Attempt to create __builtin_powi calls, and
	multiply their results by any leftover reassociated factors;
	remove builtin pow/powi calls that were absorbed by reassociation.
	(fini_reassoc): Two new calls to statistics_counter_event.
	(execute_early_reassoc): New function.
	(execute_late_reassoc): Likewise.
	(pass_early_reassoc): Replace pass_reassoc, renamed to reassoc1;
	call execute_early_reassoc.
	(pass_late_reassoc): New gimple_opt_pass named reassoc2 that calls
	execute_late_reassoc.

gcc/testsuite:

2012-04-03  Bill Schmidt  <wschm...@linux.vnet.ibm.com>

	PR tree-optimization/18589
	* gcc.dg/pr46309.c: Change -fdump-tree-reassoc-details to
	-fdump-tree-reassoc[12]-details.
	* gcc.dg/tree-ssa/pr18589-1.c: New test.
	* gcc.dg/tree-ssa/pr18589-2.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-3.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-4.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-5.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-6.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-7.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-8.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-9.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-10.c: Likewise.
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 186108)
+++ gcc/tree-pass.h	(working copy)
@@ -441,7 +441,8 @@ extern struct gimple_opt_pass pass_copy_prop;
 extern struct gimple_opt_pass pass_vrp;
 extern struct gimple_opt_pass pass_uncprop;
 extern struct gimple_opt_pass pass_return_slot;
-extern struct gimple_opt_pass pass_reassoc;
+extern struct gimple_opt_pass pass_early_reassoc;
+extern struct gimple_opt_pass pass_late_reassoc;
 extern struct gimple_opt_pass pass_rebuild_cgraph_edges;
 extern struct gimple_opt_pass pass_remove_cgraph_callee_edges;
 extern struct gimple_opt_pass pass_build_cgraph_edges;
Index: gcc/testsuite/gcc.dg/pr46309.c
===================================================================
--- gcc/testsuite/gcc.dg/pr46309.c	(revision 186108)
+++ gcc/testsuite/gcc.dg/pr46309.c	(working copy)
@@ -1,6 +1,6 @@
 /* PR tree-optimization/46309 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc-details" } */
+/* { dg-options "-O2 -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details" } */
 /* The transformation depends on BRANCH_COST being greater than 1 (see
Re: [PATCH] Fix PR18589
On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote:
> On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > This is a re-post of the patch I posted for comments in January to
> > address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589.  The patch
> > modifies reassociation to expose repeated factors from __builtin_pow*
> > calls, optimally reassociate repeated factors, and possibly
> > reconstitute __builtin_powi calls from the results of reassociation.
> >
> > Bootstrapped and passes regression tests for powerpc64-linux-gnu.  I
> > expect there may need to be some small changes, but I am targeting
> > this for trunk approval.
> >
> > Thanks very much for the review,
>
> Hmm.  How much work would it be to extend the reassoc 'IL' to allow a
> repeat factor per op?  I realize what you do is all within what reassoc
> already does, though ideally we would not require any GIMPLE IL changes
> for building up / optimizing the reassoc IL, but only do so when we
> commit changes.

Ah, I take your point.  I will look into it.  We still need the additional data structures to allow sorting by factor repeat counts, but perhaps expanding the builtins can be avoided until it's proven necessary.  The patch as submitted may be slightly easier to implement and understand, but I agree it would be better to avoid changing GIMPLE unnecessarily if possible.  I'll get back to you shortly.

Thanks,
Bill

> Thanks,
> Richard.
Re: [PATCH] Straight line strength reduction, part 1
On Wed, 2012-03-21 at 10:33 +0100, Richard Guenther wrote:
> On Mon, Mar 19, 2012 at 2:19 AM, Andrew Pinski <pins...@gmail.com> wrote:
> > On Sun, Mar 18, 2012 at 6:12 PM, William J. Schmidt
> > <wschm...@linux.vnet.ibm.com> wrote:
> > > Greetings,
> > >
> > > Now that we're into stage 1 again, I'd like to submit the first round
> > > of changes for dominator-based strength reduction, which will address
> > > issues from PR22586, PR35308, PR46556, and perhaps others.
> > >
> > > I'm attaching two patches: the smaller (slsr-part1) is the patch I'm
> > > submitting for approval today, while the larger (slsr-fyi) is for
> > > reference only, but may be useful if questions arise about how the
> > > small patch fits into the intended whole.
> > >
> > > This patch contains the logic for identifying strength reduction
> > > candidates, and makes replacements only for those candidates where
> > > the stride is a fixed constant.  Replacement for candidates with
> > > fixed but unknown strides is not implemented herein, but that logic
> > > can be viewed in the larger patch.  This patch does not address
> > > strength reduction of data reference expressions, or candidates with
> > > conditional increments; those issues will be dealt with in future
> > > patches.
> > >
> > > The cost model is built on the one used by tree-ssa-ivopts.c, and
> > > I've added some new instruction costs to that model in place.  It
> > > might eventually be good to divorce that modeling code from IVOPTS,
> > > but that's an orthogonal patch and somewhat messy.
> >
> > I think this is the wrong way to do straight line strength reduction,
> > considering we have a nice value numbering system which should be easy
> > to extend to support it.
>
> Well, it is easy to handle very specific easy cases like
>
>   a = i * 2;
>   b = i * 3;
>   c = i * 4;
>
> to transform it to
>
>   a = i * 2;
>   b = a + i;
>   c = b + i;
>
> but already
>
>   a = i * 2;
>   b = i * 4;
>   c = i * 6;
>
> would need extra special code.  The easy case could be handled in
> eliminate () by, when seeing A * CST, looking up A * (CST - 1) and, if
> that succeeds, transforming it to VAL + A.  The cost issue is that this
> increases the lifetime of VAL.
> I've done this simple case at some point, but it failed to handle the
> common associated cases, where we transform (a + 1) * 2, (a + 1) * 3,
> etc. to a * 2 + 2, a * 3 + 3, etc.  I think it is the re-association in
> the presence of a strength-reduction opportunity that makes the
> separate pass better?  How would you suggest handling this case in the
> VN framework?  Detect the a * 3 + 3 pattern and then do two lookups,
> one for a * 2 and one for val + 2?  But then we still don't have a
> value for a + 1 to re-use ...

And it becomes even more difficult with more complex scenarios.  Consider:

  a = x + (3 * s);
  b = x + (5 * s);
  c = x + (7 * s);

The framework I've developed recognizes that this group of statements is related, and that it is profitable to replace them as follows:

  a = x + (3 * s);
  t = 2 * s;
  b = a + t;
  c = b + t;

The introduced multiply by 2 (one shift) is far cheaper than the two multiplies it replaces.  However, suppose you have instead:

  a = x + (2 * s);
  b = x + (8 * s);

Now it isn't profitable to replace this by:

  a = x + (2 * s);
  t = 6 * s;
  b = a + t;

since a multiply by 6 (two shifts and an add) is more costly than a multiply by 8 (one shift).  Making these decisions correctly requires analyzing all the related statements together, which value numbering as it stands is not equipped to do.  Logic to handle these cases is included in my larger FYI patch.

As another example, consider conditionally executed increments:

  a = i * 5;
  if (...)
    i = i + 1;
  b = i * 5;

This can be correctly and profitably strength-reduced as:

  a = i * 5;
  t = a;
  if (...)
    {
      i = i + 1;
      t = t + 5;
    }
  b = t;

(This is an approximation of the actual phi representation, which I've omitted for clarity.)  Again, this kind of analysis is not something that fits naturally into value numbering.  I don't yet have this in the FYI patch, but have it largely working in a private version.
My conclusion is that if strength reduction is done in value numbering, it must either be a very limited form of strength reduction, or the kind of logic I've developed that considers chains of related candidates together must be glued onto value numbering.  I think the latter would be a mistake, as it would introduce much unnecessary complexity to what is currently a very clean approach to PRE; the strength reduction would become an ugly wart that people would complain about.  I think it's far cleaner to keep the two issues separate.

> Bill, experimenting with pattern detection in eliminate () would be a
> possibility.

For the reasons expressed above, I don't think that would get very far or make anyone very happy...  I appreciate Andrew's view that value numbering is a logical place to do strength reduction, but after considering the problem over the last few months I have to disagree.  If you don't mind, at this point I would prefer to have my current patch considered on its merits.

Thanks,
Bill