[PATCH, rs6000] Remove XFAIL from default_format_denormal_2.f90 for PowerPC on Linux
Hi, The testcase gfortran.dg/default_format_denormal_2.f90 has been reporting XPASS since 4.8 on the powerpc*-unknown-linux-gnu platforms. This patch removes the XFAIL for powerpc*-*-linux-* from the test. I believe this pattern doesn't match any other platforms, but please let me know if I should replace it with a more specific pattern instead. Verified on powerpc64-unknown-linux-gnu (-m32 and -m64) and powerpc64le-unknown-linux-gnu (-m64). Is this ok for trunk, 4.9, and 4.8? Thanks, Bill 2014-06-17 Bill Schmidt wschm...@linux.vnet.ibm.com * gfortran.dg/default_format_denormal_2.f90: Remove xfail for powerpc*-*-linux*. Index: gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 === --- gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (revision 211741) +++ gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (working copy) @@ -1,5 +1,5 @@ ! { dg-require-effective-target fortran_large_real } -! { dg-do run { xfail powerpc*-apple-darwin* powerpc*-*-linux* } } +! { dg-do run { xfail powerpc*-apple-darwin* } } ! Test XFAILed on these platforms because the system's printf() lacks ! proper support for denormalized long doubles. See PR24685 !
Re: [PATCH] Fix PR54674
On Tue, 2012-09-25 at 09:14 +0200, Richard Guenther wrote: On Mon, 24 Sep 2012, William J. Schmidt wrote: In cases where pointers and ints are cast back and forth, SLSR can be tricked into introducing a multiply where one of the operands is of pointer type. Don't do that! Verified that the reduced test case in the PR is fixed with a cross-compile to sh4-unknown-linux-gnu with -Os, which is the only known situation where the replacement looks profitable. (It appears multiply costs are underestimated.) Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Ok. Btw, a multiply by/of a pointer in GIMPLE is done by casting to an appropriate unsigned type, doing the multiply, and then casting back to the pointer type. Just in case it _is_ profitable to do the transform (the patch seems to try to avoid the situation only?) Ok, that's good to know, thanks. There's a general to-do in that area to make the whole casting part better than it is right now, and that should be addressed when I can get back to GCC and work on some of these things. I'll add a comment to that effect. Appreciate the information! Thanks, Bill Thanks, Richard. Thanks, Bill 2012-09-24 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (analyze_increments): Don't introduce a multiplication with a pointer operand. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191665) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -2028,6 +2028,15 @@ analyze_increments (slsr_cand_t first_dep, enum ma incr_vec[i].cost = COST_INFINITE; + /* If we need to add an initializer, make sure we don't introduce +a multiply by a pointer type, which can happen in certain cast +scenarios. 
*/ + else if (!incr_vec[i].initializer && TREE_CODE (first_dep->stride) != INTEGER_CST && POINTER_TYPE_P (TREE_TYPE (first_dep->stride))) + + incr_vec[i].cost = COST_INFINITE; + /* For any other increment, if this is a multiply candidate, we must introduce a temporary T and initialize it with T_0 = stride * increment. When optimizing for speed, walk the
[PATCH] Fix PR54492
Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand;
Re: [PATCH] Fix PR54492
On Mon, 2012-09-10 at 16:45 +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, William J. Schmidt wrote: Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Hm, rather than stopping the search, can we stop adding new candidates instead so the list never grows that long? If that's not easy the patch is ok as-is. I think this way is probably better. Right now the potential bases are organized as a stack, with new ones added to the front and considered first. Stopping additions instead would require adding state to keep a count, and then we would only be looking at the most distant ones. This way the 50 most recently added potential bases (most likely to be local) are considered. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand;
Re: [PATCH] Fix PR54492
On Mon, 2012-09-10 at 16:56 +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, Jakub Jelinek wrote: On Mon, Sep 10, 2012 at 04:45:24PM +0200, Richard Guenther wrote: On Mon, 10 Sep 2012, William J. Schmidt wrote: Richard found some N^2 behavior in SLSR that has to be suppressed. Searching for the best possible basis is overkill when there are hundreds of thousands of possibilities. This patch constrains the search to "good enough" in such cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no regressions. Ok for trunk? Hm, rather than stopping the search, can we stop adding new candidates instead so the list never grows that long? If that's not easy the patch is ok as-is. Don't we want a param for that, or is a hardcoded magic constant fine here? I suppose a param for it would be nice. OK, I'll get a param in place and get back to you. Thanks... Bill Richard. 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + const int MAX_ITERS = 50; + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand; Jakub
Re: [PATCH] Fix PR54492
Here's the revised patch with a param. Bootstrapped and tested in the same manner. Ok for trunk? Thanks, Bill 2012-08-10 Bill Schmidt wschm...@linux.vnet.ibm.com * doc/invoke.texi (max-slsr-cand-scan): New description. * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit the time spent searching for a basis. * params.def (PARAM_MAX_SLSR_CANDIDATE_SCAN): New param. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 191135) +++ gcc/doc/invoke.texi (working copy) @@ -9407,6 +9407,11 @@ having a regular register file and accurate regist See @file{haifa-sched.c} in the GCC sources for more details. The default choice depends on the target. + +@item max-slsr-cand-scan +Set the maximum number of existing candidates that will be considered when +seeking a basis for a new straight-line strength reduction candidate. + @end table @end table Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 191135) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3. If not see #include "domwalk.h" #include "pointer-set.h" #include "expmed.h" +#include "params.h" /* Information about a strength reduction candidate. Each statement in the candidate table represents an expression of one of the @@ -353,10 +354,14 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; + // Limit potential of N^2 behavior for long candidate chains. + int iters = 0; + int max_iters = PARAM_VALUE (PARAM_MAX_SLSR_CANDIDATE_SCAN); + mapping_key.base_expr = c->base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); - for (; chain; chain = chain->next) + for (; chain && iters < max_iters; chain = chain->next, ++iters) { slsr_cand_t one_basis = chain->cand; Index: gcc/params.def === --- gcc/params.def (revision 191135) +++ gcc/params.def (working copy) @@ -973,6 +973,13 @@ DEFPARAM (PARAM_SCHED_PRESSURE_ALGORITHM, "Which -fsched-pressure algorithm to apply", 1, 1, 2) +/* Maximum length of candidate scans in straight-line strength reduction. */ +DEFPARAM (PARAM_MAX_SLSR_CANDIDATE_SCAN, + "max-slsr-cand-scan", + "Maximum length of candidate scans for straight-line " + "strength reduction", + 50, 1, 99) + /* Local variables: mode:c
Re: [patch] rs6000: plug a leak
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote: Hello Bill, This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to free the array that get_loop_body returns. Noticed while going over all uses of get_loop_body (it's a common mistake to leak the return array). Patch is completely untested because I don't know when/how this function is used. You've added this function: 2012-07-31 Bill Schmidt ... * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. --(rs6000_density_test): New function so I suppose you know what it's for and how to test this patch :-) Could you test this for me and commit it if nothing strange happens? Hi Steven, Regstrapped with no additional failures on powerpc64-unknown-linux-gnu. Built CPU2006 without error. Committed as obvious. Thanks again, Bill Thanks, Ciao! Steven Index: config/rs6000/rs6000.c === --- config/rs6000/rs6000.c (revision 190601) +++ config/rs6000/rs6000.c (working copy) @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d not_vec_cost++; } } + free (bbs); density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
Re: [patch] rs6000: plug a leak
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote: Hello Bill, This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to free the array that get_loop_body returns. Noticed while going over all uses of get_loop_body (it's a common mistake to leak the return array). Patch is completely untested because I don't know when/how this function is used. You've added this function: 2012-07-31 Bill Schmidt ... * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. --(rs6000_density_test): New function so I suppose you know what it's for and how to test this patch :-) Could you test this for me and commit it if nothing strange happens? Sure thing! Thanks for catching this. Bill Thanks, Ciao! Steven Index: config/rs6000/rs6000.c === --- config/rs6000/rs6000.c (revision 190601) +++ config/rs6000/rs6000.c (working copy) @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d not_vec_cost++; } } + free (bbs); density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
[PATCH] Fix PR54240
Replace the once vacuously true, and now vacuously false, test for existence of a conditional move instruction for a given mode with one that actually checks what it's supposed to. Add a test case so we don't miss such things in future. The test is powerpc-specific. It would be good to have an i386 version of the test as well, if someone can help with that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-13 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54240 * tree-ssa-phiopt.c (hoist_adjacent_loads): Correct test for existence of conditional move with given mode. gcc/testsuite: 2012-08-13 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54240 * gcc.target/powerpc/pr54240.c: New test. Index: gcc/testsuite/gcc.target/powerpc/pr54240.c === --- gcc/testsuite/gcc.target/powerpc/pr54240.c (revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr54240.c (revision 0) @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -misel -fdump-tree-phiopt-details" } */ + +typedef struct s { + int v; + int b; + struct s *l; + struct s *r; +} S; + + +int foo(S *s) +{ + S *this; + S *next; + + this = s; + if (this->b) +next = this->l; + else +next = this->r; + + return next->v; +} + +/* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */ +/* { dg-final { cleanup-tree-dump "phiopt1" } } */ Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 190305) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -1843,7 +1843,8 @@ hoist_adjacent_loads (basic_block bb0, basic_block /* Check the mode of the arguments to be sure a conditional move can be generated for it. */ - if (!optab_handler (cmov_optab, TYPE_MODE (TREE_TYPE (arg1)))) + if (optab_handler (movcc_optab, TYPE_MODE (TREE_TYPE (arg1))) + == CODE_FOR_nothing) continue; /* Both statements must be assignments whose RHS is a COMPONENT_REF. */
Re: [PATCH] Fix PR54240
Thanks, Andrew! Bill On Tue, 2012-08-14 at 14:17 -0700, Andrew Pinski wrote: On Tue, Aug 14, 2012 at 2:15 PM, Andrew Pinski pins...@gmail.com wrote: On Tue, Aug 14, 2012 at 2:11 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Replace the once vacuously true, and now vacuously false, test for existence of a conditional move instruction for a given mode, with one that actually checks what it's supposed to. Add a test case so we don't miss such things in future. The test is powerpc-specific. It would be good to have an i386 version of the test as well, if someone can help with that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Here is one which can go into gcc.target/mips : /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-phiopt-details" } */ Sorry the dg-options should be: /* { dg-options "-O2 -fdump-tree-phiopt-details isa>=4" } */ Thanks, Andrew typedef struct s { int v; int b; struct s *l; struct s *r; } S; int foo(S *s) { S *this; S *next; this = s; if (this->b) next = this->l; else next = this->r; return next->v; } /* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */ /* { dg-final { cleanup-tree-dump "phiopt1" } } */
[PATCH] Fix PR54245
Currently we can insert an initializer that performs a multiply in too small of a type for correctness. For now, detect the problem and avoid the optimization when this would happen. Eventually I will fix this up to cause the multiply to be performed in a sufficiently wide type. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54245 * gimple-ssa-strength-reduction.c (legal_cast_p_1): New function. (legal_cast_p): Split out logic to legal_cast_p_1. (analyze_increments): Avoid introducing multiplies in smaller types. gcc/testsuite: 2012-08-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/54245 * gcc.dg/tree-ssa/pr54245.c: New test. Index: gcc/testsuite/gcc.dg/tree-ssa/pr54245.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0) @@ -0,0 +1,49 @@ +/* { dg-do compile } */ +/* { dg-options "-O1 -fdump-tree-slsr-details" } */ + +#include <stdio.h> + +#define W1 22725 +#define W2 21407 +#define W3 19266 +#define W6 8867 + +void idct_row(short *row, int *dst) +{ +int a0, a1, b0, b1; + +a0 = W1 * row[0]; +a1 = a0; + +a0 += W2 * row[2]; +a1 += W6 * row[2]; + +b0 = W1 * row[1]; +b1 = W3 * row[1]; + +dst[0] = a0 + b0; +dst[1] = a0 - b0; +dst[2] = a1 + b1; +dst[3] = a1 - b1; +} + +static short block[8] = { 1, 2, 3, 4 }; + +int main(void) +{ +int out[4]; +int i; + +idct_row(block, out); + +for (i = 0; i < 4; i++) +printf("%d\n", out[i]); + +return !(out[2] == 87858 && out[3] == 10794); +} + +/* For now, disable inserting an initializer when the multiplication will + take place in a smaller type than originally. This test may be deleted + in future when this case is handled more precisely. 
*/ +/* { dg-final { scan-tree-dump-times "Inserting initializer" 0 "slsr" } } */ +/* { dg-final { cleanup-tree-dump "slsr" } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190305) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -1089,6 +1089,32 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed add_cand_for_stmt (gs, c); } +/* Help function for legal_cast_p, operating on two trees. Checks + whether it's allowable to cast from RHS to LHS. See legal_cast_p + for more details. */ + +static bool +legal_cast_p_1 (tree lhs, tree rhs) +{ + tree lhs_type, rhs_type; + unsigned lhs_size, rhs_size; + bool lhs_wraps, rhs_wraps; + + lhs_type = TREE_TYPE (lhs); + rhs_type = TREE_TYPE (rhs); + lhs_size = TYPE_PRECISION (lhs_type); + rhs_size = TYPE_PRECISION (rhs_type); + lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type); + rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type); + + if (lhs_size < rhs_size + || (rhs_wraps && !lhs_wraps) + || (rhs_wraps && lhs_wraps && rhs_size != lhs_size)) +return false; + + return true; +} + /* Return TRUE if GS is a statement that defines an SSA name from a conversion and is legal for us to combine with an add and multiply in the candidate table. 
For example, suppose we have: @@ -1129,28 +1155,11 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed static bool legal_cast_p (gimple gs, tree rhs) { - tree lhs, lhs_type, rhs_type; - unsigned lhs_size, rhs_size; - bool lhs_wraps, rhs_wraps; - if (!is_gimple_assign (gs) || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (gs))) return false; - lhs = gimple_assign_lhs (gs); - lhs_type = TREE_TYPE (lhs); - rhs_type = TREE_TYPE (rhs); - lhs_size = TYPE_PRECISION (lhs_type); - rhs_size = TYPE_PRECISION (rhs_type); - lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type); - rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type); - - if (lhs_size < rhs_size - || (rhs_wraps && !lhs_wraps) - || (rhs_wraps && lhs_wraps && rhs_size != lhs_size)) -return false; - - return true; + return legal_cast_p_1 (gimple_assign_lhs (gs), rhs); } /* Given GS which is a cast to a scalar integer type, determine whether @@ -1996,6 +2005,31 @@ analyze_increments (slsr_cand_t first_dep, enum ma != POINTER_PLUS_EXPR))) incr_vec[i].cost = COST_NEUTRAL; + /* FORNOW: If we need to add an initializer, give up if a cast from +the candidate's type to its stride's type can lose precision. +This could eventually be handled better by expressly retaining the +result of a cast to a wider type in the stride. Example: + + short int _1; + _2 = (int) _1; + _3 = _2 * 10; + _4 = x + _3; ADD: x + (10 * _1) : int + _5 = _2 * 15; + _6 = x + _5; ADD: x + (15 * _1) : int + + Right now replacing _6
Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides
On Wed, 2012-08-08 at 19:22 -0700, Janis Johnson wrote: On 08/08/2012 06:41 PM, William J. Schmidt wrote: On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote: On 08/08/2012 03:27 PM, Andrew Pinski wrote: On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu hjl.to...@gmail.com wrote: On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: +/* { dg-do compile } */ +/* { dg-options -O3 -fdump-tree-dom2 -fwrapv } */ +/* { dg-skip-if { ilp32 } { -m32 } { } } */ + This doesn't work on x32 nor Linux/ia32 since -m32 may not be needed for ILP32. This patch works for me. OK to install? This also does not work for mips64 where the options are either -mabi=32 or -mabi=n32 for ILP32. HJL's patch looks correct. Thanks, Andrew There are GCC targets with 16-bit integers. What's the actual set of targets on which this test is meant to run? There's a list of effective-target names based on data type sizes in http://gcc.gnu.org/onlinedocs/gccint/Effective_002dTarget-Keywords.html#Effective_002dTarget-Keywords. Yes, sorry. The test really is only valid when int and long have different sizes. So according to that link we should skip ilp32 and llp64 at a minimum. It isn't clear what we should do for int16 since the size of long isn't specified, so I suppose we should skip that as well. So, perhaps modify HJ's patch to have /* { dg-do compile { target { ! { ilp32 llp64 int16 } } } } */ ? Thanks, Bill That's confusing. Perhaps what you really need is a new effective target for sizeof(int) != sizeof(long). Good idea. I'll work up a patch when I get a moment. Thanks, Bill Janis
[PATCH] Fix PR54211
Fix a thinko in strength reduction. I was checking the type of the wrong operand to determine whether address arithmetic should be used in replacing expressions. This produced a spurious POINTER_PLUS_EXPR when an address was converted to an unsigned long and back again. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com PR middle-end/54211 * gimple-ssa-strength-reduction.c (analyze_candidates_and_replace): Use cand_type to determine whether pointer arithmetic will be generated. gcc/testsuite: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com PR middle-end/54211 * gcc.dg/tree-ssa/pr54211.c: New test. Index: gcc/testsuite/gcc.dg/tree-ssa/pr54211.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +int a, b; +unsigned char e; +void fn1 () +{ +unsigned char *c=0; +for (;; a++) +{ +unsigned char d = *(c + b); +for (; e < d; c++) +goto Found_Top; +} +Found_Top: +if (0) +goto Empty_Bitmap; +for (;; a++) +{ +unsigned char *e = c + b; +for (; c < e; c++) +goto Found_Bottom; +c -= b; +} +Found_Bottom: +Empty_Bitmap: +; +} Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190260) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -2534,7 +2534,7 @@ analyze_candidates_and_replace (void) /* Determine whether we'll be generating pointer arithmetic when replacing candidates. */ address_arithmetic_p = (c->kind == CAND_ADD - && POINTER_TYPE_P (TREE_TYPE (c->base_expr))); + && POINTER_TYPE_P (c->cand_type)); /* If all candidates have already been replaced under other interpretations, nothing remains to be done. */
[PATCH, testsuite] New effective target long_neq_int
As suggested by Janis regarding testsuite/gcc.dg/tree-ssa/slsr-30.c, this patch adds a new effective target for machines having long and int of differing sizes. Tested on powerpc64-unknown-linux-gnu, where the test passes for -m64 and is skipped for -m32. Ok for trunk? Thanks, Bill doc: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com * sourcebuild.texi: Document long_neq_int effective target. testsuite: 2012-08-09 Bill Schmidt wschm...@linux.vnet.ibm.com * lib/target-supports.exp (check_effective_target_long_neq_int): New. * gcc.dg/tree-ssa/slsr-30.c: Check for long_neq_int effective target. Index: gcc/doc/sourcebuild.texi === --- gcc/doc/sourcebuild.texi(revision 190260) +++ gcc/doc/sourcebuild.texi(working copy) @@ -1303,6 +1303,9 @@ Target has @code{int} that is at 32 bits or longer @item int16 Target has @code{int} that is 16 bits or shorter. +@item long_neq_int +Target has @code{int} and @code{long} with different sizes. + @item large_double Target supports @code{double} that is longer than @code{float}. Index: gcc/testsuite/lib/target-supports.exp === --- gcc/testsuite/lib/target-supports.exp (revision 190260) +++ gcc/testsuite/lib/target-supports.exp (working copy) @@ -1689,6 +1689,15 @@ proc check_effective_target_llp64 { } { }] } +# Return 1 if long and int have different sizes, +# 0 otherwise. + +proc check_effective_target_long_neq_int { } { +return [check_no_compiler_messages long_ne_int object { + int dummy[sizeof (int) != sizeof (long) ? 1 : -1]; +}] +} + # Return 1 if the target supports long double larger than double, # 0 otherwise. Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 190260) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (working copy) @@ -1,7 +1,7 @@ /* Verify straight-line strength reduction fails for simple integer addition with casts thrown in when -fwrapv is used. */ -/* { dg-do compile { target { ! 
{ ilp32 } } } } */ +/* { dg-do compile { target { long_neq_int } } } */ /* { dg-options "-O3 -fdump-tree-dom2 -fwrapv" } */ long
Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides
On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote: On 08/08/2012 03:27 PM, Andrew Pinski wrote: On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu hjl.to...@gmail.com wrote: On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Greetings, Thanks for the review of part 2! Here's another chunk of the SLSR code (I feel I owe you a few beers at this point). This performs analysis and replacement on groups of related candidates having an SSA name (rather than a constant) for a stride. This leaves only the conditional increment (CAND_PHI) case, which will be handled in the last patch of the series. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gimple-ssa-strength-reduction.c (struct incr_info_d): New struct. (incr_vec): New static var. (incr_vec_len): Likewise. (address_arithmetic_p): Likewise. (stmt_cost): Remove dead assignment. (dump_incr_vec): New function. (cand_abs_increment): Likewise. (lazy_create_slsr_reg): Likewise. (incr_vec_index): Likewise. (count_candidates): Likewise. (record_increment): Likewise. (record_increments): Likewise. (unreplaced_cand_in_tree): Likewise. (optimize_cands_for_speed_p): Likewise. (lowest_cost_path): Likewise. (total_savings): Likewise. (analyze_increments): Likewise. (ncd_for_two_cands): Likewise. (nearest_common_dominator_for_cands): Likewise. (profitable_increment_p): Likewise. (insert_initializers): Likewise. (introduce_cast_before_cand): Likewise. (replace_rhs_if_not_dup): Likewise. (replace_one_candidate): Likewise. (replace_profitable_candidates): Likewise. (analyze_candidates_and_replace): Handle candidates with SSA-name strides. gcc/testsuite: 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gcc.dg/tree-ssa/slsr-5.c: New. * gcc.dg/tree-ssa/slsr-6.c: New. * gcc.dg/tree-ssa/slsr-7.c: New. * gcc.dg/tree-ssa/slsr-8.c: New. * gcc.dg/tree-ssa/slsr-9.c: New. * gcc.dg/tree-ssa/slsr-10.c: New. 
* gcc.dg/tree-ssa/slsr-11.c: New. * gcc.dg/tree-ssa/slsr-12.c: New. * gcc.dg/tree-ssa/slsr-13.c: New. * gcc.dg/tree-ssa/slsr-14.c: New. * gcc.dg/tree-ssa/slsr-15.c: New. * gcc.dg/tree-ssa/slsr-16.c: New. * gcc.dg/tree-ssa/slsr-17.c: New. * gcc.dg/tree-ssa/slsr-18.c: New. * gcc.dg/tree-ssa/slsr-19.c: New. * gcc.dg/tree-ssa/slsr-20.c: New. * gcc.dg/tree-ssa/slsr-21.c: New. * gcc.dg/tree-ssa/slsr-22.c: New. * gcc.dg/tree-ssa/slsr-23.c: New. * gcc.dg/tree-ssa/slsr-24.c: New. * gcc.dg/tree-ssa/slsr-25.c: New. * gcc.dg/tree-ssa/slsr-26.c: New. * gcc.dg/tree-ssa/slsr-30.c: New. * gcc.dg/tree-ssa/slsr-31.c: New. == --- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0) @@ -0,0 +1,25 @@ +/* Verify straight-line strength reduction fails for simple integer addition + with casts thrown in when -fwrapv is used. */ + +/* { dg-do compile } */ +/* { dg-options -O3 -fdump-tree-dom2 -fwrapv } */ +/* { dg-skip-if { ilp32 } { -m32 } { } } */ + This doesn't work on x32 nor Linux/ia32 since -m32 may not be needed for ILP32. This patch works for me. OK to install? This also does not work for mips64 where the options are either -mabi=32 or -mabi=n32 for ILP32. HJL's patch looks correct. Thanks, Andrew There are GCC targets with 16-bit integers. What's the actual set of targets on which this test is meant to run? There's a list of effective-target names based on data type sizes in http://gcc.gnu.org/onlinedocs/gccint/Effective_002dTarget-Keywords.html#Effective_002dTarget-Keywords. Yes, sorry. The test really is only valid when int and long have different sizes. So according to that link we should skip ilp32 and llp64 at a minimum. It isn't clear what we should do for int16 since the size of long isn't specified, so I suppose we should skip that as well. So, perhaps modify HJ's patch to have /* { dg-do compile { target { ! { ilp32 llp64 int16 } } } } */ ? Thanks, Bill Janis Thanks. -- H.J. 
--- * gcc.dg/tree-ssa/slsr-30.c: Require non-ilp32. Remove dg-skip-if. diff --git a/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c b/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c index fbd6897..7921f43 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/slsr-30
[PATCH, committed] Fix PR53773
Change this test case to use the optimized dump so that the unreliable vect-details dump can't cause different behavior on different targets. Verified on powerpc64-unknown-linux-gnu, committed as obvious. Thanks, Bill 2012-08-03 Bill Schmidt wschm...@linux.ibm.com * testsuite/gcc.dg/vect/pr53773.c: Change to use optimized dump. Index: gcc/testsuite/gcc.dg/vect/pr53773.c === --- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 190018) +++ gcc/testsuite/gcc.dg/vect/pr53773.c (working copy) @@ -1,4 +1,5 @@ /* { dg-do compile } */ +/* { dg-options "-fdump-tree-optimized" } */ int foo (int integral, int decimal, int power_ten) @@ -13,7 +14,7 @@ foo (int integral, int decimal, int power_ten) return integral+decimal; } -/* Two occurrences in annotations, two in code. */ -/* { dg-final { scan-tree-dump-times "\\* 10" 4 "vect" } } */ +/* { dg-final { scan-tree-dump-times "\\* 10" 2 "optimized" } } */ /* { dg-final { cleanup-tree-dump "vect" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */
[PATCH, committed] Strength reduction clean-up (base name = base expr)
This cleans up terminology in strength reduction. What used to be a base SSA name is now sometimes other tree expressions, so the term base name is replaced by base expression throughout. Bootstrapped and tested with no new regressions on powerpc64-unknown-linux-gnu; committed as obvious. Thanks, Bill 2012-08-01 Bill Schmidt wschm...@linux.ibm.com * gimple-ssa-strength-reduction.c (struct slsr_cand_d): Change base_name to base_expr. (struct cand_chain_d): Likewise. (base_cand_hash): Likewise. (base_cand_eq): Likewise. (record_potential_basis): Likewise. (alloc_cand_and_find_basis): Likewise. (create_mul_ssa_cand): Likewise. (create_mul_imm_cand): Likewise. (create_add_ssa_cand): Likewise. (create_add_imm_cand): Likewise. (slsr_process_cast): Likewise. (slsr_process_copy): Likewise. (dump_candidate): Likewise. (base_cand_dump_callback): Likewise. (unconditional_cands_with_known_stride_p): Likewise. (cand_increment): Likewise. Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 190037) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -166,8 +166,8 @@ struct slsr_cand_d /* The candidate statement S1. */ gimple cand_stmt; - /* The base SSA name B. */ - tree base_name; + /* The base expression B: often an SSA name, but not always. */ + tree base_expr; /* The stride S. */ tree stride; @@ -175,7 +175,7 @@ struct slsr_cand_d /* The index constant i. */ double_int index; - /* The type of the candidate. This is normally the type of base_name, + /* The type of the candidate. This is normally the type of base_expr, but casts may have occurred when combining feeding instructions. A candidate can only be a basis for candidates of the same final type. 
(For CAND_REFs, this is the type to be used for operand 1 of the @@ -216,12 +216,13 @@ typedef struct slsr_cand_d slsr_cand, *slsr_cand_t typedef const struct slsr_cand_d *const_slsr_cand_t; /* Pointers to candidates are chained together as part of a mapping - from SSA names to the candidates that use them as a base name. */ + from base expressions to the candidates that use them. */ struct cand_chain_d { - /* SSA name that serves as a base name for the chain of candidates. */ - tree base_name; + /* Base expression for the chain of candidates: often, but not + always, an SSA name. */ + tree base_expr; /* Pointer to a candidate. */ slsr_cand_t cand; @@ -253,7 +254,7 @@ static struct pointer_map_t *stmt_cand_map; /* Obstack for candidates. */ static struct obstack cand_obstack; -/* Hash table embodying a mapping from base names to chains of candidates. */ +/* Hash table embodying a mapping from base exprs to chains of candidates. */ static htab_t base_cand_map; /* Obstack for candidate chains. */ @@ -272,7 +273,7 @@ lookup_cand (cand_idx idx) static hashval_t base_cand_hash (const void *p) { - tree base_expr = ((const_cand_chain_t) p)-base_name; + tree base_expr = ((const_cand_chain_t) p)-base_expr; return iterative_hash_expr (base_expr, 0); } @@ -291,10 +292,10 @@ base_cand_eq (const void *p1, const void *p2) { const_cand_chain_t const chain1 = (const_cand_chain_t) p1; const_cand_chain_t const chain2 = (const_cand_chain_t) p2; - return operand_equal_p (chain1-base_name, chain2-base_name, 0); + return operand_equal_p (chain1-base_expr, chain2-base_expr, 0); } -/* Use the base name from candidate C to look for possible candidates +/* Use the base expr from candidate C to look for possible candidates that can serve as a basis for C. Each potential basis must also appear in a block that dominates the candidate statement and have the same stride and type. 
If more than one possible basis exists, @@ -308,7 +309,7 @@ find_basis_for_candidate (slsr_cand_t c) cand_chain_t chain; slsr_cand_t basis = NULL; - mapping_key.base_name = c-base_name; + mapping_key.base_expr = c-base_expr; chain = (cand_chain_t) htab_find (base_cand_map, mapping_key); for (; chain; chain = chain-next) @@ -337,8 +338,8 @@ find_basis_for_candidate (slsr_cand_t c) return 0; } -/* Record a mapping from the base name of C to C itself, indicating that - C may potentially serve as a basis using that base name. */ +/* Record a mapping from the base expression of C to C itself, indicating that + C may potentially serve as a basis using that base expression. */ static void record_potential_basis (slsr_cand_t c) @@ -347,7 +348,7 @@ record_potential_basis (slsr_cand_t c) void **slot; node = (cand_chain_t) obstack_alloc (chain_obstack, sizeof (cand_chain)); - node-base_name = c-base_name; + node-base_expr = c-base_expr; node-cand = c; node-next = NULL; slot = htab_find_slot
[PATCH, rs6000] Vectorizer heuristic
Now that the vectorizer cost model is set up to facilitate per-target heuristics, I'm revisiting the density heuristic I submitted previously. This allows the vec_permute and vec_promote_demote costs to be set to their natural values, but inhibits vectorization in cases like sphinx3 where vectorizing a loop leads to issue stalls from overcommitted resources. Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions. Measured performance on cpu2000 and cpu2006 with no significant changes in performance. Ok for trunk? Thanks, Bill 2012-07-31 Bill Schmidt wschm...@linux.ibm.com * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise costs for vec_perm and vec_promote_demote down to more natural values. (struct _rs6000_cost_data): New data structure. (rs6000_density_test): New function. (rs6000_init_cost): Change to use rs6000_cost_data. (rs6000_add_stmt_cost): Likewise. (rs6000_finish_cost): Perform density test when vectorizing a loop. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 189845) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -60,6 +60,7 @@ #include params.h #include tm-constrs.h #include opts.h +#include tree-vectorizer.h #if TARGET_XCOFF #include xcoffout.h /* get declarations of xcoff_*_section_name */ #endif @@ -3378,13 +3379,13 @@ rs6000_builtin_vectorization_cost (enum vect_cost_ case vec_perm: if (TARGET_VSX) - return 4; + return 3; else return 1; case vec_promote_demote: if (TARGET_VSX) - return 5; + return 4; else return 1; @@ -3520,14 +3521,71 @@ rs6000_preferred_simd_mode (enum machine_mode mode return word_mode; } +typedef struct _rs6000_cost_data +{ + struct loop *loop_info; + unsigned cost[3]; +} rs6000_cost_data; + +/* Test for likely overcommitment of vector hardware resources. If a + loop iteration is relatively large, and too large a percentage of + instructions in the loop are vectorized, the cost model may not + adequately reflect delays from unavailable vector resources. 
+   Penalize the loop body cost for this case.  */
+
+static void
+rs6000_density_test (rs6000_cost_data *data)
+{
+  const int DENSITY_PCT_THRESHOLD = 85;
+  const int DENSITY_SIZE_THRESHOLD = 70;
+  const int DENSITY_PENALTY = 10;
+  struct loop *loop = data->loop_info;
+  basic_block *bbs = get_loop_body (loop);
+  int nbbs = loop->num_nodes;
+  int vec_cost = data->cost[vect_body], not_vec_cost = 0;
+  int i, density_pct;
+
+  for (i = 0; i < nbbs; i++)
+    {
+      basic_block bb = bbs[i];
+      gimple_stmt_iterator gsi;
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+          gimple stmt = gsi_stmt (gsi);
+          stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+          if (!STMT_VINFO_RELEVANT_P (stmt_info)
+              && !STMT_VINFO_IN_PATTERN_P (stmt_info))
+            not_vec_cost++;
+        }
+    }
+
+  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
+
+  if (density_pct > DENSITY_PCT_THRESHOLD
+      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+    {
+      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+      if (vect_print_dump_info (REPORT_DETAILS))
+        fprintf (vect_dump,
+                 "density %d%%, cost %d exceeds threshold, penalizing "
+                 "loop body cost by %d%%", density_pct,
+                 vec_cost + not_vec_cost, DENSITY_PENALTY);
+    }
+}
+
 /* Implement targetm.vectorize.init_cost.  */
 
 static void *
-rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
+rs6000_init_cost (struct loop *loop_info)
 {
-  unsigned *cost = XNEWVEC (unsigned, 3);
-  cost[vect_prologue] = cost[vect_body] = cost[vect_epilogue] = 0;
-  return cost;
+  rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
+  data->loop_info = loop_info;
+  data->cost[vect_prologue] = 0;
+  data->cost[vect_body] = 0;
+  data->cost[vect_epilogue] = 0;
+  return data;
 }
 
 /* Implement targetm.vectorize.add_stmt_cost.
   */
@@ -3537,7 +3595,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
                      struct _stmt_vec_info *stmt_info, int misalign,
                      enum vect_cost_model_location where)
 {
-  unsigned *cost = (unsigned *) data;
+  rs6000_cost_data *cost_data = (rs6000_cost_data *) data;
   unsigned retval = 0;
 
   if (flag_vect_cost_model)
@@ -3552,7 +3610,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
        count *= 50;  /* FIXME.  */
 
       retval = (unsigned) (count * stmt_cost);
-      cost[where] += retval;
+      cost_data->cost[where] += retval;
     }
 
   return retval;
@@ -3564,10 +3622,14 @@ static void
 rs6000_finish_cost (void *data, unsigned *prologue_cost,
                    unsigned *body_cost, unsigned *epilogue_cost)
 {
-  unsigned *cost =
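The density heuristic in the patch above boils down to a small arithmetic rule. Here is a hedged distillation of it as a standalone function — the constants come straight from the patch, but the function name and standalone form are illustrative, not GCC's actual interface:

```c
#include <assert.h>

/* Thresholds as in the rs6000 patch: penalize only loops that are both
   large (total cost above DENSITY_SIZE_THRESHOLD) and dominated by
   vectorized statements (above DENSITY_PCT_THRESHOLD percent).  */
enum { DENSITY_PCT_THRESHOLD = 85,
       DENSITY_SIZE_THRESHOLD = 70,
       DENSITY_PENALTY = 10 };

/* Return the (possibly penalized) vector body cost, given the cost of
   vectorized statements and the count of non-vectorized statements.  */
static unsigned
penalized_body_cost (unsigned vec_cost, unsigned not_vec_cost)
{
  unsigned density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);

  if (density_pct > DENSITY_PCT_THRESHOLD
      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
    return vec_cost * (100 + DENSITY_PENALTY) / 100;   /* +10% penalty */

  return vec_cost;
}
```

For example, a loop with vector cost 100 and only 10 scalar statements is 90% dense and large enough, so its body cost is inflated to 110; a balanced 50/50 loop is left alone.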
[PATCH] Fix PR53773
This fixes the de-canonicalization of commutative GIMPLE operations in the vectorizer that occurs when processing reductions. A loop_vec_info is flagged for cleanup when a de-canonicalization has occurred in that loop, and the cleanup is done when the loop_vec_info is destroyed. Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: 2012-07-30 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53773 * tree-vectorizer.h (struct _loop_vec_info): Add operands_swapped. (LOOP_VINFO_OPERANDS_SWAPPED): New macro. * tree-vect-loop.c (new_loop_vec_info): Initialize LOOP_VINFO_OPERANDS_SWAPPED field. (destroy_loop_vec_info): Restore canonical form. (vect_is_slp_reduction): Set LOOP_VINFO_OPERANDS_SWAPPED field. (vect_is_simple_reduction_1): Likewise. gcc/testsuite: 2012-07-30 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53773 * testsuite/gcc.dg/vect/pr53773.c: New test. Index: gcc/testsuite/gcc.dg/vect/pr53773.c === --- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0) +++ gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0) @@ -0,0 +1,19 @@ +/* { dg-do compile } */ + +int +foo (int integral, int decimal, int power_ten) +{ + while (power_ten 0) +{ + integral *= 10; + decimal *= 10; + power_ten--; +} + + return integral+decimal; +} + +/* Two occurrences in annotations, two in code. */ +/* { dg-final { scan-tree-dump-times \\* 10 4 vect } } */ +/* { dg-final { cleanup-tree-dump vect } } */ + Index: gcc/tree-vectorizer.h === --- gcc/tree-vectorizer.h (revision 189938) +++ gcc/tree-vectorizer.h (working copy) @@ -296,6 +296,12 @@ typedef struct _loop_vec_info { this. */ bool peeling_for_gaps; + /* Reductions are canonicalized so that the last operand is the reduction + operand. If this places a constant into RHS1, this decanonicalizes + GIMPLE for other phases, so we must track when this has occurred and + fix it up. */ + bool operands_swapped; + } *loop_vec_info; /* Access Functions. 
   */
@@ -326,6 +332,7 @@ typedef struct _loop_vec_info {
 #define LOOP_VINFO_PEELING_HTAB(L)      (L)->peeling_htab
 #define LOOP_VINFO_TARGET_COST_DATA(L)  (L)->target_cost_data
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)  (L)->peeling_for_gaps
+#define LOOP_VINFO_OPERANDS_SWAPPED(L)  (L)->operands_swapped
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
   VEC_length (gimple, (L)->may_misalign_stmts) > 0
 
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	(revision 189938)
+++ gcc/tree-vect-loop.c	(working copy)
@@ -853,6 +853,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_HTAB (res) = NULL;
   LOOP_VINFO_TARGET_COST_DATA (res) = init_cost (loop);
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
+  LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
 
   return res;
 }
@@ -873,6 +874,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   int j;
   VEC (slp_instance, heap) *slp_instances;
   slp_instance instance;
+  bool swapped;
 
   if (!loop_vinfo)
     return;
@@ -881,6 +883,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   bbs = LOOP_VINFO_BBS (loop_vinfo);
   nbbs = loop->num_nodes;
+  swapped = LOOP_VINFO_OPERANDS_SWAPPED (loop_vinfo);
 
   if (!clean_stmts)
     {
@@ -905,6 +908,22 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
       for (si = gsi_start_bb (bb); !gsi_end_p (si); )
         {
          gimple stmt = gsi_stmt (si);
+
+          /* We may have broken canonical form by moving a constant
+             into RHS1 of a commutative op.  Fix such occurrences.  */
+          if (swapped && is_gimple_assign (stmt))
+            {
+              enum tree_code code = gimple_assign_rhs_code (stmt);
+
+              if ((code == PLUS_EXPR
+                   || code == POINTER_PLUS_EXPR
+                   || code == MULT_EXPR)
+                  && CONSTANT_CLASS_P (gimple_assign_rhs1 (stmt)))
+                swap_tree_operands (stmt,
+                                    gimple_assign_rhs1_ptr (stmt),
+                                    gimple_assign_rhs2_ptr (stmt));
+            }
+
          /* Free stmt_vec_info.
             */
          free_stmt_vec_info (stmt);
          gsi_next (&si);
@@ -1920,6 +1939,9 @@ vect_is_slp_reduction (loop_vec_info loop_info, gi
                             gimple_assign_rhs1_ptr (next_stmt),
                             gimple_assign_rhs2_ptr (next_stmt));
          update_stmt (next_stmt);
+
+          if (CONSTANT_CLASS_P (gimple_assign_rhs1 (next_stmt)))
+            LOOP_VINFO_OPERANDS_SWAPPED (loop_info) = true;
        }
       else
        return false;
@@ -2324,6 +2346,9 @@ vect_is_simple_reduction_1
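The fix-up above rests on GIMPLE's canonical form for commutative operations: a constant operand always goes in the second slot. The following is an illustrative model of the swap-back step with invented types — real GIMPLE statements are not flat structs like this:

```c
#include <assert.h>

/* Stand-ins for the relevant tree codes.  */
enum tree_code_sketch { PLUS_EXPR, MULT_EXPR, MINUS_EXPR };

struct binop
{
  enum tree_code_sketch code;
  long op1, op2;
  int op1_is_const;   /* stand-in for CONSTANT_CLASS_P (rhs1) */
};

/* If a constant was swapped into the first operand of a commutative
   operation, swap it back so canonical form is restored.  Non-commutative
   codes (e.g. MINUS_EXPR) are left untouched.  */
static struct binop
restore_canonical_form (struct binop b)
{
  if ((b.code == PLUS_EXPR || b.code == MULT_EXPR) && b.op1_is_const)
    {
      long tmp = b.op1;
      b.op1 = b.op2;
      b.op2 = tmp;
      b.op1_is_const = 0;   /* the constant now sits in op2 */
    }
  return b;
}
```

So `10 + x` becomes `x + 10` again, while `10 - x` is left alone because subtraction is not commutative and never had its operands swapped by the reduction code.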
Re: [patch] Add explanations to sbitmap, bitmap, and sparseset
On Fri, 2012-07-27 at 15:40 +0200, Richard Guenther wrote: On Thu, Jul 26, 2012 at 11:57 AM, Steven Bosscher stevenb@gmail.com wrote: On Thu, Jul 26, 2012 at 11:23 AM, Richard Guenther richard.guent...@gmail.com wrote: Ok! Thanks for adding this exhaustive documentation. There's more to come! I want to add some explanations to ebitmap, pointer-set, fibheap, and splay-tree as sets, and add a chapter in the gccint manual too. Now if only you'd document those loop changes... ;-) Eh ... Btw, ebitmap is unused since it was added - maybe we should simply remove it ...? I wouldn't remove it just yet. I'm going to make sure that bitmap.[ch] and ebitmap.[ch] provide the same interface and see if there are places where ebitmap is a better choice than bitmap or sbitmap (cprop and gcse.c come to mind). Btw, just looking over sparseset.h what needs to be documented is that iterating over the set is faster than for an sbitmap but element ordering is random! Also it looks less efficient than sbitmap in the case when your main operation is adding to the set and querying the set randomly. It's space overhead is really huge - for smaller universes a smaller SPARSESET_ELT_TYPE would be nice, templates to the rescue! I wonder in which cases a unsigned HOST_WIDEST_FAST_INT sized universe is even useful (but a short instead of an int is probably too small ...) Another option for sparse sets would be a templatized version of Pugh's skip lists. Iteration is the same as a linked list and random access is logarithmic in the size of the set (not the universe). Space overhead is also logarithmic. The potential downside is that it involves pointers. Bill Richard. Ciao! Steven
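The sparseset properties discussed above (O(1) add and membership test, fast iteration, unordered traversal, large space overhead proportional to the universe) follow directly from the classic two-array sparse-set scheme. A minimal sketch of that scheme — names invented, not GCC's sparseset.h API — makes the trade-offs concrete:

```c
#include <assert.h>
#include <stdlib.h>

/* A minimal Briggs/Torczon-style sparse set.  Two arrays sized to the
   universe give O(1) add/test and O(n) iteration over the dense array,
   but iteration order is insertion order, not element order -- the
   "random ordering" caveat noted in the discussion.  */
typedef struct
{
  unsigned *dense;    /* the members, packed at the front */
  unsigned *sparse;   /* element -> index into dense */
  unsigned n;         /* current cardinality */
} sparse_set;

static sparse_set *
sparse_set_alloc (unsigned universe)
{
  sparse_set *s = malloc (sizeof *s);
  s->dense = malloc (universe * sizeof *s->dense);
  /* calloc so membership tests never read indeterminate memory.  */
  s->sparse = calloc (universe, sizeof *s->sparse);
  s->n = 0;
  return s;
}

/* E is a member iff its sparse slot points into the live prefix of
   dense, and that dense slot points back at E.  */
static int
sparse_set_member_p (const sparse_set *s, unsigned e)
{
  return s->sparse[e] < s->n && s->dense[s->sparse[e]] == e;
}

static void
sparse_set_add (sparse_set *s, unsigned e)
{
  if (!sparse_set_member_p (s, e))
    {
      s->sparse[e] = s->n;
      s->dense[s->n++] = e;
    }
}
```

Note the space cost: two arrays of the universe size regardless of how few elements are present, which is exactly the overhead complaint above, and why a smaller element type (or a skip list, whose space is proportional to the set) can be attractive for small universes.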
[PATCH] Change IVOPTS and strength reduction to use expmed cost model
Per Richard Henderson's suggestion (http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch changes the IVOPTS and straight-line strength reduction passes to make use of data computed by init_expmed. This required adding a new convert_cost array in expmed to store the costs of converting between various scalar integer modes, and exposing expmed's multiplication hash table for external use (new function mult_by_coeff_cost). Richard H, I'd appreciate it if you could look at what I did there and make sure it's correct. Thanks! I decided it wasn't worth distinguishing between reg-reg add costs and reg-constant add costs, so I simplified the strength reduction calculations rather than adding another array to expmed for this purpose. But I can make this distinction if that's preferable. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-25 Bill Schmidt wschm...@linux.ibm.com * tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove. (mbc_entry_eq): Likewise. (mult_costs): Likewise. (cost_tables_exist): Likewise. (initialize_costs): Likewise. (finalize_costs): Likewise. (tree_ssa_iv_optimize_init): Remove call to initialize_costs. (add_regs_cost): Remove. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (struct mbc_entry): Likewise. (multiply_by_const_cost): Likewise. (get_address_cost): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (force_expr_to_var_cost): Likewise. (difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost. (get_computation_cost_at): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (determine_iv_cost): Change add_regs_cost calls to add_cost lookups. (tree_ssa_iv_optimize_finalize): Remove call to finalize_costs. * tree-ssa-address.c (expmed.h): New #include. 
(most_expensive_mult_to_index): Change multiply_by_const_cost to mult_by_coeff_cost. * gimple-ssa-strength-reduction.c (expmed.h): New #include. (stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost, neg_cost, and convert_cost instead of IVOPTS interfaces. (execute_strength_reduction): Remove calls to initialize_costs and finalize_costs. * expmed.c (struct init_expmed_rtl): Add convert rtx_def. (init_expmed_one_mode): Initialize convert rtx_def; initialize convert_cost for related modes. (mult_by_coeff_cost): New function. * expmed.h (struct target_expmed): Add x_convert_cost matrix. (convert_cost): New #define. (mult_by_coeff_cost): New extern decl. * tree-flow.h (initialize_costs): Remove decl. (finalize_costs): Likewise. (multiply_by_const_cost): Likewise. (add_regs_cost): Likewise. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. Index: gcc/tree-ssa-loop-ivopts.c === --- gcc/tree-ssa-loop-ivopts.c (revision 189845) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -88,9 +88,6 @@ along with GCC; see the file COPYING3. If not see #include tree-ssa-propagate.h #include expmed.h -static hashval_t mbc_entry_hash (const void *); -static int mbc_entry_eq (const void*, const void *); - /* FIXME: Expressions are expanded to RTL in this pass to determine the cost of different addressing modes. This should be moved to a TBD interface between the GIMPLE and RTL worlds. */ @@ -381,11 +378,6 @@ struct iv_ca_delta static VEC(tree,heap) *decl_rtl_to_reset; -/* Cached costs for multiplies by constants, and a flag to indicate - when they're valid. */ -static htab_t mult_costs[2]; -static bool cost_tables_exist = false; - static comp_cost force_expr_to_var_cost (tree, bool); /* Number of uses recorded in DATA. */ @@ -851,26 +843,6 @@ htab_inv_expr_hash (const void *ent) return expr-hash; } -/* Allocate data structures for the cost model. 
*/ - -void -initialize_costs (void) -{ - mult_costs[0] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free); - mult_costs[1] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free); - cost_tables_exist = true; -} - -/* Release data structures for the cost model. */ - -void -finalize_costs (void) -{ - cost_tables_exist = false; - htab_delete (mult_costs[0]); - htab_delete (mult_costs[1]); -} - /* Initializes data structures used by the iv optimization pass, stored in DATA. */ @@ -889,8 +861,6 @@ tree_ssa_iv_optimize_init (struct ivopts_data *dat
Re: [PATCH] Change IVOPTS and strength reduction to use expmed cost model
On Wed, 2012-07-25 at 09:59 -0700, Richard Henderson wrote: On 07/25/2012 09:13 AM, William J. Schmidt wrote: Per Richard Henderson's suggestion (http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch changes the IVOPTS and straight-line strength reduction passes to make use of data computed by init_expmed. This required adding a new convert_cost array in expmed to store the costs of converting between various scalar integer modes, and exposing expmed's multiplication hash table for external use (new function mult_by_coeff_cost). Richard H, I'd appreciate it if you could look at what I did there and make sure it's correct. Thanks! Correctness looks good. I decided it wasn't worth distinguishing between reg-reg add costs and reg-constant add costs, so I simplified the strength reduction calculations rather than adding another array to expmed for this purpose. But I can make this distinction if that's preferable. I don't think this is worth thinking about at this level. This is something that some rtl-level optimization ought to be able to fix up trivially, e.g. cse. Index: gcc/expmed.h === --- gcc/expmed.h(revision 189845) +++ gcc/expmed.h(working copy) @@ -155,6 +155,11 @@ struct target_expmed { int x_udiv_cost[2][NUM_MACHINE_MODES]; int x_mul_widen_cost[2][NUM_MACHINE_MODES]; int x_mul_highpart_cost[2][NUM_MACHINE_MODES]; + + /* Conversion costs are only defined between two scalar integer modes + of different sizes. The first machine mode is the destination mode, + and the second is the source mode. */ + int x_convert_cost[2][NUM_MACHINE_MODES][NUM_MACHINE_MODES]; }; 2 * NUM_MACHINE_MODES is quite large... I think we could do better with #define NUM_MODE_INT (MAX_MODE_INT - MIN_MODE_INT + 1) x_convert_cost[2][NUM_MODE_INT][NUM_MODE_INT]; though really that could be done with all of these fields all at once. That does suggest it would be better to leave at least inline functions to access these elements, rather than open code the array access. 
r~ Thanks for the quick review! Excellent point about the array size. The attached revised patch follows your suggestion to limit the size. I only did this for the new field, as changing all the existing accessors to inline functions is more effort than I have time for right now. This is left as an exercise for the reader. ;) Bootstrapped and tested on powepc64-unknown-linux-gnu with no new failures. Is this ok? Thanks, Bill 2012-07-25 Bill Schmidt wschm...@linux.ibm.com * tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove. (mbc_entry_eq): Likewise. (mult_costs): Likewise. (cost_tables_exist): Likewise. (initialize_costs): Likewise. (finalize_costs): Likewise. (tree_ssa_iv_optimize_init): Remove call to initialize_costs. (add_regs_cost): Remove. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (struct mbc_entry): Likewise. (multiply_by_const_cost): Likewise. (get_address_cost): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (force_expr_to_var_cost): Likewise. (difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost. (get_computation_cost_at): Change add_regs_cost calls to add_cost lookups; change multiply_by_const_cost to mult_by_coeff_cost. (determine_iv_cost): Change add_regs_cost calls to add_cost lookups. (tree_ssa_iv_optimize_finalize): Remove call to finalize_costs. * tree-ssa-address.c (expmed.h): New #include. (most_expensive_mult_to_index): Change multiply_by_const_cost to mult_by_coeff_cost. * gimple-ssa-strength-reduction.c (expmed.h): New #include. (stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost, neg_cost, and convert_cost instead of IVOPTS interfaces. (execute_strength_reduction): Remove calls to initialize_costs and finalize_costs. * expmed.c (struct init_expmed_rtl): Add convert rtx_def. (init_expmed_one_mode): Initialize convert rtx_def; initialize x_convert_cost for related modes. 
(mult_by_coeff_cost): New function. * expmed.h (NUM_MODE_INT): New #define. (struct target_expmed): Add x_convert_cost matrix. (set_convert_cost): New inline function. (convert_cost): Likewise. (mult_by_coeff_cost): New extern decl. * tree-flow.h (initialize_costs): Remove decl. (finalize_costs): Likewise. (multiply_by_const_cost): Likewise. (add_regs_cost): Likewise. (multiply_regs_cost): Likewise. (add_const_cost): Likewise
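The revised patch's space fix can be illustrated compactly. Conversion costs only exist between scalar integer modes, so the table is indexed by an offset from MIN_MODE_INT rather than by the full machine-mode space, and inline accessors hide that offset as Richard Henderson suggested. The mode enum below is a stand-in, not GCC's machmode.def:

```c
#include <assert.h>

/* Illustrative scalar integer modes; only the MIN/MAX bounds matter.  */
enum mode_sketch { QImode, HImode, SImode, DImode,
                   MIN_MODE_INT = QImode, MAX_MODE_INT = DImode };

#define NUM_MODE_INT (MAX_MODE_INT - MIN_MODE_INT + 1)

/* [speed][to][from]: NUM_MODE_INT^2 entries per speed setting instead
   of NUM_MACHINE_MODES^2 -- the size reduction from the review.  */
static int x_convert_cost[2][NUM_MODE_INT][NUM_MODE_INT];

/* Inline accessors keep the offset-by-MIN_MODE_INT indexing in one
   place rather than open-coding the array access at every use.  */
static inline void
set_convert_cost (enum mode_sketch to, enum mode_sketch from,
                  int speed, int cost)
{
  x_convert_cost[speed][to - MIN_MODE_INT][from - MIN_MODE_INT] = cost;
}

static inline int
convert_cost (enum mode_sketch to, enum mode_sketch from, int speed)
{
  return x_convert_cost[speed][to - MIN_MODE_INT][from - MIN_MODE_INT];
}
```

The asymmetry matters: the cost of DImode-from-SImode is stored separately from SImode-from-DImode, matching the comment in the patch that the first mode is the destination and the second the source.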
Re: [PING] Re: [RFC, ivopts] fix bugs in ivopts address cost computation
On Wed, 2012-07-25 at 13:39 -0600, Sandra Loosemore wrote: On 07/17/2012 05:22 AM, Richard Guenther wrote: On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore san...@codesourcery.com wrote: Ping? Original post with patch is here: http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html Can you update the patch and numbers based on what Bill did for straight-line strength reduction which re-uses this analysis/caching part? I will try to take another look at this once Bill has finished his work that touches on this; it's been hard for me to track a moving target. I was wondering if it might be more consistent with Bill's work to defer some of the address cost computation to new target hooks, after all. -Sandra Hi Sandra, I apologize for the mess. I should be done causing distress to this part of the code as soon as the patch I submitted today is committed. Sorry! Bill
Re: [PATCH] Vectorizer cost model outside-cost changes
On Tue, 2012-07-24 at 10:57 +0200, Richard Guenther wrote: On Mon, 23 Jul 2012, William J. Schmidt wrote: This patch completes the conversion of the vectorizer cost model to use target hooks for recording vectorization information and calculating costs. Previous work handled the costs inside the loop body or basic block being vectorized. This patch similarly converts the prologue and epilogue costs. As before, I first verified that the new model provides the same results as the old model on the regression testsuite and on SPEC CPU2006. I then removed the old model, rather than submitting an intermediate patch with both present. I have a patch that shows both if it's needed for reference. Also as before, I found an error in the old cost model wherein prologue costs of phi reduction statements were not being considered during the final vectorization decision. I have fixed this in the new model; thus, this version of the cost model will be slightly more conservative than the original. I am currently running SPEC tests to ensure there aren't any resulting degradations. One thing that could be done in future for further cleanup would be to handle the scalar iteration cost in a similar manner. Right now this is dealt with by recording N scalar_stmts, where N is the length of the scalar iteration; as with the old model, there is no attempt to differentiate between different scalar statements. This results in some hackish stuff in, e.g., tree-vect-stmts.c:record_stmt_cost (), where we have to deal with the fact that we may not have a stmt_info for the statement being recorded. This is only true for these aggregated scalar_stmt costs. Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Assuming the SPEC performance tests come out ok, is this ok for trunk? So all costs we query from the backend even for the prologue/epilogue are costs for vector stmts (like inits of invariant vectors or outer-loop parts in outer loop vectorization)? 
Yes, with the exception of copies of scalar iterations introduced by loop peeling (the N * scalar_stmt business). There are comments in several places indicating opportunities for improvement in the modeling, including for the outer-loop case, but for now your statement holds otherwise. Thanks, Bill Ok in that case. Thanks, Richard. Thanks! Bill
Ping: [PATCH] Fix PR46556 (straight-line strength reduction, part 2)
Ping... On Thu, 2012-06-28 at 16:45 -0500, William J. Schmidt wrote: Here's a relatively small piece of strength reduction that solves that pesky addressing bug that got me looking at this in the first place... The main part of the code is the stuff that was reviewed last year, but which needed to find a good home. So hopefully that's in pretty good shape. I recast base_cand_map as an htab again since I now need to look up trees other than SSA names. I plan to put together a follow-up patch to change code and commentary references so that base_name becomes base_expr. Doing that now would clutter up the patch too much. Bootstrapped and tested on powerpc64-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: PR tree-optimization/46556 * gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF. (base_cand_map): Change to hash table. (base_cand_hash): New function. (base_cand_free): Likewise. (base_cand_eq): Likewise. (lookup_cand): Change base_cand_map to hash table. (find_basis_for_candidate): Likewise. (base_cand_from_table): Exclude CAND_REF. (restructure_reference): New function. (slsr_process_ref): Likewise. (find_candidates_in_block): Call slsr_process_ref. (dump_candidate): Handle CAND_REF. (base_cand_dump_callback): New function. (dump_cand_chains): Change base_cand_map to hash table. (replace_ref): New function. (replace_refs): Likewise. (analyze_candidates_and_replace): Call replace_refs. (execute_strength_reduction): Change base_cand_map to hash table. gcc/testsuite: PR tree-optimization/46556 * testsuite/gcc.dg/tree-ssa/slsr-27.c: New. * testsuite/gcc.dg/tree-ssa/slsr-28.c: New. * testsuite/gcc.dg/tree-ssa/slsr-29.c: New. 
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); +} + +/* { dg-final { scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 3 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 12) +foo (p-a[n], p-c[n], p-b[n]); + else if (n 3) +foo (p-b[n], p-a[n], p-c[n]); +} + +/* { dg-final { scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 9 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom2 } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 3) +{ + foo (p-a[n], p-c[n], p-b[n]); + if (n 12) + foo (p-b[n], p-a[n], p-c[n]); +} +} + +/* { dg-final { 
scan-tree-dump-times \\* 4; 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times p_\\d\+\\(D\\) \\+ D 1 dom2 } } */ +/* { dg-final { scan-tree-dump-times MEM\\\[\\(struct x \\*\\)D 9 dom2 } } */ +/* { dg-final { cleanup-tree-dump dom2 } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189025) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -32,7 +32,7 @@ along with GCC; see the file COPYING3. If not see 2) Explicit multiplies, unknown constant multipliers, no conditional increments. (data gathering complete, replacements pending) - 3) Implicit multiplies in addressing expressions. (pending) + 3
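The test cases above all check that exactly one multiply survives. The reason is visible in byte-offset arithmetic: for the struct in the tests, the accesses `p->a[n]`, `p->b[n]`, and `p->c[n]` share the variable part `n * sizeof(int)` and differ only by a constant field offset, so one CAND_REF candidate can serve as the basis for the others. A small sketch (illustrative helper, not pass internals):

```c
#include <assert.h>

/* Field offsets for
     struct x { int a[16]; int b[16]; int c[16]; };
   expressed as integer constants.  */
enum { A_OFF = 0,
       B_OFF = 16 * sizeof (int),
       C_OFF = 32 * sizeof (int) };

/* Byte offset of element n of a field: constant part + shared
   variable part.  Only the variable part needs a multiply, and it is
   identical for all three fields.  */
static unsigned long
elt_offset (unsigned long field_off, unsigned long n)
{
  return field_off + n * sizeof (int);
}
```

Because the difference between any two of these offsets is a compile-time constant, the restructured references can reuse a single computed address plus constant displacements, which is what the `MEM[(struct x *)...]` patterns in the scans verify.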
Re: [PATCH] Add flag to control straight-line strength reduction
On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote: On Wed, 18 Jul 2012, Steven Bosscher wrote: On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther rguent...@suse.de wrote: On Tue, 17 Jul 2012, William J. Schmidt wrote: I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? The switch needs documentation in doc/invoke.texi. Other than that it's fine to stick with -ftree-..., even that exposes details to our users that are not necessary (RTL passes didn't have -frtl-... either). So in the end, why not re-use -fstrength-reduce that is already available (but stubbed out)? In the past, -fstrength-reduce applied to loop strength reduction in loop.c. I don't think it should be re-used for a completely different code transformation. Ok. I suppose -ftree-slsr is ok then. It turns out I was looking at a very old copy of the manual, and the -ftree... stuff is not as prevalent now as it once was. I'll just go with -fslsr to be consistent with -fgcse, -fipa-sra, etc. Thanks for the pointer to doc/invoke.texi -- it appears I also failed to document -fhoist-adjacent-loads, so I will go ahead and do that as well. Thanks! Bill Thanks, Richard.
Re: [PATCH] Add flag to control straight-line strength reduction
On Wed, 2012-07-18 at 08:24 -0500, William J. Schmidt wrote: On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote: On Wed, 18 Jul 2012, Steven Bosscher wrote: On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther rguent...@suse.de wrote: On Tue, 17 Jul 2012, William J. Schmidt wrote: I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? The switch needs documentation in doc/invoke.texi. Other than that it's fine to stick with -ftree-..., even that exposes details to our users that are not necessary (RTL passes didn't have -frtl-... either). So in the end, why not re-use -fstrength-reduce that is already available (but stubbed out)? In the past, -fstrength-reduce applied to loop strength reduction in loop.c. I don't think it should be re-used for a completely different code transformation. Ok. I suppose -ftree-slsr is ok then. It turns out I was looking at a very old copy of the manual, and the -ftree... stuff is not as prevalent now as it once was. I'll just go with -fslsr to be consistent with -fgcse, -fipa-sra, etc. Well, posted too fast. Paging down I see that isn't true, sorry. I'll use the tree- for consistency even though it is useless information. Thanks, Bill Thanks for the pointer to doc/invoke.texi -- it appears I also failed to document -fhoist-adjacent-loads, so I will go ahead and do that as well. Thanks! Bill Thanks, Richard.
Re: [PATCH] Add flag to control straight-line strength reduction
Here's the patch with documentation changes included. I also cleaned up missing work from a couple of my previous patches, so -fhoist-adjacent-loads is documented now, and -fvect-cost-model is added to the list of options on by default at -O3. Ok for trunk? Thanks, Bill 2012-07-18 Bill Schmidt wschm...@linux.ibm.com * doc/invoke.texi: Add -fhoist-adjacent-loads and -ftree-slsr to list of flags controlling optimization; add -ftree-slsr to list of flags enabled by default at -O; add -fhoist-adjacent-loads to list of flags enabled by default at -O2; add -fvect-cost-model to list of flags enabled by default at -O3; document -fhoist-adjacent-loads and -ftree-slsr. * opts.c (default_option): Make -ftree-slsr default at -O1 and above. * gimple-ssa-strength-reduction.c (gate_strength_reduction): Use flag_tree_slsr. * common.opt: Add -ftree-slsr with flag_tree_slsr. Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 189574) +++ gcc/doc/invoke.texi (working copy) @@ -364,7 +364,8 @@ Objective-C and Objective-C++ Dialects}. -ffast-math -ffinite-math-only -ffloat-store -fexcess-precision=@var{style} @gol -fforward-propagate -ffp-contract=@var{style} -ffunction-sections @gol -fgcse -fgcse-after-reload -fgcse-las -fgcse-lm -fgraphite-identity @gol --fgcse-sm -fif-conversion -fif-conversion2 -findirect-inlining @gol +-fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol +-fif-conversion2 -findirect-inlining @gol -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-matrix-reorg @gol -fipa-pta -fipa-profile -fipa-pure-const -fipa-reference @gol @@ -413,8 +414,8 @@ Objective-C and Objective-C++ Dialects}. 
-ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol --ftree-reassoc @gol --ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol +-ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol +-ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol -funit-at-a-time -funroll-all-loops -funroll-loops @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol @@ -6259,6 +6260,7 @@ compilation time. -ftree-forwprop @gol -ftree-fre @gol -ftree-phiprop @gol +-ftree-slsr @gol -ftree-sra @gol -ftree-pta @gol -ftree-ter @gol @@ -6286,6 +6288,7 @@ also turns on the following optimization flags: -fdevirtualize @gol -fexpensive-optimizations @gol -fgcse -fgcse-lm @gol +-fhoist-adjacent-loads @gol -finline-small-functions @gol -findirect-inlining @gol -fipa-sra @gol @@ -6311,6 +6314,7 @@ Optimize yet more. @option{-O3} turns on all opti by @option{-O2} and also turns on the @option{-finline-functions}, @option{-funswitch-loops}, @option{-fpredictive-commoning}, @option{-fgcse-after-reload}, @option{-ftree-vectorize}, +@option{-fvect-cost-model}, @option{-ftree-partial-pre} and @option{-fipa-cp-clone} options. @item -O0 @@ -7129,6 +7133,13 @@ This flag is enabled by default at @option{-O} and Perform hoisting of loads from conditional pointers on trees. This pass is enabled by default at @option{-O} and higher. +@item -fhoist-adjacent-loads +@opindex hoist-adjacent-loads +Speculatively hoist loads from both branches of an if-then-else if the +loads are from adjacent locations in the same structure and the target +architecture has a conditional move instruction. This flag is enabled +by default at @option{-O2} and higher. + @item -ftree-copy-prop @opindex ftree-copy-prop Perform copy propagation on trees. 
This pass eliminates unnecessary @@ -7529,6 +7540,13 @@ defining expression. This results in non-GIMPLE c much more complex trees to work on resulting in better RTL generation. This is enabled by default at @option{-O} and higher. +@item -ftree-slsr +@opindex ftree-slsr +Perform straight-line strength reduction on trees. This recognizes related +expressions involving multiplications and replaces them by less expensive +calculations when possible. This is enabled by default at @option{-O} and +higher. + @item -ftree-vectorize @opindex ftree-vectorize Perform loop vectorization on trees. This flag is enabled by default at @@ -7550,7 +7568,8 @@ except at level @option{-Os} where it is disabled. @item -fvect-cost-model @opindex fvect-cost-model -Enable cost model for vectorization. +Enable cost model for vectorization. This option is enabled by default at +@option{-O3}. @item -ftree-vrp @opindex ftree-vrp Index: gcc/opts.c === --- gcc/opts.c (revision 189574) +++ gcc/opts.c (working copy) @@ -452,6 +452,7 @@
[PATCH] Add flag to control straight-line strength reduction
I overlooked adding a pass-control flag for strength reduction, added here. I named it -ftree-slsr for consistency with other -ftree- flags, but could change it to -fgimple-slsr if you prefer that for a pass named gimple-ssa-... Bootstrapped and tested on powerpc-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-17 Bill Schmidt wschm...@linux.ibm.com * opts.c (default_option): Make -ftree-slsr default at -O1 and above. * gimple-ssa-strength-reduction.c (gate_strength_reduction): Use flag_tree_slsr. * common.opt: Add -ftree-slsr with flag_tree_slsr. Index: gcc/opts.c === --- gcc/opts.c (revision 189574) +++ gcc/opts.c (working copy) @@ -452,6 +452,7 @@ static const struct default_options default_option { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 }, { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 }, { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 }, +{ OPT_LEVELS_1_PLUS, OPT_ftree_slsr, NULL, 1 }, /* -O2 optimizations. */ { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 }, Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189574) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -1501,7 +1501,7 @@ execute_strength_reduction (void) static bool gate_strength_reduction (void) { - return optimize > 0; + return flag_tree_slsr; } struct gimple_opt_pass pass_strength_reduction = Index: gcc/common.opt === --- gcc/common.opt (revision 189574) +++ gcc/common.opt (working copy) @@ -2080,6 +2080,10 @@ ftree-sink Common Report Var(flag_tree_sink) Optimization Enable SSA code sinking on trees +ftree-slsr +Common Report Var(flag_tree_slsr) Optimization +Perform straight-line strength reduction + ftree-sra Common Report Var(flag_tree_sra) Optimization Perform scalar replacement of aggregates
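The transformation the new flag gates can be sketched at the source level. The example below is illustrative C, not GCC code: it only models how SLSR treats a chain of multiplies that share a basis, rewriting the later ones as additions to the first.

```c
#include <assert.h>

/* Illustrative sketch (not GCC code) of what -ftree-slsr does: the
   multiplies below share the basis i * 4, so strength reduction can
   rewrite the later ones as additions to the first.  */

/* Before: three related multiplies with a common stride.  */
int addr_before (int i)
{
  int a = i * 4;
  int b = (i + 1) * 4;
  int c = (i + 2) * 4;
  return a + b + c;
}

/* After: the form SLSR aims for, one multiply plus cheap additions.  */
int addr_after (int i)
{
  int a = i * 4;
  int b = a + 4;
  int c = a + 8;
  return a + b + c;
}
```

Both forms compute the same values; the second avoids two multiplies, which is the cost the pass is weighing.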
[PATCH] Enable vectorizer cost model by default at -O3
The auto-vectorizer is overly aggressive when not constrained by the vectorizer cost model. Although the cost model is by no means perfect, it does a reasonable job of avoiding many poor vectorization decisions. Since the auto-vectorizer is enabled by default at -O3 and above, we should also enable the vectorizer cost model by default at -O3 and above. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-07-15 Bill Schmidt wschm...@linux.ibm.com * opts.c (default_option): Add -fvect-cost-model to default options at -O3 and above. Index: gcc/opts.c === --- gcc/opts.c (revision 189481) +++ gcc/opts.c (working copy) @@ -501,6 +501,7 @@ static const struct default_options default_option { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_ftree_vectorize, NULL, 1 }, +{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 }, { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
[PATCH, committed] Fix PR53955
Configure with --disable-build-poststage1-with-cxx exposed functions that should have been marked static. Bootstrapped on powerpc-unknown-linux-gnu, committed as obvious. Thanks, Bill 2012-07-13 Bill Schmidt wschm...@linux.ibm.com PR bootstrap/53955 * config/spu/spu.c (spu_init_cost): Mark static. (spu_add_stmt_cost): Likewise. (spu_finish_cost): Likewise. (spu_destroy_cost_data): Likewise. * config/i386/i386.c (ix86_init_cost): Mark static. (ix86_add_stmt_cost): Likewise. (ix86_finish_cost): Likewise. (ix86_destroy_cost_data): Likewise. * config/rs6000/rs6000.c (rs6000_init_cost): Mark static. (rs6000_add_stmt_cost): Likewise. (rs6000_finish_cost): Likewise. (rs6000_destroy_cost_data): Likewise. Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision 189460) +++ gcc/config/spu/spu.c(working copy) @@ -6919,7 +6919,7 @@ spu_builtin_vectorization_cost (enum vect_cost_for /* Implement targetm.vectorize.init_cost. */ -void * +static void * spu_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -6929,7 +6929,7 @@ spu_init_cost (struct loop *loop_info ATTRIBUTE_UN /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned spu_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -6956,7 +6956,7 @@ spu_add_stmt_cost (void *data, int count, enum vec /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned spu_finish_cost (void *data) { return *((unsigned *) data); @@ -6964,7 +6964,7 @@ spu_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void spu_destroy_cost_data (void *data) { free (data); Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 189460) +++ gcc/config/i386/i386.c (working copy) @@ -40066,7 +40066,7 @@ ix86_autovectorize_vector_sizes (void) /* Implement targetm.vectorize.init_cost. 
*/ -void * +static void * ix86_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -40076,7 +40076,7 @@ ix86_init_cost (struct loop *loop_info ATTRIBUTE_U /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -40103,7 +40103,7 @@ ix86_add_stmt_cost (void *data, int count, enum ve /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned ix86_finish_cost (void *data) { return *((unsigned *) data); @@ -40111,7 +40111,7 @@ ix86_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void ix86_destroy_cost_data (void *data) { free (data); Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 189460) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -3522,7 +3522,7 @@ rs6000_preferred_simd_mode (enum machine_mode mode /* Implement targetm.vectorize.init_cost. */ -void * +static void * rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED) { unsigned *cost = XNEW (unsigned); @@ -3532,7 +3532,7 @@ rs6000_init_cost (struct loop *loop_info ATTRIBUTE /* Implement targetm.vectorize.add_stmt_cost. */ -unsigned +static unsigned rs6000_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind, struct _stmt_vec_info *stmt_info, int misalign) { @@ -3559,7 +3559,7 @@ rs6000_add_stmt_cost (void *data, int count, enum /* Implement targetm.vectorize.finish_cost. */ -unsigned +static unsigned rs6000_finish_cost (void *data) { return *((unsigned *) data); @@ -3567,7 +3567,7 @@ rs6000_finish_cost (void *data) /* Implement targetm.vectorize.destroy_cost_data. */ -void +static void rs6000_destroy_cost_data (void *data) { free (data);
[PATCH] Fix PR46556 (straight-line strength reduction, part 2)
Here's a relatively small piece of strength reduction that solves that pesky addressing bug that got me looking at this in the first place... The main part of the code is the stuff that was reviewed last year, but which needed to find a good home. So hopefully that's in pretty good shape. I recast base_cand_map as an htab again since I now need to look up trees other than SSA names. I plan to put together a follow-up patch to change code and commentary references so that base_name becomes base_expr. Doing that now would clutter up the patch too much. Bootstrapped and tested on powerpc64-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill gcc: PR tree-optimization/46556 * gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF. (base_cand_map): Change to hash table. (base_cand_hash): New function. (base_cand_free): Likewise. (base_cand_eq): Likewise. (lookup_cand): Change base_cand_map to hash table. (find_basis_for_candidate): Likewise. (base_cand_from_table): Exclude CAND_REF. (restructure_reference): New function. (slsr_process_ref): Likewise. (find_candidates_in_block): Call slsr_process_ref. (dump_candidate): Handle CAND_REF. (base_cand_dump_callback): New function. (dump_cand_chains): Change base_cand_map to hash table. (replace_ref): New function. (replace_refs): Likewise. (analyze_candidates_and_replace): Call replace_refs. (execute_strength_reduction): Change base_cand_map to hash table. gcc/testsuite: PR tree-optimization/46556 * testsuite/gcc.dg/tree-ssa/slsr-27.c: New. * testsuite/gcc.dg/tree-ssa/slsr-28.c: New. * testsuite/gcc.dg/tree-ssa/slsr-29.c: New. 
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); +} + +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 3 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0) @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 12) +foo (p->a[n], p->c[n], p->b[n]); + else if (n > 3) +foo (p->b[n], p->a[n], p->c[n]); +} + +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0) @@ -0,0 +1,28 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-dom2" } */ + +struct x +{ + int a[16]; + int b[16]; + int c[16]; +}; + +extern void foo (int, int, int); + +void +f (struct x *p, unsigned int n) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 3) +{ + foo (p->a[n], p->c[n], p->b[n]); + if (n > 12) + foo (p->b[n], p->a[n], p->c[n]); +} +} + +/* { dg-final {
scan-tree-dump-times "\\* 4;" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */ +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } */ +/* { dg-final { cleanup-tree-dump "dom2" } } */ Index: gcc/gimple-ssa-strength-reduction.c === --- gcc/gimple-ssa-strength-reduction.c (revision 189025) +++ gcc/gimple-ssa-strength-reduction.c (working copy) @@ -32,7 +32,7 @@ along with GCC; see the file COPYING3. If not see 2) Explicit multiplies, unknown constant multipliers, no conditional increments. (data gathering complete, replacements pending) - 3) Implicit multiplies in addressing expressions. (pending) + 3) Implicit multiplies in addressing expressions. (complete) 4) Explicit multiplies, conditional increments. (pending) It would also be possible to apply strength
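The "implicit multiplies in addressing expressions" case that these tests exercise can be modeled at the source level. The sketch below is illustrative C, not the pass itself: the pass works on GIMPLE address computations, and the manual pointer arithmetic here only models the effect of computing the scaled index once and reusing it across the three member accesses.

```c
#include <assert.h>

/* Source-level model (illustrative, not pass code) of CAND_REF:
   p->a[n], p->b[n] and p->c[n] each hide an n * sizeof (int)
   multiply plus a distinct constant offset, so one scaled index
   can be shared among them.  */

struct x { int a[16]; int b[16]; int c[16]; };

/* Naive form: three implicit multiplies by sizeof (int).  */
int sum_naive (struct x *p, unsigned int n)
{
  return p->a[n] + p->b[n] + p->c[n];
}

/* Strength-reduced form: the scaled index is computed once and
   reused with constant element offsets 0, 16 and 32.  */
int sum_reduced (struct x *p, unsigned int n)
{
  int *base = (int *) p + n;
  return base[0] + base[16] + base[32];
}
```

The two functions return the same value for any in-range n; the second form is what the restructured MEM_REFs in the dump amount to.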
[PATCH] Strength reduction
Here's a new version of the main strength reduction patch, addressing previous comments. A couple of quick notes: * I opened PR53773 and PR53774 for the cases where commutative operations were encountered with a constant in rhs1. This version of the patch still has the gcc_asserts in place to catch those cases, but I'll plan to remove those once the patch is approved. * You previously asked: +static slsr_cand_t +base_cand_from_table (tree base_in) +{ + slsr_cand mapping_key; + + gimple def = SSA_NAME_DEF_STMT (base_in); + if (!def) +return (slsr_cand_t) NULL; + + mapping_key.cand_stmt = def; + return (slsr_cand_t) htab_find (stmt_cand_map, &mapping_key); isn't that reachable via the base-name -> chain mapping for base_in? I had to review this a bit, but the answer is no. If you look at one of the algebraic manipulations in create_mul_ssa_cand as an example, base_in corresponds to Y. base_cand_from_table is looking for a candidate that has Y for its LHS. The base-name -> chain mapping is used to find all candidates that have B as the base_name. * I added a detailed explanation of what's going on with legal_cast_p. Hopefully this will be easier to understand now. I've bootstrapped this on powerpc64-unknown-linux-gnu with three new regressions (for which I opened the two bug reports). Ok for trunk after removing the asserts? Thanks, Bill gcc: 2012-06-25 Bill Schmidt wschm...@linux.ibm.com * tree-pass.h (pass_strength_reduction): New decl. * tree-ssa-loop-ivopts.c (initialize_costs): Make non-static. (finalize_costs): Likewise. * timevar.def (TV_TREE_SLSR): New timevar. * gimple-ssa-strength-reduction.c: New. * tree-flow.h (initialize_costs): New decl. (finalize_costs): Likewise. * Makefile.in (tree-ssa-strength-reduction.o): New dependencies. * passes.c (init_optimization_passes): Add pass_strength_reduction. gcc/testsuite: 2012-06-25 Bill Schmidt wschm...@linux.ibm.com * gcc.dg/tree-ssa/slsr-1.c: New test. * gcc.dg/tree-ssa/slsr-2.c: Likewise.
* gcc.dg/tree-ssa/slsr-3.c: Likewise. * gcc.dg/tree-ssa/slsr-4.c: Likewise. Index: gcc/tree-pass.h === --- gcc/tree-pass.h (revision 188890) +++ gcc/tree-pass.h (working copy) @@ -452,6 +452,7 @@ extern struct gimple_opt_pass pass_tm_memopt; extern struct gimple_opt_pass pass_tm_edges; extern struct gimple_opt_pass pass_split_functions; extern struct gimple_opt_pass pass_feedback_split_functions; +extern struct gimple_opt_pass pass_strength_reduction; /* IPA Passes */ extern struct simple_ipa_opt_pass pass_ipa_lower_emutls; Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c (revision 0) @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -fdump-tree-optimized" } */ + +extern void foo (int); + +void +f (int *p, unsigned int n) +{ + foo (*(p + n * 4)); + foo (*(p + 32 + n * 4)); + if (n > 3) +foo (*(p + 16 + n * 4)); + else +foo (*(p + 48 + n * 4)); +} + +/* { dg-final { scan-tree-dump-times "\\+ 128" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 64" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 192" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c (revision 0) @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -fdump-tree-optimized" } */ + +extern void foo (int); + +void +f (int *p, int n) +{ + foo (*(p + n++ * 4)); + foo (*(p + 32 + n++ * 4)); + foo (*(p + 16 + n * 4)); +} + +/* { dg-final { scan-tree-dump-times "\\+ 144" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 96" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c === --- gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c (revision 0) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/*
{ dg-options "-O3 -fdump-tree-optimized" } */ + +int +foo (int a[], int b[], int i) +{ + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + a[i] = b[i] + 2; + i++; + return i; +} + +/* { dg-final { scan-tree-dump-times "\\* 4" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 4" 2 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 8" 1 "optimized" } } */ +/* { dg-final { scan-tree-dump-times "\\+ 12" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index:
Re: [PATCH] Strength reduction preliminaries
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote: On Thu, 21 Jun 2012, William J. Schmidt wrote: As promised, this breaks out the changes to the IVOPTS cost model and the added function in double-int.c. Please let me know if you would rather see me attempt to consolidate the IVOPTS logic into expmed.c per Richard H's suggestion. If we start to use it from multiple places that definitely makes sense, but you can move the stuff as a followup. OK, I'll put it on my list. I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. Huh. That's weird and should not happen. Still it makes sense to move this to a per-function cache given that its size is basically unbound. Can you introduce a initialize_costs () / finalize_costs () function pair that allocates / frees the tables and sets a global flag that you can then assert in the functions using those tables? Ok. + if (speed) +speed = 1; I suppose this is because bool is not bool when building with a C compiler? It really looks weird and if such is necessary I'd prefer something like +add_regs_cost (enum machine_mode mode, bool speed) { + static unsigned costs[NUM_MACHINE_MODES][2]; rtx seq; unsigned cost; unsigned sidx = speed ? 0 : 1; + if (costs[mode][sidx]) +return costs[mode][sidx]; + instead. I'm always paranoid about misuse of bools in C, but I suppose this is overkill. I'll just remove the code. Thanks, Bill Otherwise the patch is ok. Thanks, Richard. This reduces the scope of the hash table from a compilation unit to each individual function. 
If it's preferred to maintain compilation unit scope, then the initialization/finalization of the htabs can be pushed out to do_compile. But I doubt it's worth that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-06-21 Bill Schmidt wschm...@linux.ibm.com * double-int.c (double_int_multiple_of): New function. * double-int.h (double_int_multiple_of): New decl. * tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs. (mbc_entry_hash): New forward decl. (mbc_entry_eq): Likewise. (zero_cost): Change to no_cost. (mult_costs): New static var. (tree_ssa_iv_optimize_init): Initialize mult_costs. (add_cost): Change to add_regs_cost; distinguish costs by speed. (multiply_regs_cost): New function. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (multiply_by_cost): Change to multiply_by_const_cost; distinguish costs by speed. (get_address_cost): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (force_expr_to_var_cost): Change zero_cost to no_cost; change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (split_cost): Change zero_cost to no_cost. (ptr_difference_cost): Likewise. (difference_cost): Change zero_cost to no_cost; change multiply_by_cost to multiply_by_const_cost. (get_computation_cost_at): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (determine_use_iv_cost_generic): Change zero_cost to no_cost. (determine_iv_cost): Change add_cost to add_regs_cost. (iv_ca_new): Change zero_cost to no_cost. (tree_ssa_iv_optimize_finalize): Release storage for mult_costs. * tree-ssa-address.c (most_expensive_mult_to_index): Change multiply_by_cost to multiply_by_const_cost. * tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost. (add_regs_cost): New decl. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. 
(negate_reg_cost): Likewise. Index: gcc/double-int.c === --- gcc/double-int.c(revision 188839) +++ gcc/double-int.c(working copy) @@ -865,6 +865,26 @@ double_int_umod (double_int a, double_int b, unsig return double_int_mod (a, b, true, code); } +/* Return TRUE iff PRODUCT is an integral multiple of FACTOR, and return + the multiple in *MULTIPLE. Otherwise return FALSE and leave *MULTIPLE + unchanged. */ + +bool +double_int_multiple_of (double_int product, double_int factor, + bool unsigned_p, double_int *multiple) +{ + double_int
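The initialize_costs ()/finalize_costs () pairing Richard asks for can be sketched as follows. The table shape and names here are illustrative assumptions, not the IVOPTS code; the point is only the protocol of allocate/set flag, assert the flag on use, then free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of an initialize/finalize pair guarding a per-function cost
   cache.  Table size and names are illustrative, not GCC's.  */

static bool costs_initialized = false;
static unsigned *mult_cost_table;

void initialize_costs (void)
{
  mult_cost_table = (unsigned *) calloc (64, sizeof (unsigned));
  costs_initialized = true;
}

unsigned lookup_mult_cost (unsigned idx)
{
  /* Catch uses of the cache outside the initialize/finalize window.  */
  assert (costs_initialized);
  return mult_cost_table[idx];
}

void finalize_costs (void)
{
  free (mult_cost_table);
  mult_cost_table = NULL;
  costs_initialized = false;
}
```

This gives the cache per-function lifetime while still catching stray callers via the assertion.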
Re: [PATCH] Strength reduction preliminaries
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote: On Thu, 21 Jun 2012, William J. Schmidt wrote: I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. Huh. That's weird and should not happen. Still it makes sense to move this to a per-function cache given that its size is basically unbound. Hm, this appears not to be related to my changes. I ran into the same issue when bootstrapping some other change without any of the IVOPTS changes committed. In both cases the stuck lock occurred when compiling tree-vect-stmts.c. I'll try to debug this when I get some time, unless somebody else figures it out sooner. Bill
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 16:20 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. It was in the Correct cost model for strided loads thread. Ah, right, thanks. I think it will be best to make that a separate patch in the series. Like so: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. Just FYI, this is not quite as straightforward as I thought. There is some code in tree-vect-data-refs.c that computes costs for various peeling options and picks one of them. In most other places we can just pass the instructions to the back end at the same place that the costs are currently calculated, but not here. This will require some more major surgery to save the instructions needed from each peeling option and only pass along the ones that end up being chosen. The upside is the same sort of delayed emit is needed for the SLP ordering problem, so the infrastructure for this will be reusable for that problem. Grumble. Bill (1a) Split up the cost hooks (one for loads/stores with misalign parm, one for vector_stmt with tree_code, etc.). (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. I'll start work on this series of patches as I have time between other projects. Thanks! Richard.
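The "delayed emit" Bill describes amounts to recording the would-be cost entries for each peeling option and committing only the chosen option's entries to the eventual target hook. The structure below is an illustrative sketch under that assumption, not the vectorizer's actual data structures.

```c
#include <assert.h>

/* Illustrative sketch of deferred cost recording: entries are saved
   per peeling option and summed only when that option is chosen.
   Names and layout are assumptions, not tree-vect-data-refs.c code.  */

#define MAX_ENTRIES 16

struct option_costs
{
  int n;
  struct { int count; int cost_per_stmt; } entries[MAX_ENTRIES];
};

/* Record a statement's cost instead of summing it immediately.  */
void record_stmt_cost (struct option_costs *oc, int count, int cost)
{
  oc->entries[oc->n].count = count;
  oc->entries[oc->n].cost_per_stmt = cost;
  oc->n++;
}

/* Once a peeling option is picked, commit only its recorded entries.  */
int commit_costs (const struct option_costs *oc)
{
  int total = 0;
  for (int i = 0; i < oc->n; i++)
    total += oc->entries[i].count * oc->entries[i].cost_per_stmt;
  return total;
}
```

The same record-then-commit shape is what makes the infrastructure reusable for the SLP ordering problem mentioned above.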
[PATCH] Strength reduction preliminaries
As promised, this breaks out the changes to the IVOPTS cost model and the added function in double-int.c. Please let me know if you would rather see me attempt to consolidate the IVOPTS logic into expmed.c per Richard H's suggestion. I ran into a glitch with multiply_by_const_cost. The original code declared a static htab_t in the function and allocated it on demand. When I tried adding a second one in the same manner, I ran into a locking problem in the memory management library code during a call to delete_htab. The original implementation seemed a bit dicey to me anyway, so I changed this to explicitly allocate and deallocate the hash tables on (entry to/exit from) IVOPTS. This reduces the scope of the hash table from a compilation unit to each individual function. If it's preferred to maintain compilation unit scope, then the initialization/finalization of the htabs can be pushed out to do_compile. But I doubt it's worth that. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-06-21 Bill Schmidt wschm...@linux.ibm.com * double-int.c (double_int_multiple_of): New function. * double-int.h (double_int_multiple_of): New decl. * tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs. (mbc_entry_hash): New forward decl. (mbc_entry_eq): Likewise. (zero_cost): Change to no_cost. (mult_costs): New static var. (tree_ssa_iv_optimize_init): Initialize mult_costs. (add_cost): Change to add_regs_cost; distinguish costs by speed. (multiply_regs_cost): New function. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. (multiply_by_cost): Change to multiply_by_const_cost; distinguish costs by speed. (get_address_cost): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (force_expr_to_var_cost): Change zero_cost to no_cost; change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. 
(split_cost): Change zero_cost to no_cost. (ptr_difference_cost): Likewise. (difference_cost): Change zero_cost to no_cost; change multiply_by_cost to multiply_by_const_cost. (get_computation_cost_at): Change add_cost to add_regs_cost; change multiply_by_cost to multiply_by_const_cost. (determine_use_iv_cost_generic): Change zero_cost to no_cost. (determine_iv_cost): Change add_cost to add_regs_cost. (iv_ca_new): Change zero_cost to no_cost. (tree_ssa_iv_optimize_finalize): Release storage for mult_costs. * tree-ssa-address.c (most_expensive_mult_to_index): Change multiply_by_cost to multiply_by_const_cost. * tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost. (add_regs_cost): New decl. (multiply_regs_cost): Likewise. (add_const_cost): Likewise. (extend_or_trunc_reg_cost): Likewise. (negate_reg_cost): Likewise. Index: gcc/double-int.c === --- gcc/double-int.c(revision 188839) +++ gcc/double-int.c(working copy) @@ -865,6 +865,26 @@ double_int_umod (double_int a, double_int b, unsig return double_int_mod (a, b, true, code); } +/* Return TRUE iff PRODUCT is an integral multiple of FACTOR, and return + the multiple in *MULTIPLE. Otherwise return FALSE and leave *MULTIPLE + unchanged. */ + +bool +double_int_multiple_of (double_int product, double_int factor, + bool unsigned_p, double_int *multiple) +{ + double_int remainder; + double_int quotient = double_int_divmod (product, factor, unsigned_p, + TRUNC_DIV_EXPR, &remainder); + if (double_int_zero_p (remainder)) +{ + *multiple = quotient; + return true; +} + + return false; +} + /* Set BITPOS bit in A.
*/ double_int double_int_setbit (double_int a, unsigned bitpos) Index: gcc/double-int.h === --- gcc/double-int.h(revision 188839) +++ gcc/double-int.h(working copy) @@ -150,6 +150,8 @@ double_int double_int_divmod (double_int, double_i double_int double_int_sdivmod (double_int, double_int, unsigned, double_int *); double_int double_int_udivmod (double_int, double_int, unsigned, double_int *); +bool double_int_multiple_of (double_int, double_int, bool, double_int *); + double_int double_int_setbit (double_int, unsigned); int double_int_ctz (double_int); Index: gcc/tree-ssa-loop-ivopts.c === --- gcc/tree-ssa-loop-ivopts.c (revision 188839) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -89,13 +89,11 @@ along with GCC; see the file COPYING3. If not see #include target.h #include
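The contract of the new double_int_multiple_of can be mirrored with native integers. This is an illustrative analogue, not the double-int implementation; it keeps the documented behavior of leaving *MULTIPLE unchanged on failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Native-integer analogue of double_int_multiple_of: return true iff
   PRODUCT is an integral multiple of FACTOR, storing the multiple;
   otherwise return false and leave *MULTIPLE unchanged.  */
bool
multiple_of (long long product, long long factor, long long *multiple)
{
  long long remainder = product % factor;
  if (remainder == 0)
    {
      *multiple = product / factor;
      return true;
    }
  return false;
}
```

The double-int version does the same with double_int_divmod and TRUNC_DIV_EXPR in place of the native % and /.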
Re: [Patch ping] Strength reduction
On Wed, 2012-06-20 at 13:11 +0200, Richard Guenther wrote: On Thu, Jun 14, 2012 at 3:21 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Pro forma ping. :) ;) I notice (with all of these functions) +unsigned +negate_cost (enum machine_mode mode, bool speed) +{ + static unsigned costs[NUM_MACHINE_MODES]; + rtx seq; + unsigned cost; + + if (costs[mode]) +return costs[mode]; + + start_sequence (); + force_operand (gen_rtx_fmt_e (NEG, mode, + gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)), + NULL_RTX); + seq = get_insns (); + end_sequence (); + + cost = seq_cost (seq, speed); + if (!cost) +cost = 1; that the cost[] array is independent on the speed argument. Thus whatever comes first determines the cost. Odd, and probably not good. A fix would be appreciated (even for the current code ...) - simply make the array costs[NUM_MACHINE_MODES][2]. As for the renaming - can you name the functions consistently? Thus the above would be negate_reg_cost? And maybe rename the other FIXME function, too? I agree with all this. I'll prepare all the cost model changes as a separate preliminaries patch. Index: gcc/tree-ssa-strength-reduction.c === --- gcc/tree-ssa-strength-reduction.c (revision 0) +++ gcc/tree-ssa-strength-reduction.c (revision 0) @@ -0,0 +1,1611 @@ +/* Straight-line strength reduction. + Copyright (C) 2012 Free Software Foundation, Inc. I know we have these 'tree-ssa-' names, but really this is gimple-ssa now ;) So, please name it gimple-ssa-strength-reduction.c. Will do. Vive la revolution? ;) + /* Access to the statement for subsequent modification. Cached to + save compile time. */ + gimple_stmt_iterator cand_gsi; this is a iterator for cand_stmt? Then caching it is no longer necessary as the iterator is the stmt itself after recent infrastructure changes. Oh yeah, I remember seeing that go by. Nice. Will change. +/* Hash table embodying a mapping from statements to candidates. */ +static htab_t stmt_cand_map; ... 
+static hashval_t +stmt_cand_hash (const void *p) +{ + return htab_hash_pointer (((const_slsr_cand_t) p)->cand_stmt); +} use a pointer-map instead. +/* Callback to produce a hash value for a candidate chain header. */ + +static hashval_t +base_cand_hash (const void *p) +{ + tree ssa_name = ((const_cand_chain_t) p)->base_name; + + if (TREE_CODE (ssa_name) != SSA_NAME) +return (hashval_t) 0; + + return (hashval_t) SSA_NAME_VERSION (ssa_name); +} does it ever happen that ssa_name is not an SSA_NAME? Not in this patch, but when I introduce CAND_REF in a later patch it could happen since the base field of a CAND_REF is a MEM_REF. It's a safety valve in case of misuse. I'll think about this some more. I'm not sure the memory savings over simply using a fixed-size (num_ssa_names) array indexed by SSA_NAME_VERSION pointing to the chain is worth using a hashtable for this? That's reasonable. I'll do that. + node = (cand_chain_t) pool_alloc (chain_pool); + node->base_name = c->base_name; If you never free pool entries it's more efficient to use an obstack. alloc-pool only pays off if you get freed item re-use. OK. I'll change both cand_pool and chain_pool to obstacks. + switch (gimple_assign_rhs_code (gs)) +{ +case MULT_EXPR: + rhs2 = gimple_assign_rhs2 (gs); + + if (TREE_CODE (rhs2) == INTEGER_CST) + return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed); + + if (TREE_CODE (rhs1) == INTEGER_CST) + return multiply_by_cost (TREE_INT_CST_LOW (rhs1), lhs_mode, speed); In theory all commutative statements should have constant operands only at rhs2 ... I'm glad I'm not the only one who thought that was the theory. ;) I wasn't sure, and I've seen violations of this come up in practice. Should I assert when that happens instead, and track down the offending optimizations? Also you do not verify that the constant fits in a host-wide-int - but maybe you do not care? 
Thus, I'd do if (host_integerp (rhs2, 0)) return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed); or make multiply_by[_const?]_cost take a double-int instead. Likewise below for add. Ok. Name change looks good also, I'll include that in the cost model changes. +case MODIFY_EXPR: + /* Be suspicious of assigning costs to copies that may well go away. */ + return 0; MODIFY_EXPR is never a gimple_assign_rhs_code. Simple copies have a code of SSA_NAME for example. But as you assert if you get to an unhandled code I wonder why you needed the above ... I'll remove this, and document that we are deliberately not touching copies (which was my original intent). +static slsr_cand_t +base_cand_from_table (tree base_in) +{ + slsr_cand mapping_key
Re: [Patch ping] Strength reduction
On Wed, 2012-06-20 at 11:52 -0700, Richard Henderson wrote: On 06/20/2012 04:11 AM, Richard Guenther wrote: I notice (with all of these functions) +unsigned +negate_cost (enum machine_mode mode, bool speed) +{ + static unsigned costs[NUM_MACHINE_MODES]; + rtx seq; + unsigned cost; + + if (costs[mode]) +return costs[mode]; + + start_sequence (); + force_operand (gen_rtx_fmt_e (NEG, mode, + gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)), +NULL_RTX); I don't suppose there's any way to share data with what init_expmed computes? Not, strictly speaking, the cleanest thing to include expmed.h here, but surely a tad better than re-computing identical data (and without the clever rtl garbage avoidance tricks). Interesting. I was building on what ivopts already has; not sure of the history there. It looks like there is some overlap in function, but expmed doesn't have everything ivopts uses today (particularly the hash table of costs for multiplies by various constants). The stuff I need for type promotion/demotion is also not present (which I'm computing on demand for whatever mode pairs are encountered). Not sure how great it would be to precompute that for all pairs, and obviously precomputing costs of multiplying by all constants isn't going to work. So if the two functionalities were to be combined, it would seem to require some modification to how expmed works. Thanks, Bill r~
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 12:08 +0200, Richard Guenther wrote: On Mon, 18 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard. I've been looking at this in between other projects. I wanted to be sure I understood the SLP infrastructure and whether it would cause any problems. It looks to me like it will be mostly ok. One issue I noticed is a possible difference in the order in which SLP instructions are analyzed and the order in which the instructions are issued during transformation. For both loop analysis and basic block analysis, SLP trees are constructed and analyzed prior to examining other vectorizable instructions. Their costs are calculated and stored in the SLP trees at this time. 
Later, when transforming statements to their vector equivalents, instructions in the block (or loop body) are processed in order until the first instruction that's part of an SLP tree is encountered. At that point, every instruction that's part of any SLP tree is transformed; then the vectorizer continues with the remaining non-SLP vectorizable statements. So if we do the natural and easy thing of placing calls to add_stmt_cost everywhere that costs are calculated today, the order that those costs are presented to the back end model will possibly be different than the order they are actually emitted. Interesting. But I suppose this is similar to how pattern statements are handled? Thus, the whole pattern sequence is processed when we encounter the main pattern statement? Yes, but the difference is that both vect_analyze_stmt and vect_transform_loop handle the pattern statements in the same order (thankfully -- I would hate to have to deal with the pattern mess). With SLP, all SLP statements are analyzed ahead of time, but they aren't transformed until one of them is encountered in the statement walk. For a first cut at this, I suggest ignoring the problem other than to document it as an opportunity for improvement. Later we could improve it by using an add_stmt_slp_cost () interface (or adding an is_slp flag), and another interface to be called at the time during analysis when the SLP statements will be issued during transformation. This would allow the back end model to queue up the SLP costs in a separate vector and later place them in its internal structures at the appropriate place. It should eventually be possible to remove these fields/accessors: * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST However, I think this should be delayed until we have the basic infrastructure in place for the new model and well-tested. Indeed. 
The other issue is that we should have the model track both the inside and outside costs if we're going to get everything into the target model. For a first pass we can ignore this and keep the existing logic for the outside costs. Later we should add some interfaces analogous to add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so the model can track this stuff as carefully as it wants to. Outside costs are merely added to the niter * inner-cost metric to be compared with the scalar cost niter * scalar-cost, right? Thus they would be tracked completely separate - eventually similar to how we compute the cost of the scalar loop. Yes, that's the way they're used today, and probably nobody will ever want to get fancier than that. But as you say, the idea would be to let them be tracked similarly
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 12:10 +0200, Richard Guenther wrote: On Mon, 18 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); By the way, I don't see much point in passing the void *data around here. Too many levels of interfaces that we'd have to pass it around in the vectorizer, so it would just sit in a static variable. Might as well let the data be wholly private to the target. Ok, so you'd have void init_cost (struct loop *) and unsigned finish_cost (void); then? Static variables are of couse not properly abstracted so we can't ever compute two set of costs at the same time ... but that's true all-over-the-place in GCC ... It's a fair point, and perhaps I'll decide to pass the data pointer around anyway to keep that option open. We'll see which looks uglier. With previous discussion the add_stmt_cost hook would be split up to also allow passing the operation code for example. 
I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. Thanks, Bill Richard.
Re: [PATCH] Add vector cost model density heuristic
On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote: On Tue, 19 Jun 2012, William J. Schmidt wrote: I remember having this discussion, and I was looking for it to check on the details, but I can't seem to find it either in my inbox or in the archives. Can you please point me to that again? Sorry for the bother. It was in the Correct cost model for strided loads thread. Ah, right, thanks. I think it will be best to make that a separate patch in the series. Like so: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. (1a) Split up the cost hooks (one for loads/stores with misalign parm, one for vector_stmt with tree_code, etc.). (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. I'll start work on this series of patches as I have time between other projects. Thanks, Bill Richard.
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard. I've been looking at this in between other projects. I wanted to be sure I understood the SLP infrastructure and whether it would cause any problems. It looks to me like it will be mostly ok. One issue I noticed is a possible difference in the order in which SLP instructions are analyzed and the order in which the instructions are issued during transformation. For both loop analysis and basic block analysis, SLP trees are constructed and analyzed prior to examining other vectorizable instructions. Their costs are calculated and stored in the SLP trees at this time. 
Later, when transforming statements to their vector equivalents, instructions in the block (or loop body) are processed in order until the first instruction that's part of an SLP tree is encountered. At that point, every instruction that's part of any SLP tree is transformed; then the vectorizer continues with the remaining non-SLP vectorizable statements. So if we do the natural and easy thing of placing calls to add_stmt_cost everywhere that costs are calculated today, the order that those costs are presented to the back end model will possibly be different than the order they are actually emitted. For a first cut at this, I suggest ignoring the problem other than to document it as an opportunity for improvement. Later we could improve it by using an add_stmt_slp_cost () interface (or adding an is_slp flag), and another interface to be called at the time during analysis when the SLP statements will be issued during transformation. This would allow the back end model to queue up the SLP costs in a separate vector and later place them in its internal structures at the appropriate place. It should eventually be possible to remove these fields/accessors: * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST However, I think this should be delayed until we have the basic infrastructure in place for the new model and well-tested. The other issue is that we should have the model track both the inside and outside costs if we're going to get everything into the target model. For a first pass we can ignore this and keep the existing logic for the outside costs. Later we should add some interfaces analogous to add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so the model can track this stuff as carefully as it wants to. 
So, I'd propose going at this in several phases: (1) Add calls to the new interface without disturbing existing logic; modify the profitability algorithms to query the new model for inside costs. Default algorithm for the model is to just sum costs as is done today. (x) Add heuristics to target models as desired. (2) Handle the SLP ordering problem. (3) Handle outside costs in the target model. (4) Remove the now unnecessary cost fields and the calls that set them. Item (x) can happen anytime after item (1). I don't think this work is terribly difficult, just a bit tedious. The only really time-consuming aspect of it will be in very careful testing to keep from changing existing behavior. All comments welcome -- please let me know what you think. Thanks, Bill
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: snip Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); By the way, I don't see much point in passing the void *data around here. Too many levels of interfaces that we'd have to pass it around in the vectorizer, so it would just sit in a static variable. Might as well let the data be wholly private to the target. with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. Thanks, Richard.
[PATCH] Fix PR53703
The test case exposes a bug that occurs only when a diamond control flow pattern has the arguments of the joining phi in a different order from the successor arcs of the entry block. My logic for setting bb_for_def[12] was just brain-dead. This cleans that up and also prevents wasting time examining phis of virtual ops, which I noticed happening while debugging this. Bootstrapped and regtested on powerpc64-unknown-linux-gnu with no new failures. Ok for trunk? Thanks, Bill gcc: 2012-06-17 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53703 * tree-ssa-phiopt.c (hoist_adjacent_loads): Skip virtual phis; correctly set bb_for_def[12]. gcc/testsuite: 2012-06-17 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53703 * gcc.dg/torture/pr53703.c: New test. Index: gcc/testsuite/gcc.dg/torture/pr53703.c === --- gcc/testsuite/gcc.dg/torture/pr53703.c (revision 0) +++ gcc/testsuite/gcc.dg/torture/pr53703.c (revision 0) @@ -0,0 +1,149 @@ +/* Reduced test case from PR53703. Used to ICE. 
*/ + +/* { dg-do compile } */ +/* { dg-options -w } */ + +typedef long unsigned int size_t; +typedef unsigned short int sa_family_t; +struct sockaddr {}; +typedef unsigned char __u8; +typedef unsigned short __u16; +typedef unsigned int __u32; +struct nlmsghdr { + __u32 nlmsg_len; + __u16 nlmsg_type; +}; +struct ifaddrmsg { + __u8 ifa_family; +}; +enum { + IFA_ADDRESS, + IFA_LOCAL, +}; +enum { + RTM_NEWLINK = 16, + RTM_NEWADDR = 20, +}; +struct rtattr { + unsigned short rta_len; + unsigned short rta_type; +}; +struct ifaddrs { + struct ifaddrs *ifa_next; + unsigned short ifa_flags; +}; +typedef unsigned short int uint16_t; +typedef unsigned int uint32_t; +struct nlmsg_list { + struct nlmsg_list *nlm_next; + int size; +}; +struct rtmaddr_ifamap { + void *address; + void *local; + int address_len; + int local_len; +}; +int usagi_getifaddrs (struct ifaddrs **ifap) +{ + struct nlmsg_list *nlmsg_list, *nlmsg_end, *nlm; + size_t dlen, xlen, nlen; + int build; + for (build = 0; build = 1; build++) +{ + struct ifaddrs *ifl = ((void *)0), *ifa = ((void *)0); + struct nlmsghdr *nlh, *nlh0; + uint16_t *ifflist = ((void *)0); + struct rtmaddr_ifamap ifamap; + for (nlm = nlmsg_list; nlm; nlm = nlm-nlm_next) + { + int nlmlen = nlm-size; + for (nlh = nlh0; + ((nlmlen) = (int)sizeof(struct nlmsghdr) +(nlh)-nlmsg_len = sizeof(struct nlmsghdr) +(nlh)-nlmsg_len = (nlmlen)); + nlh = ((nlmlen) -= ( (((nlh)-nlmsg_len)+4U -1) ~(4U -1) ), + (struct nlmsghdr*)(((char*)(nlh)) ++ ( (((nlh)-nlmsg_len)+4U -1) + ~(4U -1) + { + struct ifinfomsg *ifim = ((void *)0); + struct ifaddrmsg *ifam = ((void *)0); + struct rtattr *rta; + sa_family_t nlm_family = 0; + uint32_t nlm_scope = 0, nlm_index = 0; + memset (ifamap, 0, sizeof (ifamap)); + switch (nlh-nlmsg_type) + { + case RTM_NEWLINK: + ifim = (struct ifinfomsg *) + ((void*)(((char*)nlh) ++ ((0)+( int) +( ((sizeof(struct nlmsghdr))+4U -1) + ~(4U -1) )))+4U -1) + ~(4U -1) ; + case RTM_NEWADDR: + ifam = (struct ifaddrmsg *) + ((void*)(((char*)nlh) ++ 
((0)+( int) +( ((sizeof(struct nlmsghdr))+4U -1) + ~(4U -1) )))+4U -1) + ~(4U -1) ; + nlm_family = ifam-ifa_family; + if (build) + ifa-ifa_flags = ifflist[nlm_index]; + break; + default: + continue; + } + if (!build) + { + void *rtadata = ((void*)(((char*)(rta)) + + (( ((sizeof(struct rtattr))+4 -1) +~(4 -1) ) + (0; + size_t rtapayload = ((int)((rta)-rta_len) + - (( ((sizeof(struct rtattr))+4 -1) +~(4 -1) ) + (0))); + switch (nlh-nlmsg_type) + { + case RTM_NEWLINK: + break; + case RTM_NEWADDR: + if (nlm_family == 17) + break; + switch (rta-rta_type) + { +
Re: [PATCH, RFC] First cut at using vec_construct for strided loads
On Wed, 2012-06-13 at 11:26 +0200, Richard Guenther wrote: On Tue, 12 Jun 2012, William J. Schmidt wrote: This patch is a follow-up to the discussion generated by http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html. I've added vec_construct to the cost model for use in vect_model_load_cost, and implemented a cost calculation that makes sense to me for PowerPC. I'm less certain about the default, i386, and spu implementations. I took a guess at i386 from the discussions we had, and used the same calculation for the default and for spu. I'm hoping you or others can fill in the blanks if I guessed badly. The i386 cost for vec_construct is different from all the others, which are parameterized for each processor description. This should probably be parameterized in some way as well, but thought you'd know better than I how that should be. Perhaps instead of elements / 2 + 1 it should be (elements / 2) * X + Y where X and Y are taken from the processor description, and represent the cost of a merge and a permute, respectively. Let me know what you think. Looks good to me with the gcc_asserts removed - TYPE_VECTOR_SUBPARTS might be 1 for V1TImode for example (heh, not that the vectorizer would vectorize to that). But I don't see any possible breakage with elements == 1, do you? No, that was some unnecessary sanity testing I was doing for my own curiosity. I'll pull them out and pop this in today. Thanks for the review! Bill Target maintainers can improve on the cost calculation if they wish, the default looks sensible to me. Thanks, Richard. Thanks, Bill 2012-06-12 Bill Schmidt wschm...@linux.ibm.com * targhooks.c (default_builtin_vectorized_conversion): Handle vec_construct, using vectype to base cost on subparts. * target.h (enum vect_cost_for_stmt): Add vec_construct. * tree-vect-stmts.c (vect_model_load_cost): Use vec_construct instead of scalar_to-vec. * config/spu/spu.c (spu_builtin_vectorization_cost): Handle vec_construct in same way as default for now. 
* config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise. * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Handle vec_construct, including special case for 32-bit loads. Index: gcc/targhooks.c === --- gcc/targhooks.c (revision 188482) +++ gcc/targhooks.c (working copy) @@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in int default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost case cond_branch_taken: return 3; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); + gcc_assert (elements 1); + return elements / 2 + 1; + default: gcc_unreachable (); } Index: gcc/target.h === --- gcc/target.h(revision 188482) +++ gcc/target.h(working copy) @@ -146,7 +146,8 @@ enum vect_cost_for_stmt cond_branch_not_taken, cond_branch_taken, vec_perm, - vec_promote_demote + vec_promote_demote, + vec_construct }; /* The target structure. This holds all the backend hooks. */ Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188482) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. 
*/ + tree vectype = STMT_VINFO_VECTYPE (stmt_info); inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies - * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + * TYPE_VECTOR_SUBPARTS (vectype)); + inside_cost += ncopies + * targetm.vectorize.builtin_vectorization_cost (vec_construct, + vectype, 0); } else vect_get_load_cost (first_dr, ncopies, Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision
[PATCH, committed] Fix PR53647
It turns out we have some old machine descriptions that have no L1 cache, so we must account for a zero line size. Regstrapped on powerpc64-unknown-linux-gnu with no new failures, committed as obvious. Thanks, Bill 2012-06-13 Bill Schmidt wschm...@linux.ibm.com PR tree-optimization/53647 * tree-ssa-phiopt.c (gate_hoist_loads): Skip transformation for targets with no defined cache line size. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 188482) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -1976,12 +1976,14 @@ hoist_adjacent_loads (basic_block bb0, basic_block /* Determine whether we should attempt to hoist adjacent loads out of diamond patterns in pass_phiopt. Always hoist loads if -fhoist-adjacent-loads is specified and the target machine has - a conditional move instruction. */ + both a conditional move instruction and a defined cache line size. */ static bool gate_hoist_loads (void) { - return (flag_hoist_adjacent_loads == 1 && HAVE_conditional_move); + return (flag_hoist_adjacent_loads == 1 + && PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE) + && HAVE_conditional_move); } /* Always do these optimizations if we have SSA
[PATCH] Some vector cost model cleanup
This is just some general maintenance to the vectorizer's cost model code: * Corrects a typo in a function name; * Eliminates an unnecessary function; * Combines some duplicate inline functions. Bootstrapped and tested on powerpc64-unknown-linux-gnu, no new regressions. Ok for trunk? Thanks, Bill 2012-06-13 Bill Schmidt wschm...@linux.ibm.com * tree-vectorizer.h (vect_get_stmt_cost): Move from tree-vect-stmts.c. (cost_for_stmt): Remove decl. (vect_get_single_scalar_iteration_cost): Correct typo in name. * tree-vect-loop.c (vect_get_cost): Remove. (vect_get_single_scalar_iteration_cost): Correct typo in name; use vect_get_stmt_cost rather than vect_get_cost. (vect_get_known_peeling_cost): Use vect_get_stmt_cost rather than vect_get_cost. (vect_estimate_min_profitable_iters): Correct typo in call to vect_get_single_scalar_iteration_cost; use vect_get_stmt_cost rather than vect_get_cost. (vect_model_reduction_cost): Use vect_get_stmt_cost rather than vect_get_cost. (vect_model_induction_cost): Likewise. * tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Correct typo in call to vect_get_single_scalar_iteration_cost. * tree-vect-stmts.c (vect_get_stmt_cost): Move to tree-vectorizer.h. (cost_for_stmt): Remove unnecessary function. * Makefile.in (TREE_VECTORIZER_H): Update dependencies. Index: gcc/tree-vectorizer.h === --- gcc/tree-vectorizer.h (revision 188507) +++ gcc/tree-vectorizer.h (working copy) @@ -23,6 +23,7 @@ along with GCC; see the file COPYING3. If not see #define GCC_TREE_VECTORIZER_H #include tree-data-ref.h +#include target.h typedef source_location LOC; #define UNKNOWN_LOC UNKNOWN_LOCATION @@ -769,6 +770,18 @@ vect_pow2 (int x) return res; } +/* Get cost by calling cost target builtin. 
*/ + +static inline +int vect_get_stmt_cost (enum vect_cost_for_stmt type_of_cost) +{ + tree dummy_type = NULL; + int dummy = 0; + + return targetm.vectorize.builtin_vectorization_cost (type_of_cost, + dummy_type, dummy); +} + /*-*/ /* Info on data references alignment. */ /*-*/ @@ -843,7 +856,6 @@ extern void vect_model_load_cost (stmt_vec_info, i extern void vect_finish_stmt_generation (gimple, gimple, gimple_stmt_iterator *); extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info); -extern int cost_for_stmt (gimple); extern tree vect_get_vec_def_for_operand (tree, gimple, tree *); extern tree vect_init_vector (gimple, tree, tree, gimple_stmt_iterator *); @@ -919,7 +931,7 @@ extern int vect_estimate_min_profitable_iters (loo extern tree get_initial_def_for_reduction (gimple, tree, tree *); extern int vect_min_worthwhile_factor (enum tree_code); extern int vect_get_known_peeling_cost (loop_vec_info, int, int *, int); -extern int vect_get_single_scalar_iteraion_cost (loop_vec_info); +extern int vect_get_single_scalar_iteration_cost (loop_vec_info); /* In tree-vect-slp.c. */ extern void vect_free_slp_instance (slp_instance); Index: gcc/tree-vect-loop.c === --- gcc/tree-vect-loop.c(revision 188507) +++ gcc/tree-vect-loop.c(working copy) @@ -1201,19 +1201,6 @@ vect_analyze_loop_form (struct loop *loop) } -/* Get cost by calling cost target builtin. */ - -static inline int -vect_get_cost (enum vect_cost_for_stmt type_of_cost) -{ - tree dummy_type = NULL; - int dummy = 0; - - return targetm.vectorize.builtin_vectorization_cost (type_of_cost, - dummy_type, dummy); -} - - /* Function vect_analyze_loop_operations. Scan the loop stmts and make sure they are all vectorizable. */ @@ -2385,7 +2372,7 @@ vect_force_simple_reduction (loop_vec_info loop_in /* Calculate the cost of one scalar iteration of the loop. 
*/ int -vect_get_single_scalar_iteraion_cost (loop_vec_info loop_vinfo) +vect_get_single_scalar_iteration_cost (loop_vec_info loop_vinfo) { struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); @@ -2434,12 +2421,12 @@ int if (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt))) { if (DR_IS_READ (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt - stmt_cost = vect_get_cost (scalar_load); + stmt_cost = vect_get_stmt_cost (scalar_load); else - stmt_cost = vect_get_cost (scalar_store); + stmt_cost = vect_get_stmt_cost (scalar_store); }
Re: [PATCH] Correct cost model for strided loads
On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote: Btw, with PR53533 I now have a case where multiplications of v4si are really expensive on x86 without SSE 4.1. But we only have vect_stmt_cost and no further subdivision ... Thus we'd need a tree_code argument to the cost hook. Though it gets quite overloaded then, so maybe splitting it into one handling loads/stores (and get the misalign parameter) and one handling only vector_stmt but with a tree_code argument. Or splitting it even further, seeing cond_branch_taken ... Yes, I think subdividing the hook for the vector_stmt kind is pretty much inevitable -- more situations like this expensive multiply will arise. I agree with the interface starting to get messy also. Splitting it is probably the way to go -- a little painful but keeping it all in one hook is going to get ugly. Bill Richard.
[PATCH, RFC] First cut at using vec_construct for strided loads
This patch is a follow-up to the discussion generated by http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html. I've added vec_construct to the cost model for use in vect_model_load_cost, and implemented a cost calculation that makes sense to me for PowerPC. I'm less certain about the default, i386, and spu implementations. I took a guess at i386 from the discussions we had, and used the same calculation for the default and for spu. I'm hoping you or others can fill in the blanks if I guessed badly. The i386 cost for vec_construct is different from all the others, which are parameterized for each processor description. This should probably be parameterized in some way as well, but I thought you'd know better than I how that should be done. Perhaps instead of elements / 2 + 1 it should be (elements / 2) * X + Y, where X and Y are taken from the processor description and represent the cost of a merge and a permute, respectively. Let me know what you think. Thanks, Bill 2012-06-12 Bill Schmidt wschm...@linux.ibm.com * targhooks.c (default_builtin_vectorization_cost): Handle vec_construct, using vectype to base cost on subparts. * target.h (enum vect_cost_for_stmt): Add vec_construct. * tree-vect-stmts.c (vect_model_load_cost): Use vec_construct instead of scalar_to_vec. * config/spu/spu.c (spu_builtin_vectorization_cost): Handle vec_construct in same way as default for now. * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise. * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Handle vec_construct, including special case for 32-bit loads. 
Index: gcc/targhooks.c === --- gcc/targhooks.c (revision 188482) +++ gcc/targhooks.c (working copy) @@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in int default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost case cond_branch_taken: return 3; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); + gcc_assert (elements > 1); + return elements / 2 + 1; + default: gcc_unreachable (); } Index: gcc/target.h === --- gcc/target.h(revision 188482) +++ gcc/target.h(working copy) @@ -146,7 +146,8 @@ enum vect_cost_for_stmt cond_branch_not_taken, cond_branch_taken, vec_perm, - vec_promote_demote + vec_promote_demote, + vec_construct }; /* The target structure. This holds all the backend hooks. */ Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188482) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. 
*/ + tree vectype = STMT_VINFO_VECTYPE (stmt_info); inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies - * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + * TYPE_VECTOR_SUBPARTS (vectype)); + inside_cost += ncopies + * targetm.vectorize.builtin_vectorization_cost (vec_construct, + vectype, 0); } else vect_get_load_cost (first_dr, ncopies, Index: gcc/config/spu/spu.c === --- gcc/config/spu/spu.c(revision 188482) +++ gcc/config/spu/spu.c(working copy) @@ -6908,9 +6908,11 @@ spu_builtin_mask_for_load (void) /* Implement targetm.vectorize.builtin_vectorization_cost. */ static int spu_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, -tree vectype ATTRIBUTE_UNUSED, +tree vectype, int misalign ATTRIBUTE_UNUSED) { + unsigned elements; + switch (type_of_cost) { case scalar_stmt: @@ -6937,6 +6939,11 @@ spu_builtin_vectorization_cost (enum vect_cost_for case cond_branch_taken: return 6; + case vec_construct: + elements = TYPE_VECTOR_SUBPARTS (vectype); +
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote: On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. 
*/ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking
Re: [PATCH] Correct cost model for strided loads
On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Well, would the real cost not be subparts * scalar_to_vec plus subparts * vec_perm? At least vec_perm isn't the cost for building up a vector from N scalar elements either (it might be close enough if subparts == 2). What's the case with facerec here? Does it have subparts == 2? In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. 
If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. I really wanted to pessimize this case for say AVX and char elements, thus building up a vector from 32 scalars which certainly does not cost a mere vec_perm. So, maybe special-case the subparts == 2 case and assume vec_perm would match the cost only in that case. (I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads are converted to 64-bit values in scalar float registers and they have to be coerced back to 32-bit. Very ugly. The cost model currently doesn't represent this at all, which I'll have to look at fixing at some point in some way that isn't too nasty for the other targets. The cost model for PowerPC seems to need a lot of TLC. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-06-10 Bill Schmidt wschm...@linux.ibm.com * tree-vect-stmts.c (vect_model_load_cost): Change cost model for strided loads. Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188341) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. 
*/ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. */ inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); } else vect_get_load_cost (first_dr, ncopies,
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: This patch adds a heuristic to the vectorizer when estimating the minimum profitable number of iterations. The heuristic is target-dependent, and is currently disabled for all targets except PowerPC. However, the intent is to make it general enough to be useful for other targets that want to opt in. A previous patch addressed some PowerPC SPEC degradations by modifying the vector cost model values for vec_perm and vec_promote_demote. The values were set a little higher than their natural values because the natural values were not sufficient to prevent a poor vectorization choice. However, this is not the right long-term solution, since it can unnecessarily constrain other vectorization choices involving permute instructions. Analysis of the badly vectorized loop (in sphinx3) showed that the problem was overcommitment of vector resources -- too many vector instructions issued without enough non-vector instructions available to cover the delays. The vector cost model assumes that instructions always have a constant cost, and doesn't have a way of judging this kind of density of vector instructions. The present patch adds a heuristic to recognize when a loop is likely to overcommit resources, and adds a small penalty to the inside-loop cost to account for the expected stalls. The heuristic is parameterized with three target-specific values: * Density threshold: The heuristic will apply only when the percentage of inside-loop cost attributable to vectorized instructions exceeds this value. * Size threshold: The heuristic will apply only when the inside-loop cost exceeds this value. * Penalty: The inside-loop cost will be increased by this percentage value when the heuristic applies. Thus only reasonably large loop bodies that are mostly vectorized instructions will be affected. 
By applying only a small percentage bump to the inside-loop cost, the heuristic will only turn off vectorization for loops that were considered barely profitable to begin with (such as the sphinx3 loop). So the heuristic is quite conservative and should not affect the vast majority of vectorization decisions. Together with the new heuristic, this patch reduces the vec_perm and vec_promote_demote costs for PowerPC to their natural values. I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu and verified that no performance regressions occur on SPEC cpu2006. Is this ok for trunk? Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. OK, I'm trying to understand how you would want this built into the present structure. Taking just the loop case for now: Judging by your suggested API, we would have to call add_stmt_cost () everywhere that we now call stmt_vinfo_set_inside_of_loop_cost (). 
For now this would be an additional call, not a replacement, though maybe the other goes away eventually. This allows the target to save more data about the vectorized instructions than just an accumulated cost number (order and quantity of various kinds of instructions can be maintained for better modeling). Presumably the call to finish_cost would be done within vect_estimate_min_profitable_iters () to produce the final value of inside_cost for the loop. The default target hook for add_stmt_cost would duplicate what we currently do for calculating the inside_cost of a statement, and the default target hook for finish_cost would just return the sum. I'll have to go hunting where the similar code would fit for SLP in a basic block. If I read you correctly, you don't object to a density heuristic such as the one I implemented here, but you want
Re: [PATCH] Correct cost model for strided loads
On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Well, would the real cost not be subparts * scalar_to_vec plus subparts * vec_perm? At least vec_perm isn't the cost for building up a vector from N scalar elements either (it might be close enough if subparts == 2). What's the case with facerec here? Does it have subparts == 2? In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. I see. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? It's a vector splat, thus x -> { x, x, x, ... }. You can create { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, as VEC_PERM_EXPR, takes two input vectors). That's by far not the most efficient way to build up such a vector of course (with AVX you could do one splat plus N - 1 inserts for example). The cost is of course dependent on the number of vector elements, so a simple new enum vect_cost_for_stmt kind does not cover it but the target would have to look at the vector type passed and do some reasonable guess. Ah, splat! Yes, that's lingo I understand. I see the intent now. 
On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. Looks similar to x86 SSE then. If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. Heh, usually the intent was to make the changes simple, not to compute a proper cost. I think we simply need a new scalars_to_vec cost kind. That works. Maybe vec_construct gets the point across a little better? I think we need to use the full builtin_vectorization_cost interface instead of vect_get_stmt_cost here, so the targets can parameterize on type. Then we can just do one cost calculation for vec_construct that covers the full costs of getting the vector in order after the loads. I really wanted to pessimize this case for say AVX and char elements, thus building up a vector from 32 scalars which certainly does not cost a mere vec_perm. So, maybe special-case the subparts == 2 case and assume vec_perm would match the cost only in that case. 
(I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) Oh, I somehow read the patch as you were removing the multiplication by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted to reflect it with N scalar loads plus N splats plus N - 1 permutes originally. You could also model it with N scalar loads plus N inserts (but we don't have a vec_insert cost either). I think adding a scalars_to_vec or vec_init or however we want to call it - basically what the cost of a vector CONSTRUCTOR would be - is best. (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 16:58 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote: On Fri, 8 Jun 2012, William J. Schmidt wrote: This patch adds a heuristic to the vectorizer when estimating the minimum profitable number of iterations. The heuristic is target-dependent, and is currently disabled for all targets except PowerPC. However, the intent is to make it general enough to be useful for other targets that want to opt in. A previous patch addressed some PowerPC SPEC degradations by modifying the vector cost model values for vec_perm and vec_promote_demote. The values were set a little higher than their natural values because the natural values were not sufficient to prevent a poor vectorization choice. However, this is not the right long-term solution, since it can unnecessarily constrain other vectorization choices involving permute instructions. Analysis of the badly vectorized loop (in sphinx3) showed that the problem was overcommitment of vector resources -- too many vector instructions issued without enough non-vector instructions available to cover the delays. The vector cost model assumes that instructions always have a constant cost, and doesn't have a way of judging this kind of density of vector instructions. The present patch adds a heuristic to recognize when a loop is likely to overcommit resources, and adds a small penalty to the inside-loop cost to account for the expected stalls. The heuristic is parameterized with three target-specific values: * Density threshold: The heuristic will apply only when the percentage of inside-loop cost attributable to vectorized instructions exceeds this value. * Size threshold: The heuristic will apply only when the inside-loop cost exceeds this value. * Penalty: The inside-loop cost will be increased by this percentage value when the heuristic applies. 
Thus only reasonably large loop bodies that are mostly vectorized instructions will be affected. By applying only a small percentage bump to the inside-loop cost, the heuristic will only turn off vectorization for loops that were considered barely profitable to begin with (such as the sphinx3 loop). So the heuristic is quite conservative and should not affect the vast majority of vectorization decisions. Together with the new heuristic, this patch reduces the vec_perm and vec_promote_demote costs for PowerPC to their natural values. I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu and verified that no performance regressions occur on SPEC cpu2006. Is this ok for trunk? Hmm. I don't like this patch or its general idea too much. Instead I'd like us to move more of the cost model detail to the target, giving it a chance to look at the whole loop before deciding on a cost. ISTR posting the overall idea at some point, but let me repeat it here instead of trying to find that e-mail. The basic interface of the cost model should be, in targetm.vectorize /* Tell the target to start cost analysis of a loop or a basic-block (if the loop argument is NULL). Returns an opaque pointer to target-private data. */ void *init_cost (struct loop *loop); /* Add cost for N vectorized-stmt-kind statements in vector_mode. */ void add_stmt_cost (void *data, unsigned n, vectorized-stmt-kind, enum machine_mode vector_mode); /* Tell the target to compute and return the cost of the accumulated statements and free any target-private data. */ unsigned finish_cost (void *data); with eventually slightly different signatures for add_stmt_cost (like pass in the original scalar stmt?). It allows the target, at finish_cost time, to evaluate things like register pressure and resource utilization. OK, I'm trying to understand how you would want this built into the present structure. 
Taking just the loop case for now: Judging by your suggested API, we would have to call add_stmt_cost () everywhere that we now call stmt_vinfo_set_inside_of_loop_cost (). For now this would be an additional call, not a replacement, though maybe the other goes away eventually. This allows the target to save more data about the vectorized instructions than just an accumulated cost number (order and quantity of various kinds of instructions can be maintained for better modeling). Presumably the call to finish_cost would
Re: [PATCH] Add vector cost model density heuristic
On Mon, 2012-06-11 at 11:09 -0400, David Edelsohn wrote: On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther rguent...@suse.de wrote: Well, they are at least magic numbers and heuristics that apply generally and not only to the single issue in sphinx. And in fact how it works for sphinx _is_ magic. Second, I suggest that you need to rephrase "I can make you" and re-send your reply. Sorry for my bad english. Consider it meaning that I'd rather have you think about a more proper solution. That's what patch review is about after all, no? Sometimes a complete re-write (which gets more difficult with each of the patches enhancing the not ideal current state) is the best thing to do. Richard, The values of the heuristics may be magic, but Bill believes the heuristics are testing the important characteristics. The heuristics themselves are controlled by hooks, so the target can set the correct values for their own requirements. The concern is that a general cost infrastructure is too general. And, based on history, all ports simply will copy the boilerplate from the first implementation. It also may cause more problems because the target has relatively little information to be able to judge heuristics at that point in the middle-end. If the targets start to get too cute or too complicated, it may cause more problems or more confusion about why more complicated heuristics are not effective and not producing the expected results. I worry about creating another machine dependent reorg catch-all pass. Maybe an incremental pre- and/or post- cost hook would be more effective. I will let Bill comment. Thanks David, I can see both sides of this, and it's hard to judge the future from where I stand. My belief is that the number of heuristics targets will implement will be fairly limited, since judgments about cycle-level costs are not accurately predictable during the middle end. All we can do is come up with a few things that seem to make sense. 
Doing too much in the back end seems impractical. The interesting question to me is whether cost model heuristics are general enough to be reusable. What I saw in this case was what I considered to be a somewhat target-neutral problem: overwhelming those assets of the processor that implement vectorization. It seemed reasonable to provide hooks for others to use the idea if they encounter similar issues. If reusing the heuristic is useful, then having to copy the logic from one target to another isn't the best approach. If nobody else will ever use it, then embedding it in the back end is reasonable. Unfortunately my crystal ball has been on the fritz for several decades, so I can't tell you for sure which is right... Richard, my biggest question is whether you think other targets are likely to take advantage of a more general back-end interface, or whether this will end up just being a PowerPC wart. If you know of ways this will be useful for i386, that would be helpful to know. Perhaps this requires your crystal ball as well; not sure how well yours works... If we look at just this one issue in isolation, then changing all the code in the vectorizer that calculates inside/outside loop costs and moving it to targetm seems more invasive than adding the few hooks. But if this will really be a useful feature for the community as a whole I am certainly willing to tackle it. Thanks, Bill Thanks, David
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 14:59 +0200, Richard Guenther wrote: On Mon, 11 Jun 2012, William J. Schmidt wrote: On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote: On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. 
Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. 
+ + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-11 at 12:11 -0500, William J. Schmidt wrote: I found this parameter that seems to correspond to well-predicted conditional jumps: /* When branch is predicted to be taken with probability lower than this threshold (in percent), then it is considered well predictable. */ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME, "predictable-branch-outcome", "Maximal estimated outcome of branch considered predictable", 2, 0, 50) ...which has an interface predictable_edge_p () in predict.c, so that's what I'll use. Thanks, Bill
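The gating idea above can be sketched as a small helper. This is a hypothetical stand-in for predict.c's predictable_edge_p, not the actual GCC implementation: an edge whose estimated taken probability falls at or below the threshold (or at or above 100 minus it) counts as well predicted, and load hoisting is then skipped because the branch is already cheap.

```c
/* Hypothetical sketch of the predictability test.  Probabilities are
   in percent; a threshold of 2 mirrors the default value of the
   predictable-branch-outcome param quoted above.  */
int edge_predictable_p (int taken_probability_pct, int threshold_pct)
{
  /* Either almost never taken or almost always taken.  */
  return taken_probability_pct <= threshold_pct
         || taken_probability_pct >= 100 - threshold_pct;
}

/* The hoisting transformation would be gated roughly like this:
   do_hoist = !edge_predictable_p (then_prob, 2)
              && !edge_predictable_p (else_prob, 2);  */
```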
Re: [PATCH] Hoist adjacent loads
OK, once more with feeling... :) This patch differs from the previous one in two respects: It disables the optimization when either the then or else edge is well-predicted; and it now uses the existing l1-cache-line-size parameter instead of a new one (with updated commentary). Bootstraps and tests with no new regressions on powerpc64-unknown-linux-gnu. One last performance run is underway, but I don't expect any surprises since both changes are more conservative. The original benchmark issue is still resolved. Is this version ok for trunk? Thanks, Bill 2012-06-11 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist-adjacent-loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. Index: gcc/opts.c === --- gcc/opts.c (revision 188390) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 188390) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); +
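At the source level, the transformation described in the comment above amounts to the following. The struct and function names here are hypothetical, chosen only to illustrate the idiom; the key point is that left and right are adjacent pointer fields, so if one load is safe the other (on the same or an adjacent cache line) is very likely safe too, and the select can become a conditional move (x86 CMOV, PowerPC ISEL).

```c
#include <stddef.h>

/* Hypothetical tree node: "left" and "right" are adjacent pointer
   fields, matching the (expr).field1 / (expr).field2 pattern.  */
struct node {
  struct node *left;
  struct node *right;
};

/* Original idiom: exactly one load executes, forcing a branch.  */
struct node *pick_branchy (struct node *y, int cond)
{
  struct node *x;
  if (cond)
    x = y->left;
  else
    x = y->right;
  return x;
}

/* After adjacent-load hoisting: both loads execute unconditionally,
   leaving a branch-free select that can map to cmove/isel.  */
struct node *pick_hoisted (struct node *y, int cond)
{
  struct node *x1 = y->left;   /* speculative relative to the original */
  struct node *x2 = y->right;
  return cond ? x1 : x2;
}
```

Both versions compute the same value; the hoisted form trades one extra load for the removal of a hard-to-predict branch.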
[PATCH] Correct cost model for strided loads
The fix for PR53331 caused a degradation to 187.facerec on powerpc64-unknown-linux-gnu. The following simple patch reverses the degradation without otherwise affecting SPEC cpu2000 or cpu2006. Bootstrapped and regtested on that platform with no new regressions. Ok for trunk? Thanks, Bill 2012-06-10 Bill Schmidt wschm...@linux.ibm.com * tree-vect-stmts.c (vect_model_load_cost): Change cost model for strided loads. Index: gcc/tree-vect-stmts.c === --- gcc/tree-vect-stmts.c (revision 188341) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. */ inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); } else vect_get_load_cost (first_dr, ncopies,
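The one-line cost change in the patch above can be sketched as follows. The function name and unit costs are illustrative, not GCC's actual vect_get_stmt_cost values; the point is the shape of the formula: N scalar loads per vector copy, plus one permute (rather than a scalar_to_vec) per copy to gather the loaded scalars into a vector.

```c
/* Illustrative model of the revised strided-load cost.
   ncopies          - number of vector copies generated
   nunits           - TYPE_VECTOR_SUBPARTS of the vector type
   scalar_load_cost - assumed per-statement cost of a scalar load
   vec_perm_cost    - assumed per-statement cost of a vector permute  */
int strided_load_cost (int ncopies, int nunits,
                       int scalar_load_cost, int vec_perm_cost)
{
  /* The loads themselves: N scalar loads per copy.  */
  int cost = scalar_load_cost * ncopies * nunits;
  /* Gathering them into a vector, now modeled as a permute per copy
     instead of the previous scalar_to_vec.  */
  cost += ncopies * vec_perm_cost;
  return cost;
}
```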
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-06-04 at 08:45 -0500, William J. Schmidt wrote: Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Final performance tests are complete and show no degradations on SPEC cpu2006 on powerpc64-unknown-linux-gnu. Is the patch ok for trunk? Thanks! Bill Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. 
Index: gcc/opts.c === --- gcc/opts.c(revision 187805) +++ gcc/opts.c(working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. */ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. 
+ + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom
Re: [PATCH] Hoist adjacent pointer loads
Hi Richard, Here's a revision of the hoist-adjacent-loads patch. I'm sorry for the delay since the last revision, but my performance testing has been blocked waiting for a fix to PR53487. I ended up applying a test version of the patch to 4.7 and ran performance numbers with that instead, with no degradations. In addition to addressing your comments, this patch contains one bug fix where local_mem_dependence was called on the wrong blocks after swapping def1 and def2. Bootstrapped with no regressions on powerpc64-unknown-linux-gnu. Is this version ok for trunk? I won't commit it until I can do final testing on trunk in conjunction with a fix for PR53487. Thanks, Bill 2012-06-04 Bill Schmidt wschm...@linux.vnet.ibm.com * opts.c: Add -fhoist_adjacent_loads to -O2 and above. * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/opts.c === --- gcc/opts.c (revision 187805) +++ gcc/opts.c (working copy) @@ -489,6 +489,7 @@ static const struct default_options default_option { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 }, { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 }, { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 }, /* -O3 optimizations. 
*/ { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 }, Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187805) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to
Re: [PATCH] Hoist adjacent pointer loads
On Wed, 2012-05-23 at 13:25 +0200, Richard Guenther wrote: On Tue, 22 May 2012, William J. Schmidt wrote: Here's a revision of the hoist-adjacent-loads patch. Besides hopefully addressing all your comments, I added a gate of at least -O2 for this transformation. Let me know if you prefer a different minimum opt level. I'm still running SPEC tests to make sure there are no regressions when opening this up to non-pointer arguments. The code bootstraps on powerpc64-unknown-linux-gnu with no regressions. Assuming the SPEC numbers come out as expected, is this ok? Thanks, Bill 2012-05-22 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187728) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */ @@ -227,9 +282,11 @@ static tree condstoretemp; /* The core routine of conditional store replacement and normal phi optimizations. Both share much of the infrastructure in how to match applicable basic block
Re: [PATCH] Hoist adjacent pointer loads
Here's a revision of the hoist-adjacent-loads patch. Besides hopefully addressing all your comments, I added a gate of at least -O2 for this transformation. Let me know if you prefer a different minimum opt level. I'm still running SPEC tests to make sure there are no regressions when opening this up to non-pointer arguments. The code bootstraps on powerpc64-unknown-linux-gnu with no regressions. Assuming the SPEC numbers come out as expected, is this ok? Thanks, Bill 2012-05-22 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_mem_dependence): New function. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187728) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. 
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */ @@ -227,9 +282,11 @@ static tree condstoretemp; /* The core routine of conditional store replacement and normal phi optimizations. Both share much of the infrastructure in how to match applicable basic block patterns. DO_STORE_ELIM is true - when we want to do conditional store replacement, false otherwise. */ + when we want to do conditional store replacement, false otherwise. + DO_HOIST_LOADS is true when we want to hoist adjacent loads out + of diamond control flow patterns, false otherwise. */ static unsigned int
Re: [PATCH] Hoist adjacent pointer loads
On Mon, 2012-05-21 at 14:17 +0200, Richard Guenther wrote: On Thu, May 3, 2012 at 4:33 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3. If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) 
goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y-left; + else + x = y-right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt
[PATCH, rs6000] Fix PR53385
This repairs the bootstrap issue due to unsafe signed overflow assumptions. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-05-18 Bill Schmidt wschm...@linux.vnet.ibm.com * config/rs6000/rs6000.c (print_operand): Revise code that unsafely relied on signed overflow behavior. Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 187651) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -14679,7 +14679,6 @@ void print_operand (FILE *file, rtx x, int code) { int i; - HOST_WIDE_INT val; unsigned HOST_WIDE_INT uval; switch (code) @@ -15120,34 +15119,17 @@ print_operand (FILE *file, rtx x, int code) case 'W': /* MB value for a PowerPC64 rldic operand. */ - val = (GET_CODE (x) == CONST_INT -? INTVAL (x) : CONST_DOUBLE_HIGH (x)); + i = clz_hwi (GET_CODE (x) == CONST_INT + ? INTVAL (x) : CONST_DOUBLE_HIGH (x)); - if (val < 0) - i = -1; - else - for (i = 0; i < HOST_BITS_PER_WIDE_INT; i++) - if ((val <<= 1) < 0) - break; - #if HOST_BITS_PER_WIDE_INT == 32 - if (GET_CODE (x) == CONST_INT && i >= 0) + if (GET_CODE (x) == CONST_INT && i > 0) i += 32; /* zero-extend high-part was all 0's */ else if (GET_CODE (x) == CONST_DOUBLE && i == 32) - { - val = CONST_DOUBLE_LOW (x); - - gcc_assert (val); - if (val < 0) - --i; - else - for ( ; i < 64; i++) - if ((val <<= 1) < 0) - break; - } + i = clz_hwi (CONST_DOUBLE_LOW (x)) + 32; #endif - fprintf (file, "%d", i + 1); + fprintf (file, "%d", i); return; case 'x':
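The undefined behavior being removed is the old shift-until-negative loop: repeatedly left-shifting a signed HOST_WIDE_INT until the sign bit sets is signed overflow, which the compiler may assume never happens. clz_hwi computes the same leading-zero count without that hazard. The sketch below is not GCC's code: it rewrites the loop over an unsigned 64-bit value (so the sketch itself is well defined) and checks it against __builtin_clzll, a plausible stand-in for what clz_hwi maps to on a 64-bit host.

```c
/* Leading-zero count by shifting, as the removed loop did, but using
   unsigned arithmetic so the sketch avoids the signed-overflow UB.  */
int clz_loop (unsigned long long v)
{
  int n = 0;
  if (v == 0)
    return 64;                /* degenerate case; clz is host-defined */
  while (!(v >> 63))          /* until the "sign" bit position is set */
    {
      v <<= 1;
      n++;
    }
  return n;
}
```

The equivalence clz_loop (x) == __builtin_clzll (x) for nonzero x is what lets the patch replace the loop (and the i + 1 bookkeeping around it) with a single clz_hwi call.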
[PATCH] Simplify attempt_builtin_powi logic
This patch gives up on using the reassociation rank algorithm to correctly place __builtin_powi calls and their feeding multiplies. In the end this proved to introduce more complexity than it saved, due in part to the poor fit of introducing DAG expressions into the reassociated operand tree. This patch returns to generating explicit multiplies to bind the builtin calls together and to the results of the expression tree rewrite. I feel this version is smaller, easier to understand, and less fragile than the existing code. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. Ok for trunk? Thanks, Bill 2012-05-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (bip_map): Remove decl. (completely_remove_stmt): Remove function. (remove_def_if_absorbed_call): Remove function. (remove_visited_stmt_chain): Remove __builtin_powi handling. (possibly_move_powi): Remove function. (rewrite_expr_tree): Remove calls to possibly_move_powi. (rewrite_expr_tree_parallel): Likewise. (attempt_builtin_powi): Build multiplies explicitly rather than relying on the ops vector and rank system. (transform_stmt_to_copy): New function. (transform_stmt_to_multiply): Likewise. (reassociate_bb): Handle leftover operations after __builtin_powi optimization; build a final multiply if necessary. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 187626) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -200,10 +200,6 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; -/* Map from inserted __builtin_powi calls to multiply chains that - feed them. */ -static struct pointer_map_t *bip_map; - /* Forward decls. */ static long get_rank (tree); @@ -2184,32 +2180,6 @@ is_phi_for_stmt (gimple stmt, tree operand) return false; } -/* Remove STMT, unlink its virtual defs, and release its SSA defs. 
*/ - -static inline void -completely_remove_stmt (gimple stmt) -{ - gimple_stmt_iterator gsi = gsi_for_stmt (stmt); - gsi_remove (gsi, true); - unlink_stmt_vdef (stmt); - release_defs (stmt); -} - -/* If OP is defined by a builtin call that has been absorbed by - reassociation, remove its defining statement completely. */ - -static inline void -remove_def_if_absorbed_call (tree op) -{ - gimple stmt; - - if (TREE_CODE (op) == SSA_NAME - has_zero_uses (op) - is_gimple_call ((stmt = SSA_NAME_DEF_STMT (op))) - gimple_visited_p (stmt)) -completely_remove_stmt (stmt); -} - /* Remove def stmt of VAR if VAR has zero uses and recurse on rhs1 operand if so. */ @@ -2218,7 +2188,6 @@ remove_visited_stmt_chain (tree var) { gimple stmt; gimple_stmt_iterator gsi; - tree var2; while (1) { @@ -2228,95 +2197,15 @@ remove_visited_stmt_chain (tree var) if (is_gimple_assign (stmt) gimple_visited_p (stmt)) { var = gimple_assign_rhs1 (stmt); - var2 = gimple_assign_rhs2 (stmt); gsi = gsi_for_stmt (stmt); gsi_remove (gsi, true); release_defs (stmt); - /* A multiply whose operands are both fed by builtin pow/powi -calls must check whether to remove rhs2 as well. */ - remove_def_if_absorbed_call (var2); } - else if (is_gimple_call (stmt) gimple_visited_p (stmt)) - { - completely_remove_stmt (stmt); - return; - } else return; } } -/* If OP is an SSA name, find its definition and determine whether it - is a call to __builtin_powi. If so, move the definition prior to - STMT. Only do this during early reassociation. 
*/ - -static void -possibly_move_powi (gimple stmt, tree op) -{ - gimple stmt2, *mpy; - tree fndecl; - gimple_stmt_iterator gsi1, gsi2; - - if (!first_pass_instance - || !flag_unsafe_math_optimizations - || TREE_CODE (op) != SSA_NAME) -return; - - stmt2 = SSA_NAME_DEF_STMT (op); - - if (!is_gimple_call (stmt2) - || !has_single_use (gimple_call_lhs (stmt2))) -return; - - fndecl = gimple_call_fndecl (stmt2); - - if (!fndecl - || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) -return; - - switch (DECL_FUNCTION_CODE (fndecl)) -{ -CASE_FLT_FN (BUILT_IN_POWI): - break; -default: - return; -} - - /* Move the __builtin_powi. */ - gsi1 = gsi_for_stmt (stmt); - gsi2 = gsi_for_stmt (stmt2); - gsi_move_before (gsi2, gsi1); - - /* See if there are multiplies feeding the __builtin_powi base - argument that must also be moved. */ - while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) -{ - /* If we've already moved this statement, we're done. This is - identified by a NULL entry for the statement in bip_map. */ - gimple *next = (gimple *)
Re: PING: [PATCH] Fix PR53217
On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote: On Tue, 15 May 2012, William J. Schmidt wrote: Ping. I don't like it too much - but pondering a bit over it I can't find a nicer solution. So, ok. Thanks, Richard. Agreed. I'm not fond of it either, and I feel it's a bit fragile. An alternative would be to go back to handling the exponentiation expressions outside of the ops list (generating an explicit multiply to hook them up with the results of normal linear/parallel expansion). In hindsight, placing the exponentiation results in the ops list and letting the rank order handle things introduces some complexity as well as saving some. The DAG'd nature of the exponentiation expressions isn't a perfect fit for the pure tree form of the reassociated multiplies. Let me know if you'd like me to pursue that instead. Thanks, Bill Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! 
This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. + +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before
Re: PING: [PATCH] Fix PR53217
On Wed, 2012-05-16 at 14:05 +0200, Richard Guenther wrote: On Wed, 16 May 2012, William J. Schmidt wrote: On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote: On Tue, 15 May 2012, William J. Schmidt wrote: Ping. I don't like it too much - but pondering a bit over it I can't find a nicer solution. So, ok. Thanks, Richard. Agreed. I'm not fond of it either, and I feel it's a bit fragile. An alternative would be to go back to handling the exponentiation expressions outside of the ops list (generating an explicit multiply to hook them up with the results of normal linear/parallel expansion). In hindsight, placing the exponentiation results in the ops list and letting the rank order handle things introduces some complexity as well as saving some. The DAG'd nature of the exponentiation expressions isn't a perfect fit for the pure tree form of the reassociated multiplies. True. Let me know if you'd like me to pursue that instead. You can try - if the result looks better I'm all for it ;) OK. :) I'll commit this for now to deal with the fallout, and work on the alternative version in my spare time. Thanks, Bill Thanks, Richard. Thanks, Bill Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. 
gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. + +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1
Ping: [PATCH] Hoist adjacent pointer loads
Ping. Thanks, Bill On Thu, 2012-05-03 at 09:33 -0500, William J. Schmidt wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.
If not see #include "cfgloop.h" #include "tree-data-ref.h" #include "tree-pretty-print.h" +#include "gimple-pretty-print.h" +#include "insn-config.h" +#include "expr.h" +#include "optabs.h" +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI <x' (bb0), ...>; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + ---------------------- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (<expr>).field1; + goto bb3; + bb2: + x2 = (<expr>).field2; + bb3: + # x = PHI <x1, x2>; + + with + + bb0: + x1 = (<expr>).field1; + x2 = (<expr>).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI <x1, x2>; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL.
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional
Re: [PATCH][1/n] Improve vectorization in PR53355
On Tue, 2012-05-15 at 14:17 +0200, Richard Guenther wrote: This is the first patch to make the generated code for the testcase in PR53355 better. It teaches VRP about LSHIFT_EXPRs (albeit only of a very simple form). Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. This appears to have caused http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53385. Thanks, Bill Richard. 2012-05-15 Richard Guenther rguent...@suse.de PR tree-optimization/53355 * tree-vrp.c (extract_range_from_binary_expr_1): Handle LSHIFT_EXPRs by constants. * gcc.dg/tree-ssa/vrp67.c: New testcase. Index: gcc/tree-vrp.c === *** gcc/tree-vrp.c(revision 187503) --- gcc/tree-vrp.c(working copy) *** extract_range_from_binary_expr_1 (value_ *** 2403,2408 --- 2403,2409 && code != ROUND_DIV_EXPR && code != TRUNC_MOD_EXPR && code != RSHIFT_EXPR + && code != LSHIFT_EXPR && code != MIN_EXPR && code != MAX_EXPR && code != BIT_AND_EXPR *** extract_range_from_binary_expr_1 (value_ *** 2596,2601 --- 2597,2636 extract_range_from_multiplicative_op_1 (vr, code, vr0, vr1); return; } + else if (code == LSHIFT_EXPR) + { + /* If we have a LSHIFT_EXPR with any shift values outside [0..prec-1], + then drop to VR_VARYING. Outside of this range we get undefined + behavior from the shift operation. We cannot even trust + SHIFT_COUNT_TRUNCATED at this stage, because that applies to rtl + shifts, and the operation at the tree level may be widened. */ + if (vr1.type != VR_RANGE + || !value_range_nonnegative_p (vr1) + || TREE_CODE (vr1.max) != INTEGER_CST + || compare_tree_int (vr1.max, TYPE_PRECISION (expr_type) - 1) == 1) + { + set_value_range_to_varying (vr); + return; + } + + /* We can map shifts by constants to MULT_EXPR handling.
*/ + if (range_int_cst_singleton_p (vr1)) + { + value_range_t vr1p = { VR_RANGE, NULL_TREE, NULL_TREE, NULL }; + vr1p.min + = double_int_to_tree (expr_type, + double_int_lshift (double_int_one, + TREE_INT_CST_LOW (vr1.min), + TYPE_PRECISION (expr_type), + false)); + vr1p.max = vr1p.min; + extract_range_from_multiplicative_op_1 (vr, MULT_EXPR, vr0, vr1p); + return; + } + + set_value_range_to_varying (vr); + return; + } else if (code == TRUNC_DIV_EXPR || code == FLOOR_DIV_EXPR || code == CEIL_DIV_EXPR Index: gcc/testsuite/gcc.dg/tree-ssa/vrp67.c === *** gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0) --- gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0) *** *** 0 --- 1,38 + /* { dg-do compile } */ + /* { dg-options "-O2 -fdump-tree-vrp1" } */ + + unsigned foo (unsigned i) + { + if (i == 2) + { + i = i << 2; + if (i != 8) + link_error (); + } + return i; + } + unsigned bar (unsigned i) + { + if (i == 1 << (sizeof (unsigned) * 8 - 1)) + { + i = i << 1; + if (i != 0) + link_error (); + } + return i; + } + unsigned baz (unsigned i) + { + i = i & 15; + if (i == 0) + return 0; + i = 1000 - i; + i <<= 1; + i >>= 1; + if (i == 0) + link_error (); + return i; + } + + /* { dg-final { scan-tree-dump-times "Folding predicate" 3 "vrp1" } } */ + /* { dg-final { cleanup-tree-dump "vrp1" } } */
PING: [PATCH] Fix PR53217
Ping. Thanks, Bill On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote: This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. 
+ +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c(revision 187117) +++ gcc/tree-ssa-reassoc.c(working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); + + /* The moved multiply may be DAG'd from multiple calls if it + was the result of a cached multiply. Only move it once. + Rank order ensures we move it to the right place the first + time. */ + if (next) + *next = NULL; + else + { + next = (gimple *) pointer_map_insert (bip_map, *mpy); + *next = NULL; + } +} } /* This function checks three consequtive operands in @@ -3281,6 +3315,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent while (true) { HOST_WIDE_INT power; + gimple last_mul = NULL; /* First look for the largest cached product of factors from preceding iterations. If found, create a builtin_powi for @@ -3318,16 +3353,25 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent } else { + gimple *value; + iter_result = get_reassoc_pow_ssa_name (target, type); pow_stmt = gimple_build_call (powi_fndecl, 2, rf1-repr
[PATCH, 4.7] Backport fix to [un]signed_type_for
Backporting this patch to 4.7 fixes a problem building Fedora 17. Bootstrapped and regression tested on powerpc64-unknown-linux-gnu. Is the backport OK? Thanks, Bill 2012-05-10 Bill Schmidt wschm...@vnet.linux.ibm.com Backport from trunk: 2012-03-12 Richard Guenther rguent...@suse.de * tree.c (signed_or_unsigned_type_for): Use build_nonstandard_integer_type. (signed_type_for): Adjust documentation. (unsigned_type_for): Likewise. * tree-pretty-print.c (dump_generic_node): Use standard names for non-standard integer types if available. Index: gcc/tree-pretty-print.c === --- gcc/tree-pretty-print.c (revision 187368) +++ gcc/tree-pretty-print.c (working copy) @@ -723,11 +723,41 @@ dump_generic_node (pretty_printer *buffer, tree no } else if (TREE_CODE (node) == INTEGER_TYPE) { - pp_string (buffer, (TYPE_UNSIGNED (node) - ? unnamed-unsigned: - : unnamed-signed:)); - pp_decimal_int (buffer, TYPE_PRECISION (node)); - pp_string (buffer, ); + if (TYPE_PRECISION (node) == CHAR_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned char + : signed char)); + else if (TYPE_PRECISION (node) == SHORT_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned short + : signed short)); + else if (TYPE_PRECISION (node) == INT_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned int + : signed int)); + else if (TYPE_PRECISION (node) == LONG_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned long + : signed long)); + else if (TYPE_PRECISION (node) == LONG_LONG_TYPE_SIZE) + pp_string (buffer, (TYPE_UNSIGNED (node) + ? unsigned long long + : signed long long)); + else if (TYPE_PRECISION (node) = CHAR_TYPE_SIZE + exact_log2 (TYPE_PRECISION (node))) + { + pp_string (buffer, (TYPE_UNSIGNED (node) ? uint : int)); + pp_decimal_int (buffer, TYPE_PRECISION (node)); + pp_string (buffer, _t); + } + else + { + pp_string (buffer, (TYPE_UNSIGNED (node) + ? 
unnamed-unsigned: + : unnamed-signed:)); + pp_decimal_int (buffer, TYPE_PRECISION (node)); + pp_string (buffer, ); + } } else if (TREE_CODE (node) == COMPLEX_TYPE) { Index: gcc/tree.c === --- gcc/tree.c (revision 187368) +++ gcc/tree.c (working copy) @@ -10162,32 +10162,26 @@ widest_int_cst_value (const_tree x) return val; } -/* If TYPE is an integral type, return an equivalent type which is -unsigned iff UNSIGNEDP is true. If TYPE is not an integral type, -return TYPE itself. */ +/* If TYPE is an integral or pointer type, return an integer type with + the same precision which is unsigned iff UNSIGNEDP is true, or itself + if TYPE is already an integer type of signedness UNSIGNEDP. */ tree signed_or_unsigned_type_for (int unsignedp, tree type) { - tree t = type; - if (POINTER_TYPE_P (type)) -{ - /* If the pointer points to the normal address space, use the -size_type_node. Otherwise use an appropriate size for the pointer -based on the named address space it points to. */ - if (!TYPE_ADDR_SPACE (TREE_TYPE (t))) - t = size_type_node; - else - return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp); -} + if (TREE_CODE (type) == INTEGER_TYPE TYPE_UNSIGNED (type) == unsignedp) +return type; - if (!INTEGRAL_TYPE_P (t) || TYPE_UNSIGNED (t) == unsignedp) -return t; + if (!INTEGRAL_TYPE_P (type) + !POINTER_TYPE_P (type)) +return NULL_TREE; - return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp); + return build_nonstandard_integer_type (TYPE_PRECISION (type), unsignedp); } -/* Returns unsigned variant of TYPE. */ +/* If TYPE is an integral or pointer type, return an integer type with + the same precision which is unsigned, or itself if TYPE is already an + unsigned integer type. */ tree unsigned_type_for
Re: [PATCH, 4.7] Backport fix to [un]signed_type_for
On Thu, 2012-05-10 at 18:49 +0200, Jakub Jelinek wrote: On Thu, May 10, 2012 at 11:44:27AM -0500, William J. Schmidt wrote: Backporting this patch to 4.7 fixes a problem building Fedora 17. Bootstrapped and regression tested on powerpc64-unknown-linux-gnu. Is the backport OK? For 4.7 I'd very much prefer a less intrusive change (i.e. change the java langhook) instead, but I'll defer to Richard if he prefers this over that. OK. If that's desired, this is the possible change to the langhook: Index: gcc/java/typeck.c === --- gcc/java/typeck.c (revision 187158) +++ gcc/java/typeck.c (working copy) @@ -189,6 +189,12 @@ java_type_for_size (unsigned bits, int unsignedp) return unsignedp ? unsigned_int_type_node : int_type_node; if (bits <= TYPE_PRECISION (long_type_node)) return unsignedp ? unsigned_long_type_node : long_type_node; + /* A 64-bit target with TImode requires 128-bit type definitions + for bitsizetype. */ + if (int128_integer_type_node + && bits == TYPE_PRECISION (int128_integer_type_node)) +return (unsignedp ? int128_unsigned_type_node + : int128_integer_type_node); return 0; } which also fixed the problem and bootstraps without regressions. Whichever you guys prefer is fine with me. Thanks, Bill 2012-05-10 Bill Schmidt wschm...@vnet.linux.ibm.com Backport from trunk: 2012-03-12 Richard Guenther rguent...@suse.de * tree.c (signed_or_unsigned_type_for): Use build_nonstandard_integer_type. (signed_type_for): Adjust documentation. (unsigned_type_for): Likewise. * tree-pretty-print.c (dump_generic_node): Use standard names for non-standard integer types if available. Jakub
[PATCH] Fix PR53217
This fixes another statement-placement issue when reassociating expressions with repeated factors. Multiplies feeding into __builtin_powi calls were not getting placed properly ahead of them in some cases. Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new regressions. I've also run SPEC cpu2006 with no build or correctness issues. OK for trunk? Thanks, Bill gcc: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * tree-ssa-reassoc.c (bip_map): New static variable. (possibly_move_powi): Move feeding multiplies with __builtin_powi call. (attempt_builtin_powi): Save feeding multiplies on a stack. (reassociate_bb): Create and destroy bip_map. gcc/testsuite: 2012-05-08 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/53217 * gfortran.dg/pr53217.f90: New test. Index: gcc/testsuite/gfortran.dg/pr53217.f90 === --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0) @@ -0,0 +1,28 @@ +! { dg-do compile } +! { dg-options -O1 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when statements were emitted out of order, failing verify_ssa. 
+ +MODULE xc_cs1 + INTEGER, PARAMETER :: dp=KIND(0.0D0) + REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, + c = 0.2533_dp, + d = 0.349_dp +CONTAINS + SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho, + npoints, error) +REAL(KIND=dp), DIMENSION(*), + INTENT(INOUT) :: e_rho_rho, e_rho_ndrho, +e_ndrho_ndrho +DO ip = 1, npoints + IF ( rho(ip) eps_rho ) THEN + oc = 1.0_dp/(r*r*r3*r3 + c*g*g) + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 + +104*r**6)*od**3*oc**4 + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4 + END IF +END DO + END SUBROUTINE cs1_u_2 +END MODULE xc_cs1 Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 187117) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -200,6 +200,10 @@ static long *bb_rank; /* Operand-rank hashtable. */ static struct pointer_map_t *operand_rank; +/* Map from inserted __builtin_powi calls to multiply chains that + feed them. */ +static struct pointer_map_t *bip_map; + /* Forward decls. */ static long get_rank (tree); @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var) static void possibly_move_powi (gimple stmt, tree op) { - gimple stmt2; + gimple stmt2, *mpy; tree fndecl; gimple_stmt_iterator gsi1, gsi2; @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op) return; } + /* Move the __builtin_powi. */ gsi1 = gsi_for_stmt (stmt); gsi2 = gsi_for_stmt (stmt2); gsi_move_before (gsi2, gsi1); + + /* See if there are multiplies feeding the __builtin_powi base + argument that must also be moved. */ + while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL) +{ + /* If we've already moved this statement, we're done. This is + identified by a NULL entry for the statement in bip_map. 
*/ + gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy); + if (next && !*next) + return; + + stmt = stmt2; + stmt2 = *mpy; + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); + + /* The moved multiply may be DAG'd from multiple calls if it +was the result of a cached multiply. Only move it once. +Rank order ensures we move it to the right place the first +time. */ + if (next) + *next = NULL; + else + { + next = (gimple *) pointer_map_insert (bip_map, *mpy); + *next = NULL; + } +} } /* This function checks three consequtive operands in @@ -3281,6 +3315,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent while (true) { HOST_WIDE_INT power; + gimple last_mul = NULL; /* First look for the largest cached product of factors from preceding iterations. If found, create a builtin_powi for @@ -3318,16 +3353,25 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent } else { + gimple *value; + iter_result = get_reassoc_pow_ssa_name (target, type); pow_stmt = gimple_build_call (powi_fndecl, 2, rf1->repr, build_int_cst (integer_type_node, power)); gimple_call_set_lhs (pow_stmt,
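A source-level sketch of what is at stake in this patch: when reassociation rewrites repeated factors into `__builtin_powi` calls, any multiply that computes the call's *base* must be emitted before the call that uses it, or verify_ssa fails. The helper below is a hypothetical stand-in for `__builtin_powi`, not GCC's internal API; the two functions show the before/after shape of the PR53217-style expression.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for __builtin_powi: integer exponentiation by
   repeated multiplication.  */
static double powi_sketch(double base, int exp)
{
    double result = 1.0;
    for (int i = 0; i < exp; i++)
        result *= base;
    return result;
}

/* An expression with repeated factors of P and Q, the shape that
   triggered PR53217.  */
static double original(double c1, double c2, double c3, double p, double q)
{
    return c1 + 2.0 * p * c2 + 3.0 * p * p * q * q * q * c3;
}

/* After reassociation: p*q is a multiply feeding the powi base, so its
   definition must dominate the powi call -- exactly the ordering the
   bip_map bookkeeping preserves.  (p*q)^2 * q == p^2 * q^3.  */
static double reassociated(double c1, double c2, double c3, double p, double q)
{
    double pq = p * q;                       /* feeding multiply */
    return c1 + 2.0 * p * c2 + 3.0 * powi_sketch(pq, 2) * q * c3;
}
```

Both functions compute the same value; only the multiply chain differs.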
[PATCH] Hoist adjacent pointer loads
This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Ok for trunk? Thanks, Bill 2012-05-03 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward declaration. (hoist_adjacent_loads, gate_hoist_loads): New forward declarations. (tree_ssa_phiopt): Call gate_hoist_loads. (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call. (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call hoist_adjacent_loads. (local_reg_dependence): New function. (local_mem_dependence): Likewise. (hoist_adjacent_loads): Likewise. (gate_hoist_loads): Likewise. * common.opt (fhoist-adjacent-loads): New switch. * Makefile.in (tree-ssa-phiopt.o): Added dependencies. * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param. Index: gcc/tree-ssa-phiopt.c === --- gcc/tree-ssa-phiopt.c (revision 187057) +++ gcc/tree-ssa-phiopt.c (working copy) @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.
If not see #include cfgloop.h #include tree-data-ref.h #include tree-pretty-print.h +#include gimple-pretty-print.h +#include insn-config.h +#include expr.h +#include optabs.h +#ifndef HAVE_conditional_move +#define HAVE_conditional_move (0) +#endif + static unsigned int tree_ssa_phiopt (void); -static unsigned int tree_ssa_phiopt_worker (bool); +static unsigned int tree_ssa_phiopt_worker (bool, bool); static bool conditional_replacement (basic_block, basic_block, edge, edge, gimple, tree, tree); static int value_replacement (basic_block, basic_block, @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block); static struct pointer_set_t * get_non_trapping (void); static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree); +static void hoist_adjacent_loads (basic_block, basic_block, + basic_block, basic_block); +static bool gate_hoist_loads (void); /* This pass tries to replaces an if-then-else block with an assignment. We have four kinds of transformations. Some of these @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_ bb2: x = PHI x' (bb0), ...; - A similar transformation is done for MAX_EXPR. */ + A similar transformation is done for MAX_EXPR. + + This pass also performs a fifth transformation of a slightly different + flavor. + + Adjacent Load Hoisting + -- + + This transformation replaces + + bb0: + if (...) goto bb2; else goto bb1; + bb1: + x1 = (expr).field1; + goto bb3; + bb2: + x2 = (expr).field2; + bb3: + # x = PHI x1, x2; + + with + + bb0: + x1 = (expr).field1; + x2 = (expr).field2; + if (...) goto bb2; else goto bb1; + bb1: + goto bb3; + bb2: + bb3: + # x = PHI x1, x2; + + The purpose of this transformation is to enable generation of conditional + move instructions such as Intel CMOVE or PowerPC ISEL. 
Because one of + the loads is speculative, the transformation is restricted to very + specific cases to avoid introducing a page fault. We are looking for + the common idiom: + + if (...) + x = y->left; + else + x = y->right; + + where left and right are typically adjacent pointers in a tree structure. */ + static unsigned int tree_ssa_phiopt (void) { - return tree_ssa_phiopt_worker (false); + return tree_ssa_phiopt_worker (false, gate_hoist_loads ()); } /* This pass tries to transform conditional stores into unconditional @@ -190,7 +245,7 @@ tree_ssa_phiopt (void) static unsigned int tree_ssa_cs_elim (void) { - return tree_ssa_phiopt_worker (true); + return tree_ssa_phiopt_worker (true, false); } /*
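The transformation described in the comment block above can be written out as plain C. Both functions below return the same pointer; the hoisted form performs both loads unconditionally, which is what lets the branch collapse into a conditional move. The struct layout (two strictly adjacent pointer fields) is the exact shape the pass looks for; names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Tree node with strictly adjacent pointer fields.  */
struct node {
    struct node *left;   /* offset 0 */
    struct node *right;  /* offset sizeof(void *) */
};

/* The idiom before the transformation: only one load executes.  */
static struct node *pick_branchy(const struct node *y, int cond)
{
    struct node *x;
    if (cond)
        x = y->left;
    else
        x = y->right;
    return x;
}

/* After hoisting: both loads execute, then a conditional select.  This
   is safe to speculate because left and right fall within one 16-byte
   naturally aligned block on common 64-bit targets, so the extra load
   cannot touch a new page.  */
static struct node *pick_hoisted(const struct node *y, int cond)
{
    struct node *x1 = y->left;
    struct node *x2 = y->right;
    return cond ? x1 : x2;
}
```

A compiler can now expand the ternary in `pick_hoisted` as a branch-free conditional move (ISEL/CMOVE) rather than a conditional jump.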
Re: [PATCH] Hoist adjacent pointer loads
On Thu, 2012-05-03 at 09:40 -0600, Jeff Law wrote: On 05/03/2012 08:33 AM, William J. Schmidt wrote: This patch was posted for comment back in February during stage 4. It addresses a performance issue noted in the EEMBC routelookup benchmark on a common idiom: if (...) x = y->left; else x = y->right; If the two loads can be hoisted out of the if/else, the if/else can be replaced by a conditional move instruction on architectures that support one. Because this speculates one of the loads, the patch constrains the optimization to avoid introducing page faults. Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no new failures. The patch provides significant improvement to the routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006. One question is what optimization level should be required for this. Because of the speculation, -O3 might be in order. I don't believe -Ofast is required as there is no potential correctness issue involved. Right now the patch doesn't check the optimization level (like the rest of the phi-opt transforms), which is likely a poor choice. Doesn't this need to be conditionalized on the memory model that's currently active? Yes and no. What's important is that you don't want to introduce page faults (or less urgently, cache misses) by speculating the load. So the patch is currently extremely constrained, and likely will always stay that way. Only fields that are pointers and that are strictly adjacent are hoisted, and only if they're in the same 16-byte block. (The number 16 is a parameter that can be adjusted.) Hopefully I didn't miss your point -- let me know if I did and I'll try again. :) Thanks, Bill jeff
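The "same 16-byte block" constraint Bill describes can be sketched as a pure offset check. This is illustrative only: the real pass works on GIMPLE field references and the block size comes from a tunable parameter (PARAM_MIN_CMOVE_STRUCT_ALIGN in the patch), but the arithmetic below captures the safety condition — if both fields fit in one aligned 16-byte block, speculating the second load cannot fault on a new page.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the adjacency/safety test: accept two field accesses of
   WIDTH bytes at offsets OFF1 and OFF2 only if every byte of both
   fields lies in the same aligned 16-byte block.  */
static int same_16_byte_block(size_t off1, size_t off2, size_t width)
{
    return off1 / 16 == (off1 + width - 1) / 16   /* field 1 doesn't straddle */
        && off2 / 16 == (off2 + width - 1) / 16   /* field 2 doesn't straddle */
        && off1 / 16 == off2 / 16;                /* same block */
}
```

For a 64-bit tree node, `left` at offset 0 and `right` at offset 8 pass the check; fields split across a 16-byte boundary do not.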
Re: [PATCH] Hoist adjacent pointer loads
On Thu, 2012-05-03 at 11:44 -0600, Jeff Law wrote: On 05/03/2012 10:47 AM, William J. Schmidt wrote: Yes and no. What's important is that you don't want to introduce page faults (or less urgently, cache misses) by speculating the load. So the patch is currently extremely constrained, and likely will always stay that way. Only fields that are pointers and that are strictly adjacent are hoisted, and only if they're in the same 16-byte block. (The number 16 is a parameter that can be adjusted.) Hopefully I didn't miss your point -- let me know if I did and I'll try again. :) You missed the point :-) Under the C++11 memory model you can't introduce new data races on objects which might be visible to multiple threads. This requirement can restrict speculation in many cases. Furthermore, it sounds like C11 will have similar constraints. I believe there's a wiki page which touches on these kinds of issues. That doesn't mean we can't ever do the optimization, just that we have to be more careful than we have in the past when mucking around with memory optimizations. OK, thanks! Looks like I have some reading to do about the new memory models. However, from the wiki page I see: A speculative load whose results are thrown away is considered not to have changed the semantics of the program, and is therefore allowed. That seems to cover the case here: the load is hoisted, but if the path where it was originally loaded is not executed, its result is discarded. If needed, though, what flags/detection mechanisms are available for determining that the load speculation should be disabled? Thanks, Bill jeff
Re: [PATCH] Improve COND_EXPR expansion
On Mon, 2012-04-30 at 20:22 -0700, Andrew Pinski wrote: Hi, This patch improves the expansion of COND_EXPR into RTL, directly using conditional moves. I had to fix a bug in the x86 backend where emit_conditional_move could cause a crash, as we had a comparison mode of DImode which is not handled by the 32-bit part; can_conditionally_move_p returned true because we had SImode for the other operands. Note that other targets might need a similar fix to the x86 one, but I could not test those targets, and this is really the first time emit_conditional_move is being called with different modes for the comparison and the other operands where the comparison mode is not of the CC class. Hi Andrew, I verified your patch on powerpc64-unknown-linux-gnu. There were no new testcase regressions, and SPEC cpu2006 built ok with your changes. Hope this helps! Bill The main reason to do this conversion early rather than waiting for if-conversion is that the resulting code is slightly better. Also the compiler is slightly faster. OK? Bootstrapped and tested on both mips64-linux-gnu (where it was originally written) and x86_64-linux-gnu. Thanks, Andrew Pinski ChangeLog: * expr.c (convert_tree_comp_to_rtx): New function. (expand_expr_real_2): Try using conditional moves for COND_EXPRs if they exist. * config/i386/i386.c (ix86_expand_int_movcc): Disallow comparison modes of DImode for 32bits and TImode.
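What a conditional-move expansion buys is a branch-free select: `dest = cond ? a : b` computed with no control flow. A portable bit-mask rendition of that operation is shown below — an illustration of what CMOV-class instructions compute in one step, not the RTL the patch emits.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select for 32-bit values: the mask is all ones when COND
   is nonzero and all zeros otherwise, so exactly one operand survives.
   Hardware conditional moves (x86 CMOV, PowerPC ISEL, MIPS MOVN/MOVZ)
   do this in a single instruction.  */
static uint32_t cmove_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)-(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Expanding COND_EXPR this way at RTL generation time avoids relying on the later if-conversion pass to rediscover the pattern.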
[Patch ping] Strength reduction
Thought I'd ping http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01225.html since it's been about six weeks. Any initial feedback would be very much appreciated! Thanks, Bill
[PATCH, powerpc] Fix PR47197
This fixes an error wherein a nontrivial expression passed to an Altivec built-in results in an ICE, following Joseph Myers's suggested approach in the bugzilla. Bootstrapped and tested with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-24 Bill Schmidt wschm...@linux.vnet.ibm.com PR target/47197 * config/rs6000/rs6000-c.c (fully_fold_convert): New function. (altivec_build_resolved_builtin): Call fully_fold_convert. gcc/testsuite: 2012-04-24 Bill Schmidt wschm...@linux.vnet.ibm.com PR target/47197 * gcc.target/powerpc/pr47197.c: New test. Index: gcc/testsuite/gcc.target/powerpc/pr47197.c === --- gcc/testsuite/gcc.target/powerpc/pr47197.c (revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr47197.c (revision 0) @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-maltivec" } */ + +/* Compile-only test to ensure that expressions can be passed to + Altivec builtins without error. */ + +#include <altivec.h> + +void func(unsigned char *buf, unsigned len) +{ +vec_dst(buf, (len >= 256 ? 0 : len) | 512, 2); +} Index: gcc/config/rs6000/rs6000-c.c === --- gcc/config/rs6000/rs6000-c.c(revision 186761) +++ gcc/config/rs6000/rs6000-c.c(working copy) @@ -3421,6 +3421,22 @@ rs6000_builtin_type_compatible (tree t, int id) } +/* In addition to calling fold_convert for EXPR of type TYPE, also + call c_fully_fold to remove any C_MAYBE_CONST_EXPRs that could be + hiding there (PR47197). */ + +static tree +fully_fold_convert (tree type, tree expr) +{ + tree result = fold_convert (type, expr); + bool maybe_const = true; + + if (!c_dialect_cxx ()) +result = c_fully_fold (result, false, &maybe_const); + + return result; +} + /* Build a tree for a function call to an Altivec non-overloaded builtin. The overloaded builtin that matched the types and args is described by DESC. The N arguments are given in ARGS, respectively.
@@ -3470,18 +3486,18 @@ altivec_build_resolved_builtin (tree *args, int n, break; case 1: call = build_call_expr (impl_fndecl, 1, - fold_convert (arg_type[0], args[0])); + fully_fold_convert (arg_type[0], args[0])); break; case 2: call = build_call_expr (impl_fndecl, 2, - fold_convert (arg_type[0], args[0]), - fold_convert (arg_type[1], args[1])); + fully_fold_convert (arg_type[0], args[0]), + fully_fold_convert (arg_type[1], args[1])); break; case 3: call = build_call_expr (impl_fndecl, 3, - fold_convert (arg_type[0], args[0]), - fold_convert (arg_type[1], args[1]), - fold_convert (arg_type[2], args[2])); + fully_fold_convert (arg_type[0], args[0]), + fully_fold_convert (arg_type[1], args[1]), + fully_fold_convert (arg_type[2], args[2])); break; default: gcc_unreachable ();
Re: [PATCH] Fix PR44214
On Mon, 2012-04-23 at 11:09 +0200, Richard Guenther wrote: On Fri, 20 Apr 2012, William J. Schmidt wrote: On Fri, 2012-04-20 at 11:32 -0700, H.J. Lu wrote: On Thu, Apr 19, 2012 at 6:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: It caused: FAIL: gcc.dg/torture/builtin-explog-1.c -O0 (test for excess errors) FAIL: gcc.dg/torture/builtin-power-1.c -O0 (test for excess errors) on x86. Hm, sorry, I don't know how that escaped my testing. This was due to the suggestion to have the optimize test encompass the -freciprocal-math test. Looks like this changes some expected behavior, at least for these two tests. Two options: Revert the move of the optimize test, or change the tests to require -O1 or above. Richard, what's your preference? Change the test to require -O1 or above. Richard. OK, following committed as obvious. Thanks, Bill gcc-testsuite: 2012-04-23 Bill Schmidt wschm...@linux.ibm.com PR regression/53076 * gcc.dg/torture/builtin-explog-1.c: Skip if -O0. * gcc.dg/torture/builtin-power-1.c: Likewise. 
Index: gcc/testsuite/gcc.dg/torture/builtin-explog-1.c === --- gcc/testsuite/gcc.dg/torture/builtin-explog-1.c (revision 186624) +++ gcc/testsuite/gcc.dg/torture/builtin-explog-1.c (working copy) @@ -7,6 +7,7 @@ /* { dg-do link } */ /* { dg-options "-ffast-math" } */ +/* { dg-skip-if "PR44214" { *-*-* } { "-O0" } { "" } } */ /* Define e with as many bits as found in builtins.c:dconste. */ #define M_E 2.7182818284590452353602874713526624977572470936999595749669676277241 Index: gcc/testsuite/gcc.dg/torture/builtin-power-1.c === --- gcc/testsuite/gcc.dg/torture/builtin-power-1.c (revision 186624) +++ gcc/testsuite/gcc.dg/torture/builtin-power-1.c (working copy) @@ -8,6 +8,7 @@ /* { dg-do link } */ /* { dg-options "-ffast-math" } */ /* { dg-add-options c99_runtime } */ +/* { dg-skip-if "PR44214" { *-*-* } { "-O0" } { "" } } */ #include "../builtins-config.h"
Re: [PATCH] Fix PR44214
On Fri, 2012-04-20 at 10:04 +0200, Richard Guenther wrote: On Thu, 19 Apr 2012, William J. Schmidt wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? See below ... Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * gcc.target/powerpc/pr44214-1.c: New test. * gcc.dg/pr44214-2.c: Likewise. * gcc.target/powerpc/pr44214-3.c: Likewise. Index: gcc/fold-const.c === --- gcc/fold-const.c(revision 186573) +++ gcc/fold-const.c(working copy) @@ -9693,6 +9693,48 @@ fold_addr_of_array_ref_difference (location_t loc, return NULL_TREE; } +/* If the real or vector real constant CST of type TYPE has an exact + inverse, return it, else return NULL. 
*/ + +static tree +exact_inverse (tree type, tree cst) +{ + REAL_VALUE_TYPE r; + tree unit_type, *elts; + enum machine_mode mode; + unsigned vec_nelts, i; + + switch (TREE_CODE (cst)) +{ +case REAL_CST: + r = TREE_REAL_CST (cst); + + if (exact_real_inverse (TYPE_MODE (type), r)) + return build_real (type, r); + + return NULL_TREE; + +case VECTOR_CST: + vec_nelts = VECTOR_CST_NELTS (cst); + elts = XALLOCAVEC (tree, vec_nelts); + unit_type = TREE_TYPE (type); + mode = TYPE_MODE (unit_type); + + for (i = 0; i vec_nelts; i++) + { + r = TREE_REAL_CST (VECTOR_CST_ELT (cst, i)); + if (!exact_real_inverse (mode, r)) + return NULL_TREE; + elts[i] = build_real (unit_type, r); + } + + return build_vector (type, elts); + +default: + return NULL_TREE; +} +} + /* Fold a binary expression of code CODE and type TYPE with operands OP0 and OP1. LOC is the location of the resulting expression. Return the folded expression if folding is successful. Otherwise, @@ -11734,23 +11776,25 @@ fold_binary_loc (location_t loc, so only do this if -freciprocal-math. We can actually always safely do it if ARG1 is a power of two, but it's hard to tell if it is or not in a portable manner. */ - if (TREE_CODE (arg1) == REAL_CST) + if (TREE_CODE (arg1) == REAL_CST + || (TREE_CODE (arg1) == COMPLEX_CST + COMPLEX_FLOAT_TYPE_P (TREE_TYPE (arg1))) + || (TREE_CODE (arg1) == VECTOR_CST + VECTOR_FLOAT_TYPE_P (TREE_TYPE (arg1 { if (flag_reciprocal_math - 0 != (tem = const_binop (code, build_real (type, dconst1), + 0 != (tem = fold_binary (code, type, build_one_cst (type), arg1))) Any reason for not using const_binop? As it turns out, no. I (blindly) made this change based on your comment in the PR... The fold code should probably simply use fold_binary to do the constant folding (which already should handle 1/x for x vector and complex. There is a build_one_cst to build the constant 1 for any type). ...but now that I've looked at it that was unnecessary, so I must have misinterpreted this. 
I'll revert to using const_binop. return fold_build2_loc (loc, MULT_EXPR, type, arg0, tem); - /* Find the reciprocal if optimizing and the result is exact. */ - if (optimize) + /* Find the reciprocal if optimizing and the result is exact. +TODO: Complex reciprocal not implemented. */ + if (optimize + TREE_CODE (arg1) != COMPLEX_CST) I know this is all pre-existing, but really the flag_reciprocal_math case should be under if (optimize), too. So, can you move this check to the toplevel covering both cases? Sure. The testcases should apply to generic vectors, too, and should scan the .original dump (where folding first applied). So they should not be target specific (and they should use -freciprocal-math). OK. I was ignorant of the generic vector syntax using __attribute__. If I change pr44214-1.c to use
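The "result is exact" condition being discussed has a simple characterization for binary floating point: `1.0/x` is exact precisely when `x` is a (finite, nonzero) power of two whose reciprocal neither overflows nor underflows. The helper below mirrors the *intent* of `exact_real_inverse`, not its implementation — `frexp` decomposes `x = m * 2^e` with `0.5 <= m < 1`, and `m == 0.5` identifies a pure power of two.

```c
#include <assert.h>
#include <math.h>

/* Sketch: does 1.0/x have an exact binary floating-point representation?  */
static int has_exact_reciprocal(double x)
{
    int e;
    double m;

    if (x == 0.0 || !isfinite(x))
        return 0;
    m = frexp(fabs(x), &e);           /* x = m * 2^e, 0.5 <= m < 1 */
    return m == 0.5                   /* power of two */
        && isnormal(1.0 / x);         /* reciprocal doesn't over/underflow */
}
```

This is why dividing by 2.0 can always be folded to multiplying by 0.5, while dividing by 3.0 can only be folded under -freciprocal-math.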
Re: [PATCH] Fix PR44214
On Fri, 2012-04-20 at 11:32 -0700, H.J. Lu wrote: On Thu, Apr 19, 2012 at 6:58 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by recripocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: It caused: FAIL: gcc.dg/torture/builtin-explog-1.c -O0 (test for excess errors) FAIL: gcc.dg/torture/builtin-power-1.c -O0 (test for excess errors) on x86. Hm, sorry, I don't know how that escaped my testing. This was due to the suggestion to have the optimize test encompass the -freciprocal-math test. Looks like this changes some expected behavior, at least for these two tests. Two options: Revert the move of the optimize test, or change the tests to require -O1 or above. Richard, what's your preference? Thanks, Bill
[PATCH] Fix PR44214
This enhances constant folding for division by complex and vector constants. When -freciprocal-math is present, such divisions are converted into multiplies by the constant reciprocal. When an exact reciprocal is available, this is done for vector constants when optimizing. I did not implement logic for exact reciprocals of complex constants because either (a) the complexity doesn't justify the likelihood of occurrence, or (b) I'm lazy. Your choice. ;) Bootstrapped with no new regressions on powerpc64-unknown-linux-gnu. Ok for trunk? Thanks, Bill gcc: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * fold-const.c (exact_inverse): New function. (fold_binary_loc): Fold vector and complex division by constant into multiply by reciprocal with flag_reciprocal_math; fold vector division by constant into multiply by reciprocal with exact inverse. gcc/testsuite: 2012-04-19 Bill Schmidt wschm...@linux.vnet.ibm.com PR rtl-optimization/44214 * gcc.target/powerpc/pr44214-1.c: New test. * gcc.dg/pr44214-2.c: Likewise. * gcc.target/powerpc/pr44214-3.c: Likewise. Index: gcc/fold-const.c === --- gcc/fold-const.c(revision 186573) +++ gcc/fold-const.c(working copy) @@ -9693,6 +9693,48 @@ fold_addr_of_array_ref_difference (location_t loc, return NULL_TREE; } +/* If the real or vector real constant CST of type TYPE has an exact + inverse, return it, else return NULL.
*/ + +static tree +exact_inverse (tree type, tree cst) +{ + REAL_VALUE_TYPE r; + tree unit_type, *elts; + enum machine_mode mode; + unsigned vec_nelts, i; + + switch (TREE_CODE (cst)) +{ +case REAL_CST: + r = TREE_REAL_CST (cst); + + if (exact_real_inverse (TYPE_MODE (type), &r)) + return build_real (type, r); + + return NULL_TREE; + +case VECTOR_CST: + vec_nelts = VECTOR_CST_NELTS (cst); + elts = XALLOCAVEC (tree, vec_nelts); + unit_type = TREE_TYPE (type); + mode = TYPE_MODE (unit_type); + + for (i = 0; i < vec_nelts; i++) + { + r = TREE_REAL_CST (VECTOR_CST_ELT (cst, i)); + if (!exact_real_inverse (mode, &r)) + return NULL_TREE; + elts[i] = build_real (unit_type, r); + } + + return build_vector (type, elts); + +default: + return NULL_TREE; +} +} + /* Fold a binary expression of code CODE and type TYPE with operands OP0 and OP1. LOC is the location of the resulting expression. Return the folded expression if folding is successful. Otherwise, @@ -11734,23 +11776,25 @@ fold_binary_loc (location_t loc, so only do this if -freciprocal-math. We can actually always safely do it if ARG1 is a power of two, but it's hard to tell if it is or not in a portable manner. */ - if (TREE_CODE (arg1) == REAL_CST) + if (TREE_CODE (arg1) == REAL_CST + || (TREE_CODE (arg1) == COMPLEX_CST + && COMPLEX_FLOAT_TYPE_P (TREE_TYPE (arg1))) + || (TREE_CODE (arg1) == VECTOR_CST + && VECTOR_FLOAT_TYPE_P (TREE_TYPE (arg1)))) { if (flag_reciprocal_math - && 0 != (tem = const_binop (code, build_real (type, dconst1), + && 0 != (tem = fold_binary (code, type, build_one_cst (type), arg1))) return fold_build2_loc (loc, MULT_EXPR, type, arg0, tem); - /* Find the reciprocal if optimizing and the result is exact. */ - if (optimize) + /* Find the reciprocal if optimizing and the result is exact. +TODO: Complex reciprocal not implemented.
*/ + if (optimize + && TREE_CODE (arg1) != COMPLEX_CST) { - REAL_VALUE_TYPE r; - r = TREE_REAL_CST (arg1); - if (exact_real_inverse (TYPE_MODE(TREE_TYPE(arg0)), &r)) - { - tem = build_real (type, r); - return fold_build2_loc (loc, MULT_EXPR, type, - fold_convert_loc (loc, type, arg0), tem); - } + tree inverse = exact_inverse (TREE_TYPE (arg0), arg1); + + if (inverse) + return fold_build2_loc (loc, MULT_EXPR, type, arg0, inverse); } } /* Convert A/B/C to A/(B*C). */ Index: gcc/testsuite/gcc.target/powerpc/pr44214-3.c === --- gcc/testsuite/gcc.target/powerpc/pr44214-3.c(revision 0) +++ gcc/testsuite/gcc.target/powerpc/pr44214-3.c(revision 0) @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=power7 -fdump-tree-optimized" } */ + +void do_div (vector double *a, vector double *b) +{ + *a = *b / (vector double) { 2.0, 2.0 }; +} + +/* Since 2.0 has an exact reciprocal, constant folding should multiply *b + by the
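The pr44214-3.c test above relies on 2.0 having an exact reciprocal. At source level the transformation is a strict identity for power-of-two divisors — the two scalar functions below are bit-for-bit equal on every input, which is why the fold is legal even without -freciprocal-math (the vector case applies the same substitution lane by lane).

```c
#include <assert.h>

/* Division by a power of two and multiplication by its reciprocal are
   the same exact scaling of the exponent in IEEE 754 binary floating
   point, so these two functions always agree bit-for-bit.  */
static double div_by_two(double x)  { return x / 2.0; }
static double mul_by_half(double x) { return x * 0.5; }
```

With a non-power-of-two constant (say 3.0), `x * (1.0/3.0)` can differ from `x / 3.0` in the last bit, which is exactly the case gated behind -freciprocal-math.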
[PATCH] Allow un-distribution with repeated factors (PR52976 follow-up)
The emergency reassociation patch for PR52976 disabled un-distribution in the presence of repeated factors to avoid ICEs in zero_one_operation. This patch fixes such cases properly by teaching zero_one_operation about __builtin_pow* calls. Bootstrapped with no new regressions on powerpc64-linux. Also built SPEC cpu2000 and cpu2006 successfully. Ok for trunk? Thanks, Bill gcc: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (stmt_is_power_of_op): New function. (decrement_power): Likewise. (propagate_op_to_single_use): Likewise. (zero_one_operation): Handle __builtin_pow* calls in linearized expression trees; factor logic into propagate_op_to_single_use. (undistribute_ops_list): Allow operands with repeat counts 1. gcc/testsuite: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com gfortran.dg/reassoc_7.f: New test. gfortran.dg/reassoc_8.f: Likewise. gfortran.dg/reassoc_9.f: Likewise. gfortran.dg/reassoc_10.f: Likewise. Index: gcc/testsuite/gfortran.dg/reassoc_10.f === --- gcc/testsuite/gfortran.dg/reassoc_10.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_10.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Q,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Q,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P*c2 + 3.*P**2*Q**3*c3 + END + +! There should be five multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 5 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_7.f === --- gcc/testsuite/gfortran.dg/reassoc_7.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_7.f (revision 0) @@ -0,0 +1,16 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P*c2 + 3.*P**2*c3 + END + +! 
There should be two multiplies following un-distribution. + +! { dg-final { scan-tree-dump-times \\\* 2 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_8.f === --- gcc/testsuite/gfortran.dg/reassoc_8.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_8.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P**2*c2 + 3.*P**3*c3 + END + +! There should be three multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 3 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/testsuite/gfortran.dg/reassoc_9.f === --- gcc/testsuite/gfortran.dg/reassoc_9.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_9.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math -fdump-tree-optimized } + + SUBROUTINE S55199(P,Dvdph) + implicit none + real(8) :: c1,c2,c3,P,Dvdph + c1=0.1d0 + c2=0.2d0 + c3=0.3d0 + Dvdph = c1 + 2.*P**2*c2 + 3.*P**4*c3 + END + +! There should be three multiplies following un-distribution +! and power expansion. + +! { dg-final { scan-tree-dump-times \\\* 3 optimized } } +! { dg-final { cleanup-tree-dump optimized } } Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186495) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -1020,6 +1020,98 @@ oecount_cmp (const void *p1, const void *p2) return c1-id - c2-id; } +/* Return TRUE iff STMT represents a builtin call that raises OP + to some exponent. 
*/ + +static bool +stmt_is_power_of_op (gimple stmt, tree op) +{ + tree fndecl; + + if (!is_gimple_call (stmt)) +return false; + + fndecl = gimple_call_fndecl (stmt); + + if (!fndecl + || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) +return false; + + switch (DECL_FUNCTION_CODE (gimple_call_fndecl (stmt))) +{ +CASE_FLT_FN (BUILT_IN_POW): +CASE_FLT_FN (BUILT_IN_POWI): + return (operand_equal_p (gimple_call_arg (stmt, 0), op, 0)); + +default: + return false; +} +} + +/* Given STMT which is a __builtin_pow* call, decrement its exponent + in place and return the result. Assumes that stmt_is_power_of_op + was previously called
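The un-distribution this patch re-enables can be shown at source level. For the reassoc_7.f shape — `c1 + 2*P*c2 + 3*P**2*c3` — factoring out the repeated operand `P` reduces the number of multiplies; the two C functions below are mathematically equal (up to rounding, hence -ffast-math in the tests). Names and constants are illustrative, mirroring the Fortran testcase.

```c
#include <assert.h>
#include <math.h>

/* Before un-distribution: P appears in two products.  */
static double distributed(double c1, double c2, double c3, double p)
{
    return c1 + 2.0 * p * c2 + 3.0 * p * p * c3;
}

/* After un-distribution: the common factor P is pulled out, so fewer
   multiplies remain once constants are folded.  */
static double undistributed(double c1, double c2, double c3, double p)
{
    return c1 + p * (2.0 * c2 + 3.0 * p * c3);
}
```

The repeated-factor case (`P**2`, `P**3`, ...) is what previously tripped zero_one_operation, since the factor to strip was hidden inside a `__builtin_powi` call rather than an explicit multiply chain.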
[PATCH] Fix __builtin_powi placement (PR52976 follow-up)
The emergency patch for PR52976 manipulated the operand rank system to force inserted __builtin_powi calls to occur before uses of the call results. However, this is generally the wrong approach, as it forces other computations to move unnecessarily, and extends the lifetimes of other operands. This patch fixes the problem in the proper way, by letting the rank system determine where the __builtin_powi call belongs, and moving the call to that location during the expression rewrite. Bootstrapped with no new regressions on powerpc64-linux. SPEC cpu2000 and cpu2006 also build cleanly. Ok for trunk? Thanks, Bill gcc: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): Delete. (possibly_move_powi): New function. (rewrite_expr_tree): Call possibly_move_powi. (rewrite_expr_tree_parallel): Likewise. (attempt_builtin_powi): Change call of add_to_ops_vec_max_rank to call add_to_ops_vec instead. gcc/testsuite: 2012-04-17 Bill Schmidt wschm...@linux.vnet.ibm.com gfortran.dg/reassoc_11.f: New test. Index: gcc/testsuite/gfortran.dg/reassoc_11.f === --- gcc/testsuite/gfortran.dg/reassoc_11.f (revision 0) +++ gcc/testsuite/gfortran.dg/reassoc_11.f (revision 0) @@ -0,0 +1,17 @@ +! { dg-do compile } +! { dg-options -O3 -ffast-math } + +! This tests only for compile-time failure, which formerly occurred +! when a __builtin_powi was introduced by reassociation in a bad place. + + SUBROUTINE GRDURBAN(URBWSTR, ZIURB, GRIDHT) + + IMPLICIT NONE + INTEGER :: I + REAL :: SW2, URBWSTR, ZIURB, GRIDHT(87) + + SAVE + + SW2 = 1.6*(GRIDHT(I)/ZIURB)**0.667*URBWSTR**2 + + END Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186495) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,28 +544,6 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } -/* Add an operand entry to *OPS for the tree operand OP, giving the - new entry a larger rank than any other operand already in *OPS. 
*/ - -static void -add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) -{ - operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); - operand_entry_t oe1; - unsigned i; - unsigned max_rank = 0; - - FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) -if (oe1->rank > max_rank) - max_rank = oe1->rank; - - oe->op = op; - oe->rank = max_rank + 1; - oe->id = next_operand_entry_id++; - oe->count = 1; - VEC_safe_push (operand_entry_t, heap, *ops, oe); -} - /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP. */ @@ -2162,6 +2242,47 @@ remove_visited_stmt_chain (tree var) } } +/* If OP is an SSA name, find its definition and determine whether it + is a call to __builtin_powi. If so, move the definition prior to + STMT. Only do this during early reassociation. */ + +static void +possibly_move_powi (gimple stmt, tree op) +{ + gimple stmt2; + tree fndecl; + gimple_stmt_iterator gsi1, gsi2; + + if (!first_pass_instance + || !flag_unsafe_math_optimizations + || TREE_CODE (op) != SSA_NAME) +return; + + stmt2 = SSA_NAME_DEF_STMT (op); + + if (!is_gimple_call (stmt2) + || !has_single_use (gimple_call_lhs (stmt2))) +return; + + fndecl = gimple_call_fndecl (stmt2); + + if (!fndecl + || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL) +return; + + switch (DECL_FUNCTION_CODE (fndecl)) +{ +CASE_FLT_FN (BUILT_IN_POWI): + break; +default: + return; +} + + gsi1 = gsi_for_stmt (stmt); + gsi2 = gsi_for_stmt (stmt2); + gsi_move_before (gsi2, gsi1); +} + /* This function checks three consequtive operands in passed operands vector OPS starting from OPINDEX and swaps two operands if it is profitable for binary operation @@ -2267,6 +2388,8 @@ rewrite_expr_tree (gimple stmt, unsigned int opind print_gimple_stmt (dump_file, stmt, 0, 0); } + possibly_move_powi (stmt, oe1->op); + possibly_move_powi (stmt, oe2->op); } return; } @@ -2312,6 +2435,8 @@ rewrite_expr_tree (gimple stmt, unsigned int opind fprintf
(dump_file, " into "); print_gimple_stmt (dump_file, stmt, 0, 0); } + + possibly_move_powi (stmt, oe->op); } /* Recurse on the LHS of the binary operator, which is guaranteed to be the non-leaf side. */ @@ -2485,6 +2610,9 @@ rewrite_expr_tree_parallel (gimple stmt, int width fprintf (dump_file, " into "); print_gimple_stmt (dump_file, stmts[i], 0, 0); } + + possibly_move_powi (stmts[i], op1); + possibly_move_powi (stmts[i], op2); } remove_visited_stmt_chain (last_rhs1);
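For readers following the thread, the call that possibly_move_powi repositions computes its first argument raised to an integer power. The sketch below models that semantic with binary (repeated-squaring) exponentiation; it is an illustration only, not GCC's implementation, and the name `powi` is ours:

```c
#include <assert.h>

/* Hypothetical model of what a __builtin_powi call computes: x raised
   to an integer power via binary (repeated-squaring) exponentiation.
   Illustrative only -- not GCC source.  */
static double powi (double x, unsigned int n)
{
  double result = 1.0;
  while (n)
    {
      if (n & 1)
        result *= x;   /* fold in the factor for this set bit */
      x *= x;          /* square the base for the next bit */
      n >>= 1;
    }
  return result;
}
```

Because each squaring feeds the next step, the call's result must be defined before any use, which is exactly the ordering constraint the gsi_move_before call above enforces.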
Re: [PATCH] Fix PR52976
On Mon, 2012-04-16 at 11:01 +0200, Richard Guenther wrote: On Sat, Apr 14, 2012 at 7:05 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: This patch corrects two errors in reassociating expressions with repeated factors. First, undistribution needs to recognize repeated factors. For now, repeated factors will be ineligible for this optimization. In the future, this can be improved. Second, when a __builtin_powi call is introduced, its target SSA name must be given a rank higher than other operands in the operand list. Otherwise, uses of the call result may be introduced prior to the call. Bootstrapped and regression tested on powerpc64-linux. Confirmed that cpu2000 and cpu2006 SPEC tests build cleanly. OK for trunk? Ok, given it fixes quite some fallout. OK, thanks. But I wonder why the rank computation does not properly work automagically in the powi case. The reassociator generally tries to replace expressions in place unless the rank system tells it otherwise. At the moment, __builtin_powi calls are added right before the root of the reassociation chain (the last multiply). In the cases that failed, the natural rank of the call was one greater than the rank of the repeated factors, and there were other factors with higher rank than that. So the call was in the middle of the ranks but placement required it to have the highest rank. Because the call can't be further reassociated, it sort of ruins the flexibility of the rank system's placement algorithm. It would probably be better to insert the calls as early as necessary, but no earlier, to properly order things while letting the rank system do its job normally. That would help reduce lifetimes of reassociated values. I didn't see an obvious way to do that with a quick fix; I'm planning to think about it some more. Also for undistribution it looks like this might introduce missed optimizations? Thus, how hard would it be to teach it to properly handle ->count != 1? ISTR it does some counting itself.
I'm planning to work on that as well. I looked at it enough over the weekend to know it wasn't completely trivial, so I wanted to get the problem papered over for now. It shouldn't be too hard to get right. Thanks, Bill Thanks, Richard. Thanks, Bill 2012-04-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/52976 * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): New function. (undistribute_ops_list): Ops with repeat counts aren't eligible for undistribution. (attempt_builtin_powi): Call add_to_ops_vec_max_rank. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186393) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,6 +544,28 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } +/* Add an operand entry to *OPS for the tree operand OP, giving the + new entry a larger rank than any other operand already in *OPS. */ + +static void +add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) +{ + operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); + operand_entry_t oe1; + unsigned i; + unsigned max_rank = 0; + + FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) +if (oe1->rank > max_rank) + max_rank = oe1->rank; + + oe->op = op; + oe->rank = max_rank + 1; + oe->id = next_operand_entry_id++; + oe->count = 1; + VEC_safe_push (operand_entry_t, heap, *ops, oe); +} + /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP.
*/ @@ -1200,6 +1222,7 @@ undistribute_ops_list (enum tree_code opcode, dcode = gimple_assign_rhs_code (oe1def); if ((dcode != MULT_EXPR && dcode != RDIV_EXPR) + || oe1->count != 1 || !is_reassociable_op (oe1def, dcode, loop)) continue; @@ -1243,6 +1266,8 @@ undistribute_ops_list (enum tree_code opcode, oecount c; void **slot; size_t idx; + if (oe1->count != 1) + continue; c.oecode = oecode; c.cnt = 1; c.id = next_oecount_id++; @@ -1311,7 +1336,7 @@ undistribute_ops_list (enum tree_code opcode, FOR_EACH_VEC_ELT (operand_entry_t, subops[i], j, oe1) { - if (oe1->op == c->op) + if (oe1->op == c->op && oe1->count == 1) { SET_BIT (candidates2, i); ++nr_candidates2; @@ -3275,8 +3300,10 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent gsi_insert_before (gsi, pow_stmt, GSI_SAME_STMT); } - /* Append the result of this iteration to the ops vector. */ - add_to_ops_vec (ops, iter_result); + /* Append the result
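The max-rank placement discussed in this thread can be pictured with a small stand-alone model. The struct and names below are stand-ins for illustration, not GCC's operand_entry: a new entry gets a rank one higher than every existing entry, so a rank-ordered rewrite places all uses of the new value after its definition.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the operand-entry rank bookkeeping; illustrative only.  */
struct op_entry
{
  int rank;
};

/* Return a rank strictly greater than every rank already in OPS, the
   rank the patch assigns to a freshly created __builtin_powi result.  */
static int
max_rank_plus_one (const struct op_entry *ops, size_t n)
{
  int max_rank = 0;
  for (size_t i = 0; i < n; i++)
    if (ops[i].rank > max_rank)
      max_rank = ops[i].rank;
  return max_rank + 1;
}
```

As Bill notes above, forcing the maximum rank is a blunt instrument; the follow-up patch instead lets the natural rank stand and moves the call during the rewrite.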
[PATCH] Fix PR52976
This patch corrects two errors in reassociating expressions with repeated factors. First, undistribution needs to recognize repeated factors. For now, repeated factors will be ineligible for this optimization. In the future, this can be improved. Second, when a __builtin_powi call is introduced, its target SSA name must be given a rank higher than other operands in the operand list. Otherwise, uses of the call result may be introduced prior to the call. Bootstrapped and regression tested on powerpc64-linux. Confirmed that cpu2000 and cpu2006 SPEC tests build cleanly. OK for trunk? Thanks, Bill 2012-04-14 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/52976 * tree-ssa-reassoc.c (add_to_ops_vec_max_rank): New function. (undistribute_ops_list): Ops with repeat counts aren't eligible for undistribution. (attempt_builtin_powi): Call add_to_ops_vec_max_rank. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186393) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -544,6 +544,28 @@ add_repeat_to_ops_vec (VEC(operand_entry_t, heap) reassociate_stats.pows_encountered++; } +/* Add an operand entry to *OPS for the tree operand OP, giving the + new entry a larger rank than any other operand already in *OPS. */ + +static void +add_to_ops_vec_max_rank (VEC(operand_entry_t, heap) **ops, tree op) +{ + operand_entry_t oe = (operand_entry_t) pool_alloc (operand_entry_pool); + operand_entry_t oe1; + unsigned i; + unsigned max_rank = 0; + + FOR_EACH_VEC_ELT (operand_entry_t, *ops, i, oe1) +if (oe1->rank > max_rank) + max_rank = oe1->rank; + + oe->op = op; + oe->rank = max_rank + 1; + oe->id = next_operand_entry_id++; + oe->count = 1; + VEC_safe_push (operand_entry_t, heap, *ops, oe); +} + /* Return true if STMT is reassociable operation containing a binary operation with tree code CODE, and is inside LOOP.
*/ @@ -1200,6 +1222,7 @@ undistribute_ops_list (enum tree_code opcode, dcode = gimple_assign_rhs_code (oe1def); if ((dcode != MULT_EXPR && dcode != RDIV_EXPR) + || oe1->count != 1 || !is_reassociable_op (oe1def, dcode, loop)) continue; @@ -1243,6 +1266,8 @@ undistribute_ops_list (enum tree_code opcode, oecount c; void **slot; size_t idx; + if (oe1->count != 1) + continue; c.oecode = oecode; c.cnt = 1; c.id = next_oecount_id++; @@ -1311,7 +1336,7 @@ undistribute_ops_list (enum tree_code opcode, FOR_EACH_VEC_ELT (operand_entry_t, subops[i], j, oe1) { - if (oe1->op == c->op) + if (oe1->op == c->op && oe1->count == 1) { SET_BIT (candidates2, i); ++nr_candidates2; @@ -3275,8 +3300,10 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent gsi_insert_before (gsi, pow_stmt, GSI_SAME_STMT); } - /* Append the result of this iteration to the ops vector. */ - add_to_ops_vec (ops, iter_result); + /* Append the result of this iteration to the ops vector. + Give it a rank higher than all other ranks in the ops vector + so that all uses of it will be forced to come after it. */ + add_to_ops_vec_max_rank (ops, iter_result); /* Decrement the occurrence count of each element in the product by the count found above, and remove this many copies of each
Re: [PATCH] Fix PR18589
On Thu, 2012-04-12 at 09:50 -0700, H.J. Lu wrote: On Thu, Apr 5, 2012 at 6:49 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. 
It breaks bootstrap on Linux/ia32: ../../src-trunk/gcc/tree-ssa-reassoc.c: In function 'void attempt_builtin_powi(gimple, VEC_operand_entry_t_heap**, tree_node**)': ../../src-trunk/gcc/tree-ssa-reassoc.c:3189:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ ../../src-trunk/gcc/tree-ssa-reassoc.c:3222:44: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ cc1plus: all warnings being treated as errors H.J. Whoops. Looks like I need to use HOST_WIDE_INT_PRINT_DEC instead of %ld in those spots. I'll get a fix prepared.
Re: [PATCH] Fix PR18589
On Thu, 2012-04-12 at 09:50 -0700, H.J. Lu wrote: On Thu, Apr 5, 2012 at 6:49 AM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. 
It breaks bootstrap on Linux/ia32: ../../src-trunk/gcc/tree-ssa-reassoc.c: In function 'void attempt_builtin_powi(gimple, VEC_operand_entry_t_heap**, tree_node**)': ../../src-trunk/gcc/tree-ssa-reassoc.c:3189:41: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ ../../src-trunk/gcc/tree-ssa-reassoc.c:3222:44: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format] fprintf (dump_file, ")^%ld\n", power); ^ cc1plus: all warnings being treated as errors H.J. Thanks, H.J. Sorry for the problem! Fixing as follows. I'll plan to commit as obvious shortly. 2012-04-12 Bill Schmidt wschm...@linux.vnet.ibm.com * tree-ssa-reassoc.c (attempt_builtin_powi_stats): Change %ld to HOST_WIDE_INT_PRINT_DEC in format strings. Index: gcc/tree-ssa-reassoc.c === --- gcc/tree-ssa-reassoc.c (revision 186384) +++ gcc/tree-ssa-reassoc.c (working copy) @@ -3186,7 +3186,8 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent if (elt < vec_len - 1) fputs (" * ", dump_file); } - fprintf (dump_file, ")^%ld\n", power); + fprintf (dump_file, ")^" HOST_WIDE_INT_PRINT_DEC "\n", + power); } } } @@ -3219,7 +3220,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent if (elt < vec_len - 1) fputs (" * ", dump_file); } - fprintf (dump_file, ")^%ld\n", power); + fprintf (dump_file, ")^" HOST_WIDE_INT_PRINT_DEC "\n", power); } reassociate_stats.pows_created++;
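The fix works because HOST_WIDE_INT_PRINT_DEC is a string literal spliced into the format by adjacent-literal concatenation, so the right conversion is chosen per host. Standard C offers the same idiom through `<inttypes.h>`; the sketch below is an analogy, not GCC code, and note one difference: PRId64 supplies only the conversion letters, so the `%` must be written explicitly, whereas HOST_WIDE_INT_PRINT_DEC includes it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Portable formatting of a 64-bit value, analogous to the
   HOST_WIDE_INT_PRINT_DEC splice in the patch.  Illustrative only.  */
static void
format_power (char *buf, size_t len, int64_t power)
{
  /* ")^%" PRId64 "\n" concatenates to e.g. ")^%lld\n" on ILP32 hosts,
     which is what the ia32 bootstrap needed.  */
  snprintf (buf, len, ")^%" PRId64 "\n", power);
}
```

Hard-coding `%ld` instead would reproduce exactly the -Werror=format failure quoted above on hosts where the 64-bit type is `long long`.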
Re: [PATCH] Fix PR52614
On Thu, 2012-04-05 at 11:30 +0200, Richard Guenther wrote: On Thu, Apr 5, 2012 at 6:22 AM, Mike Stump mikest...@comcast.net wrote: On Apr 4, 2012, at 7:56 PM, William J. Schmidt wrote: There seems to be tacit agreement that the vector tests should use -fno-common on all targets to avoid the recent spate of failures (see discussion in 52571 and 52603). OK for trunk? Ok. Any other solution I think will be real work and we shouldn't lose the testing between now and then by not having the test cases working. Ian, you are the source of all of these problems. While I did not notice any degradations in SPEC (on x86) with handling commons correctly now, the fact that our testsuite needs -fno-common to make things vectorizable shows that users might be impacted negatively by this, which is only a real problem in corner cases. Why can the link editor not promote the definition's alignment when merging with a common with bigger alignment? Richard. Follow-up question: Should -ftree-vectorize imply -fno-common in the short term? Thanks, Bill
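For context on why -fno-common matters to these tests, here is an illustrative sketch (names ours, not from the testsuite): under -fcommon a tentative definition becomes a "common" symbol whose final alignment is resolved at link time, so the compiler cannot rely on the wide alignment the vectorizer wants; with -fno-common, or with an explicit initializer, the definition lands in .bss/.data with a compile-time-known alignment.

```c
#include <assert.h>

float common_buf[256];            /* tentative definition: a common symbol under -fcommon */
float known_buf[256] = { 1.0f };  /* initialized definition: never common */

/* The kind of loop the vect tests check; whether it is vectorized with
   aligned accesses can depend on which definition form is used.  */
float
sum (const float *p, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += p[i];
  return s;
}
```

This is why adding -fno-common to DEFAULT_VECTCFLAGS restores the expected "vectorized with aligned access" dump matches without editing each test.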
Re: [PATCH] Fix PR18589
On Thu, 2012-04-05 at 11:23 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 9:15 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Hm, naming them reassoc1 and reassoc2 is a hack. Naming both reassoc will not trigger re-naming them to reassoc1 and reassoc2 I think. How ugly. Especially that -fdump-tree-reassoc will no longer work. Maybe instead of using two pass structs resort to using the existing hack with using first_pass_instance and TODO_mark_first_instance. OK, that seems to be the best among evils. Using the first_pass_instance hack, the patch is transformed as below. Regstrapped on powerpc64-linux, no additional failures. OK for trunk? Thanks, Bill gcc: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. (linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. gcc/testsuite: 2012-04-05 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * gcc.dg/tree-ssa/pr18589-1.c: New test. * gcc.dg/tree-ssa/pr18589-2.c: Likewise. 
* gcc.dg/tree-ssa/pr18589-3.c: Likewise. * gcc.dg/tree-ssa/pr18589-4.c: Likewise. * gcc.dg/tree-ssa/pr18589-5.c: Likewise. * gcc.dg/tree-ssa/pr18589-6.c: Likewise. * gcc.dg/tree-ssa/pr18589-7.c: Likewise. * gcc.dg/tree-ssa/pr18589-8.c: Likewise. * gcc.dg/tree-ssa/pr18589-9.c: Likewise. * gcc.dg/tree-ssa/pr18589-10.c: Likewise. Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y, double z, double u) +{ + return x * x * y * y * y * z * z * z * z * u; +} + +/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-5.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y, double z, double u) +{ + return x * x * x * y * y * y * z * z * z * z * u * u * u * u; +} + +/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-6.c (revision 0) @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +double baz (double x, double y) +{ + return __builtin_pow (x, 3.0) * __builtin_pow (y, 4.0); +} + +/* { dg-final { scan-tree-dump-times " \\* " 4 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c === --- gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/pr18589-7.c (revision 0) @@
-0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -ffast-math -fdump-tree-optimized" } */ + +float baz (float x, float y) +{ + return x * x * x * x * y * y * y * y; +} + +/* { dg-final { scan-tree-dump-times " \\* " 3 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-8.c
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote: On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi, This is a re-post of the patch I posted for comments in January to address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589. The patch modifies reassociation to expose repeated factors from __builtin_pow* calls, optimally reassociate repeated factors, and possibly reconstitute __builtin_powi calls from the results of reassociation. Bootstrapped and passes regression tests for powerpc64-linux-gnu. I expect there may need to be some small changes, but I am targeting this for trunk approval. Thanks very much for the review, Hmm. How much work would it be to extend the reassoc 'IL' to allow a repeat factor per op? I realize what you do is all within what reassoc already does though ideally we would not require any GIMPLE IL changes for building up / optimizing the reassoc IL but only do so when we commit changes. Thanks, Richard. Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. Thus, I'd have prefered if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. 
That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? Multiple powi results are already handled, but yes, what you're suggesting would simplify things by eliminating the need to create explicit multiplies to join them and the cached-multiply results together. Sounds reasonable on the surface; it just hadn't occurred to me to do it this way. I'll have a look. Any other major concerns while I'm reworking this? Thanks, Bill Thanks, Richard.
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 15:08 +0200, Richard Guenther wrote: On Wed, Apr 4, 2012 at 2:35 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote: On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi, This is a re-post of the patch I posted for comments in January to address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589. The patch modifies reassociation to expose repeated factors from __builtin_pow* calls, optimally reassociate repeated factors, and possibly reconstitute __builtin_powi calls from the results of reassociation. Bootstrapped and passes regression tests for powerpc64-linux-gnu. I expect there may need to be some small changes, but I am targeting this for trunk approval. Thanks very much for the review, Hmm. How much work would it be to extend the reassoc 'IL' to allow a repeat factor per op? I realize what you do is all within what reassoc already does though ideally we would not require any GIMPLE IL changes for building up / optimizing the reassoc IL but only do so when we commit changes. Thanks, Richard. Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. 
Thus, I'd have preferred if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? Multiple powi results are already handled, but yes, what you're suggesting would simplify things by eliminating the need to create explicit multiplies to join them and the cached-multiply results together. Sounds reasonable on the surface; it just hadn't occurred to me to do it this way. I'll have a look. Any other major concerns while I'm reworking this? No, the rest looks fine (you should not need to replace -fdump-tree-reassoc-details with -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details in the first testcase). Unfortunately this seems to be necessary if I name the two passes reassoc1 and reassoc2. If I try to name both of them reassoc I get failures in other tests like gfortran.dg/reassoc_4, where -fdump-tree-reassoc1 doesn't work. Unless I'm missing something obvious, I think I need to keep that change. Frankly I was surprised and relieved that there weren't more tests that used the generic -fdump-tree-reassoc. Thanks, Bill Thanks, Richard. Thanks, Bill Thanks, Richard.
Re: [PATCH] Fix PR18589
On Wed, 2012-04-04 at 13:35 +0200, Richard Guenther wrote: On Tue, Apr 3, 2012 at 10:25 PM, William J. Schmidt wschm...@linux.vnet.ibm.com wrote: Hi Richard, I've revised my patch along these lines; see the new version below. While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases. Regstrapped successfully on powerpc64-linux. Hope this looks better! Yes indeed. A few observations though. You didn't integrate attempt_builtin_powi with optimize_ops_list - presumably because it's result does not really fit the single-operation assumption? But note that undistribute_ops_list and optimize_range_tests have the same issue. Thus, I'd have prefered if attempt_builtin_powi worked in the same way, remove the parts of the ops list it consumed and stick an operand for its result there instead. That should simplify things (not having that special powi_result) and allow for multiple powi results in a single op list? An excellent suggestion. I've implemented it below and it is indeed much cleaner this way. Bootstrapped/regression tested with no new failures on powerpc64-linux. Is this incarnation OK for trunk? Thanks, Bill Thanks, Richard. Thanks, Bill gcc: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * tree-pass.h: Replace pass_reassoc with pass_early_reassoc and pass_late_reassoc. * passes.c (init_optimization_passes): Change pass_reassoc calls to pass_early_reassoc and pass_late_reassoc. * tree-ssa-reassoc.c (reassociate_stats): Add two fields. (operand_entry): Add count field. (early_reassoc): New static var. (add_repeat_to_ops_vec): New function. (completely_remove_stmt): Likewise. (remove_def_if_absorbed_call): Likewise. (remove_visited_stmt_chain): Remove feeding builtin pow/powi calls. (acceptable_pow_call): New function. 
(linearize_expr_tree): Look for builtin pow/powi calls and add operand entries with repeat counts when found. (repeat_factor_d): New struct and associated typedefs. (repeat_factor_vec): New static vector variable. (compare_repeat_factors): New function. (get_reassoc_pow_ssa_name): Likewise. (attempt_builtin_powi): Likewise. (reassociate_bb): Call attempt_builtin_powi. (fini_reassoc): Two new calls to statistics_counter_event. (execute_early_reassoc): New function. (execute_late_reassoc): Likewise. (pass_early_reassoc): Rename from pass_reassoc, call execute_early_reassoc. (pass_late_reassoc): New gimple_opt_pass that calls execute_late_reassoc. gcc/testsuite: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com PR tree-optimization/18589 * gcc.dg/pr46309.c: Change -fdump-tree-reassoc-details to -fdump-tree-reassoc1-details and -fdump-tree-reassoc2-details. * gcc.dg/tree-ssa/pr18589-1.c: New test. * gcc.dg/tree-ssa/pr18589-2.c: Likewise. * gcc.dg/tree-ssa/pr18589-3.c: Likewise. * gcc.dg/tree-ssa/pr18589-4.c: Likewise. * gcc.dg/tree-ssa/pr18589-5.c: Likewise. * gcc.dg/tree-ssa/pr18589-6.c: Likewise. * gcc.dg/tree-ssa/pr18589-7.c: Likewise. * gcc.dg/tree-ssa/pr18589-8.c: Likewise. * gcc.dg/tree-ssa/pr18589-9.c: Likewise. * gcc.dg/tree-ssa/pr18589-10.c: Likewise. 
Index: gcc/tree-pass.h === --- gcc/tree-pass.h (revision 186108) +++ gcc/tree-pass.h (working copy) @@ -441,7 +441,8 @@ extern struct gimple_opt_pass pass_copy_prop; extern struct gimple_opt_pass pass_vrp; extern struct gimple_opt_pass pass_uncprop; extern struct gimple_opt_pass pass_return_slot; -extern struct gimple_opt_pass pass_reassoc; +extern struct gimple_opt_pass pass_early_reassoc; +extern struct gimple_opt_pass pass_late_reassoc; extern struct gimple_opt_pass pass_rebuild_cgraph_edges; extern struct gimple_opt_pass pass_remove_cgraph_callee_edges; extern struct gimple_opt_pass pass_build_cgraph_edges; Index: gcc/testsuite/gcc.dg/pr46309.c === --- gcc/testsuite/gcc.dg/pr46309.c (revision 186108) +++ gcc/testsuite/gcc.dg/pr46309.c (working copy) @@ -1,6 +1,6 @@ /* PR tree-optimization/46309 */ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-reassoc-details" } */ +/* { dg-options "-O2 -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details" } */ /* The transformation depends on BRANCH_COST being greater than 1 (see the notes in the PR), so try to force that. */ /* { dg-additional-options "-mtune=octeon2" { target mips*-*-* } } */ Index: gcc/testsuite/gcc.dg/tree-ssa/pr18589-4.c
[PATCH] Fix PR52614
There seems to be tacit agreement that the vector tests should use -fno-common on all targets to avoid the recent spate of failures (see discussion in 52571 and 52603). This patch (proposed by Dominique D'Humieures) does just that. I agreed to shepherd the patch through. I've verified that it removes the failures for powerpc64-linux. Various others have verified for arm, sparc, and darwin. OK for trunk? Thanks, Bill gcc/testsuite: 2012-04-04 Bill Schmidt wschm...@linux.vnet.ibm.com Dominique D'Humieures domi...@lps.ens.fr PR testsuite/52614 * gcc.dg/vect/vect.exp: Use -fno-common on all targets. * gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp: Likewise. Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp === --- gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp (revision 186108) +++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp (working copy) @@ -34,7 +34,7 @@ if ![is-effective-target powerpc_altivec_ok] { set DEFAULT_VECTCFLAGS "" # These flags are used for all targets. -lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fvect-cost-model" +lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fvect-cost-model" "-fno-common" # If the target system supports vector instructions, the default action # for a test is 'run', otherwise it's 'compile'. Save current default. Index: gcc/testsuite/gcc.dg/vect/vect.exp === --- gcc/testsuite/gcc.dg/vect/vect.exp (revision 186108) +++ gcc/testsuite/gcc.dg/vect/vect.exp (working copy) @@ -40,7 +40,7 @@ if ![check_vect_support_and_set_flags] { } # These flags are used for all targets. -lappend DEFAULT_VECTCFLAGS "-ftree-vectorize" "-fno-vect-cost-model" +lappend DEFAULT_VECTCFLAGS "-ftree-vectorize" "-fno-vect-cost-model" "-fno-common" # Initialize `dg'. dg-init
Re: [PATCH] Fix PR18589
On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote:
> On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > This is a re-post of the patch I posted for comments in January to
> > address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589.  The patch
> > modifies reassociation to expose repeated factors from __builtin_pow*
> > calls, optimally reassociate repeated factors, and possibly
> > reconstitute __builtin_powi calls from the results of reassociation.
> >
> > Bootstrapped and passes regression tests for powerpc64-linux-gnu.  I
> > expect there may need to be some small changes, but I am targeting
> > this for trunk approval.
> >
> > Thanks very much for the review,
>
> Hmm.  How much work would it be to extend the reassoc 'IL' to allow a
> repeat factor per op?  I realize what you do is all within what reassoc
> already does, though ideally we would not require any GIMPLE IL changes
> for building up / optimizing the reassoc IL, but only do so when we
> commit changes.
>
> Thanks,
> Richard.

Hi Richard,

I've revised my patch along these lines; see the new version below.  While testing it I realized I could do a better job of reducing the number of multiplies, so there are some changes to that logic as well, and a couple of additional test cases.

Regstrapped successfully on powerpc64-linux.  Hope this looks better!

Thanks,
Bill

gcc:

2012-04-03  Bill Schmidt  <wschm...@linux.vnet.ibm.com>

	PR tree-optimization/18589
	* tree-pass.h: Replace pass_reassoc with pass_early_reassoc and
	pass_late_reassoc.
	* passes.c (init_optimization_passes): Change pass_reassoc calls
	to pass_early_reassoc and pass_late_reassoc.
	* tree-ssa-reassoc.c (reassociate_stats): Add two fields.
	(operand_entry): Add count field.
	(early_reassoc): New static var.
	(add_repeat_to_ops_vec): New function.
	(completely_remove_stmt): Likewise.
	(remove_def_if_absorbed_call): Likewise.
	(remove_visited_stmt_chain): Remove feeding builtin pow/powi calls.
	(acceptable_pow_call): New function.
	(linearize_expr_tree): Look for builtin pow/powi calls and add
	operand entries with repeat counts when found.
	(repeat_factor_d): New struct and associated typedefs.
	(repeat_factor_vec): New static vector variable.
	(compare_repeat_factors): New function.
	(get_reassoc_pow_ssa_name): Likewise.
	(attempt_builtin_powi): Likewise.
	(reassociate_bb): Attempt to create __builtin_powi calls, and
	multiply their results by any leftover reassociated factors;
	remove builtin pow/powi calls that were absorbed by reassociation.
	(fini_reassoc): Two new calls to statistics_counter_event.
	(execute_early_reassoc): New function.
	(execute_late_reassoc): Likewise.
	(pass_early_reassoc): Replace pass_reassoc, renamed to reassoc1;
	call execute_early_reassoc.
	(pass_late_reassoc): New gimple_opt_pass named reassoc2 that calls
	execute_late_reassoc.

gcc/testsuite:

2012-04-03  Bill Schmidt  <wschm...@linux.vnet.ibm.com>

	PR tree-optimization/18589
	* gcc.dg/pr46309.c: Change -fdump-tree-reassoc-details to
	-fdump-tree-reassoc[12]-details.
	* gcc.dg/tree-ssa/pr18589-1.c: New test.
	* gcc.dg/tree-ssa/pr18589-2.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-3.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-4.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-5.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-6.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-7.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-8.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-9.c: Likewise.
	* gcc.dg/tree-ssa/pr18589-10.c: Likewise.
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 186108)
+++ gcc/tree-pass.h	(working copy)
@@ -441,7 +441,8 @@ extern struct gimple_opt_pass pass_copy_prop;
 extern struct gimple_opt_pass pass_vrp;
 extern struct gimple_opt_pass pass_uncprop;
 extern struct gimple_opt_pass pass_return_slot;
-extern struct gimple_opt_pass pass_reassoc;
+extern struct gimple_opt_pass pass_early_reassoc;
+extern struct gimple_opt_pass pass_late_reassoc;
 extern struct gimple_opt_pass pass_rebuild_cgraph_edges;
 extern struct gimple_opt_pass pass_remove_cgraph_callee_edges;
 extern struct gimple_opt_pass pass_build_cgraph_edges;
Index: gcc/testsuite/gcc.dg/pr46309.c
===================================================================
--- gcc/testsuite/gcc.dg/pr46309.c	(revision 186108)
+++ gcc/testsuite/gcc.dg/pr46309.c	(working copy)
@@ -1,6 +1,6 @@
 /* PR tree-optimization/46309 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-reassoc-details" } */
+/* { dg-options "-O2 -fdump-tree-reassoc1-details -fdump-tree-reassoc2-details" } */
 /* The transformation depends on BRANCH_COST being greater than 1 (see
Re: [PATCH] Fix PR18589
On Wed, 2012-03-28 at 15:57 +0200, Richard Guenther wrote:
> On Tue, Mar 6, 2012 at 9:49 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > This is a re-post of the patch I posted for comments in January to
> > address http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18589.  The patch
> > modifies reassociation to expose repeated factors from __builtin_pow*
> > calls, optimally reassociate repeated factors, and possibly
> > reconstitute __builtin_powi calls from the results of reassociation.
> >
> > Bootstrapped and passes regression tests for powerpc64-linux-gnu.  I
> > expect there may need to be some small changes, but I am targeting
> > this for trunk approval.
> >
> > Thanks very much for the review,
>
> Hmm.  How much work would it be to extend the reassoc 'IL' to allow a
> repeat factor per op?  I realize what you do is all within what reassoc
> already does, though ideally we would not require any GIMPLE IL changes
> for building up / optimizing the reassoc IL, but only do so when we
> commit changes.

Ah, I take your point.  I will look into it.  We still need the additional data structures to allow sorting by factor repeat counts, but perhaps expanding the builtins can be avoided until it's proven necessary.  The patch as submitted may be slightly easier to implement and understand, but I agree it would be better to avoid changing GIMPLE unnecessarily if possible.  I'll get back to you shortly.

Thanks,
Bill

> Thanks,
> Richard.
Re: [PATCH] Straight line strength reduction, part 1
On Wed, 2012-03-21 at 10:33 +0100, Richard Guenther wrote:
> On Mon, Mar 19, 2012 at 2:19 AM, Andrew Pinski <pins...@gmail.com> wrote:
> > On Sun, Mar 18, 2012 at 6:12 PM, William J. Schmidt
> > <wschm...@linux.vnet.ibm.com> wrote:
> > > Greetings,
> > >
> > > Now that we're into stage 1 again, I'd like to submit the first round
> > > of changes for dominator-based strength reduction, which will address
> > > issues from PR22586, PR35308, PR46556, and perhaps others.
> > >
> > > I'm attaching two patches: the smaller (slsr-part1) is the patch I'm
> > > submitting for approval today, while the larger (slsr-fyi) is for
> > > reference only, but may be useful if questions arise about how the
> > > small patch fits into the intended whole.
> > >
> > > This patch contains the logic for identifying strength reduction
> > > candidates, and makes replacements only for those candidates where
> > > the stride is a fixed constant.  Replacement for candidates with
> > > fixed but unknown strides is not implemented herein, but that logic
> > > can be viewed in the larger patch.  This patch does not address
> > > strength reduction of data reference expressions, or candidates with
> > > conditional increments; those issues will be dealt with in future
> > > patches.
> > >
> > > The cost model is built on the one used by tree-ssa-ivopts.c, and
> > > I've added some new instruction costs to that model in place.  It
> > > might eventually be good to divorce that modeling code from IVOPTS,
> > > but that's an orthogonal patch and somewhat messy.
> >
> > I think this is the wrong way to do straight line strength reduction,
> > considering we have a nice value numbering system which should be easy
> > to extend to support it.
>
> Well, it is easy to handle very specific easy cases like
>
>   a = i * 2;
>   b = i * 3;
>   c = i * 4;
>
> to transform it to
>
>   a = i * 2;
>   b = a + i;
>   c = b + i;
>
> but already
>
>   a = i * 2;
>   b = i * 4;
>   c = i * 6;
>
> would need extra special code.  The easy case could be handled in
> eliminate () by, when seeing A * CST, looking up A * (CST - 1) and, if
> that succeeds, transforming it to VAL + A.  The cost issue is that this
> increases the lifetime of VAL.
> I've done this simple case at some point, but it failed to handle the
> common associated cases, where we transform (a + 1) * 2, (a + 1) * 3,
> etc. to a * 2 + 2, a * 3 + 3, etc.  I think it is the re-association in
> the presence of a strength-reduction opportunity that makes the
> separate pass better?  How would you suggest handling this case in the
> VN framework?  Detect the a * 3 + 3 pattern and then do two lookups,
> one for a * 2 and one for val + 2?  But then we still don't have a
> value for a + 1 to re-use ...

And it becomes even more difficult with more complex scenarios.  Consider:

  a = x + (3 * s);
  b = x + (5 * s);
  c = x + (7 * s);

The framework I've developed recognizes that this group of statements is related, and that it is profitable to replace them as follows:

  a = x + (3 * s);
  t = 2 * s;
  b = a + t;
  c = b + t;

The introduced multiply by 2 (one shift) is far cheaper than the two multiplies it replaces.  However, suppose you have instead:

  a = x + (2 * s);
  b = x + (8 * s);

Now it isn't profitable to replace this by:

  a = x + (2 * s);
  t = 6 * s;
  b = a + t;

since a multiply by 6 (two shifts and an add) is more costly than a multiply by 8 (one shift).  Making these decisions correctly requires analyzing all the related statements together, which value numbering as it stands is not equipped to do.  Logic to handle these cases is included in my larger FYI patch.

As another example, consider conditionally executed increments:

  a = i * 5;
  if (...)
    i = i + 1;
  b = i * 5;

This can be correctly and profitably strength-reduced as:

  a = i * 5;
  t = a;
  if (...)
    {
      i = i + 1;
      t = t + 5;
    }
  b = t;

(This is an approximation of the actual phi representation, which I've omitted for clarity.)  Again, this kind of analysis is not something that fits naturally into value numbering.  I don't yet have this in the FYI patch, but have it largely working in a private version.
My conclusion is that if strength reduction is done in value numbering, it must either be a very limited form of strength reduction, or the kind of logic I've developed that considers chains of related candidates together must be glued onto value numbering.  I think the latter would be a mistake, as it would introduce much unnecessary complexity to what is currently a very clean approach to PRE; the strength reduction would become an ugly wart that people would complain about.  I think it's far cleaner to keep the two issues separate.

> Bill, experimenting with pattern detection in eliminate () would be a
> possibility.

For the reasons expressed above, I don't think that would get very far or make anyone very happy...  I appreciate Andrew's view that value numbering is a logical place to do strength reduction, but after considering the problem over the last few months I have to disagree.  If you don't mind, at this point I would prefer to have my current patch considered on its merits.

Thanks,
Bill