[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-12-08 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #15 from Pat Haugen  ---
Just confirming that the changes have eliminated the bwaves degradation on
PowerPC that started with r249919.

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-12-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #14 from Richard Biener  ---
Author: rguenth
Date: Fri Dec  8 08:22:08 2017
New Revision: 255499

URL: https://gcc.gnu.org/viewcvs?rev=255499&root=gcc&view=rev
Log:
2017-12-08  Richard Biener  

PR tree-optimization/81303
* gfortran.dg/pr81303.f: New testcase.
* gfortran.dg/vect/pr81303.f: Likewise.

Added:
trunk/gcc/testsuite/gfortran.dg/pr81303.f
trunk/gcc/testsuite/gfortran.dg/vect/pr81303.f
Modified:
trunk/gcc/testsuite/ChangeLog

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-12-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #12 from Richard Biener  ---
Author: rguenth
Date: Fri Dec  8 08:06:31 2017
New Revision: 255497

URL: https://gcc.gnu.org/viewcvs?rev=255497&root=gcc&view=rev
Log:
2017-12-08  Richard Biener  

PR tree-optimization/81303
* tree-vect-stmts.c (vect_is_simple_cond): For invariant
conditions try to create a comparison vector type matching
the data vector type.
(vectorizable_condition): Adjust.
* tree-vect-patterns.c (vect_recog_mask_conversion_pattern):
Leave invariant conditions alone in case we can vectorize those.

* gcc.target/i386/vectorize9.c: New testcase.
* gcc.target/i386/vectorize10.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.target/i386/vectorize10.c
trunk/gcc/testsuite/gcc.target/i386/vectorize9.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-vect-patterns.c
trunk/gcc/tree-vect-stmts.c

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-12-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #13 from Richard Biener  ---
Fixed.

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-12-07 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #11 from amker at gcc dot gnu.org ---
Author: amker
Date: Thu Dec  7 18:03:53 2017
New Revision: 255472

URL: https://gcc.gnu.org/viewcvs?rev=255472&root=gcc&view=rev
Log:
PR tree-optimization/81303
* Makefile.in (gimple-loop-interchange.o): New object file.
* common.opt (floop-interchange): Reuse the option from graphite.
* doc/invoke.texi (-floop-interchange): Ditto.  New document for
-floop-interchange and mention it for -O3.
* opts.c (default_options_table): Enable -floop-interchange at -O3.
* gimple-loop-interchange.cc: New file.
* params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
(PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
* passes.def (pass_linterchange): New pass.
* timevar.def (TV_LINTERCHANGE): New time var.
* tree-pass.h (make_pass_linterchange): New declaration.
* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
interchange.  Record IV before/after increment in new parameters.
* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
* tree-vect-loop.c (vect_is_simple_reduction): Factor out reduction
path check into...
(check_reduction_path): ...New function here.
* tree-vectorizer.h (check_reduction_path): New declaration.

gcc/testsuite
* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
* gcc.dg/tree-ssa/loop-interchange-1b.c: New test.
* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
* gcc.dg/tree-ssa/loop-interchange-10.c: New test.
* gcc.dg/tree-ssa/loop-interchange-11.c: New test.
* gcc.dg/tree-ssa/loop-interchange-12.c: New test.
* gcc.dg/tree-ssa/loop-interchange-13.c: New test.

Added:
trunk/gcc/gimple-loop-interchange.cc
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-11.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-12.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-13.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1b.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/Makefile.in
trunk/gcc/common.opt
trunk/gcc/doc/invoke.texi
trunk/gcc/opts.c
trunk/gcc/params.def
trunk/gcc/passes.def
trunk/gcc/testsuite/ChangeLog
trunk/gcc/timevar.def
trunk/gcc/tree-pass.h
trunk/gcc/tree-ssa-loop-ivcanon.c
trunk/gcc/tree-ssa-loop-ivopts.h
trunk/gcc/tree-vect-loop.c
trunk/gcc/tree-vectorizer.h

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-08-16 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #10 from amker at gcc dot gnu.org ---
(In reply to Wilco from comment #8)
> (In reply to Richard Biener from comment #7)
> 
> Unfortunately these commits have had no effect on AArch64...

That is because we (and PowerPC as well?) currently don't peel for alignment,
so the change in peeling cost has no impact on AArch64.  I still believe
interchange (on the basis of distribution) is the correct direction to fix this
regression (and bring further improvement).  Of course, vectorization should
be avoided even after interchange on x86.
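
For readers unfamiliar with the transformation mentioned above, here is a minimal sketch of loop interchange in C. The array size and function names are hypothetical and are not taken from bwaves or from the GCC testsuite. The first function walks the arrays with a stride of N elements in the inner loop, the second is the same computation with the two loops swapped so the inner loop has unit stride.

```c
#define N 64

/* Inner loop runs over the row index, so consecutive accesses are
   N doubles apart (large stride).  */
void
scale_strided (double dst[N][N], double src[N][N])
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      dst[i][j] = 2.0 * src[i][j];
}

/* Same computation after interchanging the two loops: the inner loop
   now touches consecutive elements (unit stride).  */
void
scale_interchanged (double dst[N][N], double src[N][N])
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      dst[i][j] = 2.0 * src[i][j];
}
```

Both functions store the same values. Only the memory access order differs, which is what makes the interchanged form friendlier to caches and to unit-stride vector loads.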

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #9 from Richard Biener  ---
Author: rguenth
Date: Tue Jul 25 10:59:15 2017
New Revision: 250503

URL: https://gcc.gnu.org/viewcvs?rev=250503&root=gcc&view=rev
Log:
2017-07-25  Richard Biener  

PR tree-optimization/81303
* tree-vect-loop-manip.c (vect_loop_versioning): Build
profitability check against LOOP_VINFO_NITERSM1.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-loop-manip.c

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #8 from Wilco  ---
(In reply to Richard Biener from comment #7)

Unfortunately these commits have had no effect on AArch64...

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #7 from Richard Biener  ---
Author: rguenth
Date: Fri Jul 21 11:32:39 2017
New Revision: 250424

URL: https://gcc.gnu.org/viewcvs?rev=250424&root=gcc&view=rev
Log:
2017-07-21  Richard Biener  

PR tree-optimization/81303
* tree-vect-data-refs.c (vect_get_peeling_costs_all_drs): Pass
in datarefs vector.  Allow NULL dr0 for no peeling cost estimate.
(vect_peeling_hash_get_lowest_cost): Adjust.
(vect_enhance_data_refs_alignment): Likewise.  Use
vect_get_peeling_costs_all_drs to compute the penalty for no
peeling to match up costs.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-data-refs.c

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #6 from Richard Biener  ---
So if in addition to this patch we do

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        (revision 250386)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -7376,7 +7377,7 @@ vect_transform_loop (loop_vec_info loop_
   /* Version the loop first, if required, so the profitability check
      comes first.  */

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (check_profitability || LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       vect_loop_versioning (loop_vinfo, th, check_profitability);
       check_profitability = false;

that is, always do the profitability check by versioning (which means not
sharing the epilogue loop with the scalar execution, plus reliably executing
the cost model check first), we get down to 212s from 250s.  This might be
solely because we do not completely peel the versioned copy, as we are not
able to analyze its number of iterations (despite the dominating > 7 check).

  _33 = (unsigned int) _1;
  if (_33 > 7)
goto ; [80.01%] [count: INV]
  else
goto ; [19.99%] [count: INV]

   [3.00%] [count: INV]:

   [16.99%] [count: INV] loop 6 header:
  # m_23 = PHI <1(34), m_449(36)>
...
  m_449 = m_23 + 1;
  if (_1 < m_449)
goto ; [17.65%] [count: INV]
  else
goto ; [82.35%] [count: INV]

   [13.99%] [count: INV] loop 6 latch:
  goto ; [100.00%] [count: INV]

We're probably confused by the cast here; inferring a range for _1 from just
the above would result in [INT_MIN, 7] only (good enough I guess).

We peel the vector epilogue because:

Loop 8 iterates at most 5 times.
Loop 8 likely iterates at most 5 times.
Estimating sizes for loop 8
 BB: 29, after_exit: 0
  size:   0 _372 = (integer(kind=8)) m_375;
  size:   1 _371 = _372 * stride.88_115;
...
  size:   1 _332 = _349 + _333;
  size:   1 m_331 = m_375 + 1;
  size:   2 if (_1 < m_331)
   Exit condition will be eliminated in last copy.
 BB: 30, after_exit: 1
size: 41-0, last_iteration: 41-2
  Loop size: 41
  Estimated size after unrolling: 162

That is, we determine that no stmts will be optimized away due to propagating
constants, but then apply our usual optimistic 2/3 heuristic, leading to
that estimate (max-completely-peeled-insns is 200).
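
The arithmetic behind that estimate can be sketched as follows. This is a reconstruction chosen to be consistent with the dump above, not a copy of GCC's source: the function name and parameter names are illustrative, and the real code may differ in detail.

```c
/* Illustrative sketch: size estimate for completely peeling a loop.
   OVERALL / ELIMINATED describe one regular iteration, LAST /
   LAST_ELIMINATED describe the final (exiting) copy, NUNROLL is the
   number of regular copies.  */
long
estimated_unrolled_size (long overall, long eliminated,
                         long last, long last_eliminated,
                         long nunroll)
{
  long insns = nunroll * (overall - eliminated);
  insns += last - last_eliminated;

  /* Optimistic heuristic: assume a third of the peeled code will be
     cleaned up by later passes.  */
  insns = insns * 2 / 3;
  return insns > 0 ? insns : 1;
}
```

With the values from the dump (size 41-0, last_iteration 41-2, at most 5 iterations) this gives (5 * 41 + 39) * 2 / 3 = 162, which is below the limit of 200, so the loop is peeled.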

For small trip count loops the advantage of peeling (irrespective of size)
is better branch predictor hitrate.

There's quite a mistake in the cost modeling of peeling for alignment, but
even with that fixed we end up with a nopeel inside-cost of 14 and a best peel
inside-cost of 13 (we manage to align one load).  Now it doesn't take into
account outside cost at all, which is 59 vs 115, but it's hard to combine
both in a sensible way ...

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #5 from Richard Biener  ---
Author: rguenth
Date: Fri Jul 21 07:13:57 2017
New Revision: 250416

URL: https://gcc.gnu.org/viewcvs?rev=250416&root=gcc&view=rev
Log:
2017-07-21  Richard Biener  

PR tree-optimization/81303
* tree-vect-loop.c (vect_estimate_min_profitable_iters): Take
into account prologue and epilogue iterations when raising
min_profitable_iters to sth at least covering one vector iteration.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-loop.c

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-20 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #4 from Richard Biener  ---
So one useful change is the following which makes the runtime profitability
threshold 6 and thus the vector loop is never entered.  Even though that should
be quite a predictable conditional jump it turns out we mess up BB placement
so that the result isn't a big improvement (254s -> 250s).  This is probably
also due to the fact that we end up peeling the inner loop completely
(we know it iterates <= profitability threshold times).  Plus we do not
version the loop but share the non-profitable part with the peeled copy
making RA's job harder :/

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        (revision 250384)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -3702,8 +3702,9 @@ vect_estimate_min_profitable_iters (loop
                     "  Calculated minimum iters for profitability: %d\n",
                     min_profitable_iters);

-  min_profitable_iters =
-       min_profitable_iters < vf ? vf : min_profitable_iters;
+  /* We want the vectorized loop to execute at least once.  */
+  if (min_profitable_iters < (vf + peel_iters_prologue + peel_iters_epilogue))
+    min_profitable_iters = vf + peel_iters_prologue + peel_iters_epilogue;

   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
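
To make the effect of the patch above concrete, here is a small standalone sketch of the old and new clamping rules. The two function names are hypothetical; only the variable names mirror the patch. The inputs in the note below are illustrative and are not meant to reproduce the thresholds quoted in this thread.

```c
/* Old rule: only require at least one vector's worth of iterations.  */
int
clamp_old (int min_profitable_iters, int vf)
{
  return min_profitable_iters < vf ? vf : min_profitable_iters;
}

/* New rule: also cover the scalar iterations spent in the peeled
   prologue and epilogue, so the vector body runs at least once.  */
int
clamp_new (int min_profitable_iters, int vf,
           int peel_iters_prologue, int peel_iters_epilogue)
{
  int at_least = vf + peel_iters_prologue + peel_iters_epilogue;
  return min_profitable_iters < at_least ? at_least : min_profitable_iters;
}
```

For example, with a calculated minimum of 5, vf = 4 and two prologue plus two epilogue iterations, the old rule keeps 5 while the new rule raises the minimum to 8.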

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-12 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Wilco  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-07-12
                 CC|                            |wilco at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Wilco  ---
Confirmed, on AArch64 bwaves is ~20% slower in SPEC2006 and ~30% slower in
SPEC2017. There are twice as many spills (outside the inner loop) and the
vectors are created in an inefficient way:

ldr     d4, [x5,x27]
ld1r    {v6.2d}, [x5]
mov     v6.d[1], v4.d[0]
add     x5, x5, x26
fmla    v1.2d, v20.2d, v6.2d

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #2 from Richard Biener  ---
Without peeling for alignment the numbers improve but we still regress from
176s to 205s.  The innermost (unrolled) loop is:

.L11:
vmovsd  (%rdi,%r15,2), %xmm2
vmovsd  (%rsi,%r15,2), %xmm1
movq    -56(%rbp), %rbx
vmovhpd (%rdi,%r14), %xmm2, %xmm0
vmovsd  (%rdi), %xmm2
vmovhpd (%rsi,%r14), %xmm1, %xmm6
vmovsd  (%rsi), %xmm1
vmovhpd (%rdi,%r15), %xmm2, %xmm2
vmovhpd (%rsi,%r15), %xmm1, %xmm1
addq    %r11, %rdi
addq    %r11, %rsi
vinsertf128 $0x1, %xmm0, %ymm2, %ymm2
vmovsd  (%rcx,%r15,2), %xmm0
vinsertf128 $0x1, %xmm6, %ymm1, %ymm1
vmulpd  (%rbx,%rax), %ymm2, %ymm3
movq    -72(%rbp), %rbx
vmovapd %ymm1, %ymm2
vmovhpd (%rcx,%r14), %xmm0, %xmm6
vmovsd  (%rcx), %xmm0
vfmadd132pd (%rbx,%rax), %ymm3, %ymm2
movq    -88(%rbp), %rbx
vmovhpd (%rcx,%r15), %xmm0, %xmm0
addq    %r11, %rcx
vinsertf128 $0x1, %xmm6, %ymm0, %ymm0
vmulpd  (%rbx,%rax), %ymm0, %ymm3
vmovsd  (%rdx,%r15,2), %xmm0
movq    -64(%rbp), %rbx
vmovhpd (%rdx,%r14), %xmm0, %xmm6
vmovsd  (%rdx), %xmm0
vmovhpd (%rdx,%r15), %xmm0, %xmm0
addq    %r11, %rdx
vinsertf128 $0x1, %xmm6, %ymm0, %ymm0
vfmadd132pd (%rbx,%rax), %ymm3, %ymm0
vaddpd  %ymm0, %ymm2, %ymm1
vmovsd  (%r9,%r15,2), %xmm0
vmovhpd (%r9,%r14), %xmm0, %xmm3
vmovsd  (%r9), %xmm0
vmovhpd (%r9,%r15), %xmm0, %xmm0
addq    %r11, %r9
vinsertf128 $0x1, %xmm3, %ymm0, %ymm0
vmulpd  (%r12,%rax), %ymm0, %ymm2
vmovsd  (%r8,%r15,2), %xmm0
vmovhpd (%r8,%r14), %xmm0, %xmm3
vmovsd  (%r8), %xmm0
vmovhpd (%r8,%r15), %xmm0, %xmm0
movq    -80(%rbp), %rbx
addq    %r11, %r8
vinsertf128 $0x1, %xmm3, %ymm0, %ymm0
vfmadd132pd (%rbx,%rax), %ymm2, %ymm0
vaddpd  %ymm0, %ymm1, %ymm0
vmovsd  (%r10,%r15,2), %xmm1
vmovhpd (%r10,%r14), %xmm1, %xmm2
vmovsd  (%r10), %xmm1
vmovhpd (%r10,%r15), %xmm1, %xmm1
addq    %r11, %r10
vinsertf128 $0x1, %xmm2, %ymm1, %ymm1
vfmadd231pd 0(%r13,%rax), %ymm1, %ymm4
addq    $32, %rax
vaddpd  %ymm4, %ymm0, %ymm4
cmpq    -96(%rbp), %rax
jne     .L11

vs

.L10:
vmovsd  (%rax,%rbx,8), %xmm0
vmulsd  (%r15,%rdx), %xmm0, %xmm0
vmovsd  (%r8,%rdx), %xmm1
vfmadd132sd (%rax,%r11,8), %xmm0, %xmm1
vmovsd  (%rax,%rsi,8), %xmm0
vmulsd  (%r12,%rdx), %xmm0, %xmm0
vmovsd  0(%rbp,%rdx), %xmm4
vfmadd231sd (%rax), %xmm4, %xmm0
vmovsd  (%r14,%rdx), %xmm5
vmovsd  (%rdi,%rdx), %xmm6
vfmadd231sd (%rax,%r9,8), %xmm6, %xmm2
vaddsd  %xmm0, %xmm1, %xmm0
vmovsd  (%rax,%r10,8), %xmm1
vmulsd  0(%r13,%rdx), %xmm1, %xmm1
vfmadd231sd (%rax,%rcx,8), %xmm5, %xmm1
addq    -112(%rsp), %rdx
addq    $8, %rax
vaddsd  %xmm2, %xmm1, %xmm2
vaddsd  %xmm2, %xmm0, %xmm2
cmpq    -120(%rsp), %rax
jne     .L10

looks like register pressure is high and IVO doesn't do the best job either.
The vectorized loop might also run into CPU arch limits with respect to
loop cache (it's 310 bytes long).

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #1 from Richard Biener  ---
Looks like we peel for alignment, which for this loop is quite pointless as it
only runs 5 times, so for AVX256 we're likely running into peel for alignment,
no vector iteration, epilogue.

Need to tame down that damn alignment peeling more ...

It peels 'x' btw.

block_solver.f:178:0: note: Cost model analysis:
  Vector inside of loop cost: 76
  Vector prologue cost: 61
  Vector epilogue cost: 62
  Scalar iteration cost: 28
  Scalar outside cost: 7
  Vector outside cost: 123
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5
block_solver.f:178:0: note:   Runtime profitability threshold = 4
block_solver.f:178:0: note:   Static estimate profitability threshold = 5

but that doesn't take into account that we may spend up to 3 scalar iterations
in the alignment prologue, and thus with niter < 7 we may never enter
the vector loop.  The static estimate is similarly affected by this.
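
The numbers in the dump above can be reproduced with a small break-even computation. The sketch below is an assumption about the shape of the formula, written to be consistent with the dump, and is not quoted from GCC. The function names are hypothetical, and vf = 4 is assumed (four doubles per 256-bit vector).

```c
/* Illustrative break-even point: smallest iteration count at which the
   vector version is expected to beat the scalar one.  Assumes
   scalar_iter * vf > vec_inside, i.e. the vector body is a win.  */
int
min_profitable_iters (int vec_inside, int vec_outside,
                      int scalar_iter, int scalar_outside,
                      int vf, int prologue, int epilogue)
{
  int n = ((vec_outside - scalar_outside) * vf
           - vec_inside * prologue
           - vec_inside * epilogue)
          / (scalar_iter * vf - vec_inside);

  /* The division truncates; bump the result if the scalar loop is
     still no more expensive at N iterations.  */
  if (scalar_iter * vf * n
      <= vec_inside * n + (vec_outside - scalar_outside) * vf)
    n++;
  return n;
}

/* Number of full vector iterations left once PROLOGUE scalar
   iterations have been peeled off for alignment.  */
int
vector_iters (int niters, int prologue, int vf)
{
  int rest = niters - prologue;
  return rest > 0 ? rest / vf : 0;
}
```

Plugging in the dump values (inside cost 76, outside cost 123, scalar iteration cost 28, scalar outside cost 7, two prologue and two epilogue iterations) yields 5, matching the calculated minimum. At the same time, a loop of 5 iterations with 2 of them peeled into the prologue leaves 3 iterations, which is less than vf, so `vector_iters (5, 2, 4)` is 0 and the vector body never runs.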

[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919

2017-07-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.0