[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303 --- Comment #15 from Pat Haugen --- Just confirming that the changes have eliminated the bwaves degradation on PowerPC that started with r249919.
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #14 from Richard Biener ---
Author: rguenth
Date: Fri Dec 8 08:22:08 2017
New Revision: 255499

URL: https://gcc.gnu.org/viewcvs?rev=255499=gcc=rev
Log:
2017-12-08  Richard Biener

        PR tree-optimization/81303
        * gfortran.dg/pr81303.f: New testcase.
        * gfortran.dg/vect/pr81303.f: Likewise.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr81303.f
    trunk/gcc/testsuite/gfortran.dg/vect/pr81303.f
Modified:
    trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #12 from Richard Biener ---
Author: rguenth
Date: Fri Dec 8 08:06:31 2017
New Revision: 255497

URL: https://gcc.gnu.org/viewcvs?rev=255497=gcc=rev
Log:
2017-12-08  Richard Biener

        PR tree-optimization/81303
        * tree-vect-stmts.c (vect_is_simple_cond): For invariant
        conditions try to create a comparison vector type matching
        the data vector type.
        (vectorizable_condition): Adjust.
        * tree-vect-patterns.c (vect_recog_mask_conversion_pattern):
        Leave invariant conditions alone in case we can vectorize those.

        * gcc.target/i386/vectorize9.c: New testcase.
        * gcc.target/i386/vectorize10.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/vectorize10.c
    trunk/gcc/testsuite/gcc.target/i386/vectorize9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-patterns.c
    trunk/gcc/tree-vect-stmts.c
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #13 from Richard Biener ---
Fixed.
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #11 from amker at gcc dot gnu.org ---
Author: amker
Date: Thu Dec 7 18:03:53 2017
New Revision: 255472

URL: https://gcc.gnu.org/viewcvs?rev=255472=gcc=rev
Log:
        PR tree-optimization/81303
        * Makefile.in (gimple-loop-interchange.o): New object file.
        * common.opt (floop-interchange): Reuse the option from graphite.
        * doc/invoke.texi (-floop-interchange): Ditto.  New document for
        -floop-interchange and mention it for -O3.
        * opts.c (default_options_table): Enable -floop-interchange at -O3.
        * gimple-loop-interchange.cc: New file.
        * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
        (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
        * passes.def (pass_linterchange): New pass.
        * timevar.def (TV_LINTERCHANGE): New time var.
        * tree-pass.h (make_pass_linterchange): New declaration.
        * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
        interchange.  Record IV before/after increment in new parameters.
        * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
        * tree-vect-loop.c (vect_is_simple_reduction): Factor out reduction
        path check into...
        (check_reduction_path): ...New function here.
        * tree-vectorizer.h (check_reduction_path): New declaration.

        gcc/testsuite
        * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-1b.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-12.c: New test.
        * gcc.dg/tree-ssa/loop-interchange-13.c: New test.

Added:
    trunk/gcc/gimple-loop-interchange.cc
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-11.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-12.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-13.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1b.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/common.opt
    trunk/gcc/doc/invoke.texi
    trunk/gcc/opts.c
    trunk/gcc/params.def
    trunk/gcc/passes.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/timevar.def
    trunk/gcc/tree-pass.h
    trunk/gcc/tree-ssa-loop-ivcanon.c
    trunk/gcc/tree-ssa-loop-ivopts.h
    trunk/gcc/tree-vect-loop.c
    trunk/gcc/tree-vectorizer.h
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #10 from amker at gcc dot gnu.org ---
(In reply to Wilco from comment #8)
> (In reply to Richard Biener from comment #7)
>
> Unfortunately these commits have had no effect on AArch64...

That is because we (as well as powerpc?) don't peel for alignment now, so the
change in peeling cost has no impact on AArch64.
I still believe interchange (on the basis of distribution) is the correct
direction to fix this regression (and bring further improvement). Of course,
vectorization should be avoided even after interchange on x86.
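
For readers unfamiliar with the transformation mentioned in comment #10, here is a minimal C sketch of loop interchange. It is not taken from bwaves or from the new pass; the array shape (`N`, `M`) and both function names are made up for illustration. Integer elements are used so that the two loop orders give bit-identical sums (with floating point, reordering a reduction can change rounding).

```c
#include <assert.h>
#include <stddef.h>

enum { N = 8, M = 5 };  /* hypothetical extents, chosen only for the example */

/* Before interchange: the inner loop walks the first index, so in
   row-major C each inner step jumps M elements in memory.  */
long
sum_strided (long a[N][M])
{
  long s = 0;
  for (size_t j = 0; j < M; j++)
    for (size_t i = 0; i < N; i++)
      s += a[i][j];
  return s;
}

/* After interchange: the loops are swapped, so the inner loop walks
   contiguous memory.  The set of elements visited is unchanged.  */
long
sum_interchanged (long a[N][M])
{
  long s = 0;
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < M; j++)
      s += a[i][j];
  return s;
}
```

Both functions compute the same reduction; only the memory access stride of the inner loop differs, which is the property an interchange pass tries to improve.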
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #9 from Richard Biener ---
Author: rguenth
Date: Tue Jul 25 10:59:15 2017
New Revision: 250503

URL: https://gcc.gnu.org/viewcvs?rev=250503=gcc=rev
Log:
2017-07-25  Richard Biener

        PR tree-optimization/81303
        * tree-vect-loop-manip.c (vect_loop_versioning): Build
        profitability check against LOOP_VINFO_NITERSM1.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-loop-manip.c
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303 --- Comment #8 from Wilco --- (In reply to Richard Biener from comment #7) Unfortunately these commits have had no effect on AArch64...
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #7 from Richard Biener ---
Author: rguenth
Date: Fri Jul 21 11:32:39 2017
New Revision: 250424

URL: https://gcc.gnu.org/viewcvs?rev=250424=gcc=rev
Log:
2017-07-21  Richard Biener

        PR tree-optimization/81303
        * tree-vect-data-refs.c (vect_get_peeling_costs_all_drs): Pass
        in datarefs vector.  Allow NULL dr0 for no peeling cost estimate.
        (vect_peeling_hash_get_lowest_cost): Adjust.
        (vect_enhance_data_refs_alignment): Likewise.  Use
        vect_get_peeling_costs_all_drs to compute the penalty for no
        peeling to match up costs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #6 from Richard Biener ---
So if in addition to this patch we do

Index: gcc/tree-vect-loop.c
===
--- gcc/tree-vect-loop.c        (revision 250386)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -7376,7 +7377,7 @@ vect_transform_loop (loop_vec_info loop_
   /* Version the loop first, if required, so the profitability check
      comes first.  */

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (check_profitability || LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       vect_loop_versioning (loop_vinfo, th, check_profitability);
       check_profitability = false;

thus always do the profitability check by versioning, which means not sharing
the epilogue loop with the scalar execution (plus reliably executing the cost
model check first), we get down to 212s from 250s. This might be solely because
we do not completely peel the versioned copy as we are not able to analyze its
number of iterations (despite the dominating > 7 check).

  _33 = (unsigned int) _1;
  if (_33 > 7)
    goto ; [80.01%] [count: INV]
  else
    goto ; [19.99%] [count: INV]

   [3.00%] [count: INV]:

   [16.99%] [count: INV] loop 6 header:
  # m_23 = PHI <1(34), m_449(36)>
  ...
  m_449 = m_23 + 1;
  if (_1 < m_449)
    goto ; [17.65%] [count: INV]
  else
    goto ; [82.35%] [count: INV]

   [13.99%] [count: INV] loop 6 latch:
  goto ; [100.00%] [count: INV]

We're probably confused by the casting here, and inferring a range from just
the above for _1 would result in [INT_MIN, 7] only (good enough I guess).

We peel the vector epilogue because:

Loop 8 iterates at most 5 times.
Loop 8 likely iterates at most 5 times.
Estimating sizes for loop 8
 BB: 29, after_exit: 0
  size:   0 _372 = (integer(kind=8)) m_375;
  size:   1 _371 = _372 * stride.88_115;
  ...
  size:   1 _332 = _349 + _333;
  size:   1 m_331 = m_375 + 1;
  size:   2 if (_1 < m_331)
   Exit condition will be eliminated in last copy.
 BB: 30, after_exit: 1
size: 41-0, last_iteration: 41-2
  Loop size: 41
  Estimated size after unrolling: 162

That is, we determine that no stmts will be optimized away due to propagating
constants but then apply our usual 2/3 optimistic heuristic leading to that
estimate (max-completely-peeled-insns is 200).

For small trip count loops the advantage of peeling (irrespective of size)
is better branch predictor hitrate.

There's quite a mistake in the cost modeling of peeling for alignment, but even
with that fixed we end up with a nopeel inside-cost of 14 and a best peel
inside-cost of 13 (we manage to align one load). Now it doesn't take into
account outside cost at all, which is 59 vs 115, but it's hard to combine both
in a sensible way ...
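
The "Estimated size after unrolling: 162" figure in comment #6 can be reproduced with a small C sketch. The formula below is an assumption inferred from the dump and from the description of the 2/3 heuristic, not a copy of the unroller's code; the function and parameter names are hypothetical. It reads `size: 41-0` as 41 statements per copy with 0 eliminated, and `last_iteration: 41-2` as 41 statements with 2 eliminated in the last copy.

```c
#include <assert.h>

/* Hypothetical model of the unrolled-size estimate: `nunroll` full
   copies of the body plus one last copy, each minus the statements
   expected to be eliminated, then scaled by the optimistic 2/3.  */
long
estimated_unrolled_size (long body_size, long body_eliminated,
                         long last_size, long last_eliminated,
                         long nunroll)
{
  assert (nunroll >= 0);
  long insns = nunroll * (body_size - body_eliminated);
  insns += last_size - last_eliminated;
  insns = insns * 2 / 3;          /* the 2/3 heuristic from the comment */
  return insns > 0 ? insns : 1;
}
```

With the dump's values and 5 iterations, `estimated_unrolled_size (41, 0, 41, 2, 5)` gives (5 * 41 + 39) * 2 / 3 = 162, which stays under the max-completely-peeled-insns limit of 200 quoted above, so the loop is peeled.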
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #5 from Richard Biener ---
Author: rguenth
Date: Fri Jul 21 07:13:57 2017
New Revision: 250416

URL: https://gcc.gnu.org/viewcvs?rev=250416=gcc=rev
Log:
2016-07-21  Richard Biener

        PR tree-optimization/81303
        * tree-vect-loop.c (vect_estimate_min_profitable_iters): Take
        into account prologue and epilogue iterations when raising
        min_profitable_iters to sth at least covering one vector iteration.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-loop.c
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #4 from Richard Biener ---
So one useful change is the following, which makes the runtime profitability
threshold 6 so that the vector loop is never entered. Even though that should
be quite a predictable conditional jump, it turns out we mess up BB placement so
that the result isn't a big improvement (254s -> 250s). This is probably also
due to the fact that we end up peeling the inner loop completely (we know it
iterates <= profitability threshold times). Plus we do not version the loop
but share the non-profitable part with the peeled copy, making RA's job harder
:/

Index: gcc/tree-vect-loop.c
===
--- gcc/tree-vect-loop.c        (revision 250384)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -3702,8 +3702,9 @@ vect_estimate_min_profitable_iters (loop
                      "  Calculated minimum iters for profitability: %d\n",
                      min_profitable_iters);

-  min_profitable_iters =
-       min_profitable_iters < vf ? vf : min_profitable_iters;
+  /* We want the vectorized loop to execute at least once.  */
+  if (min_profitable_iters < (vf + peel_iters_prologue + peel_iters_epilogue))
+    min_profitable_iters = vf + peel_iters_prologue + peel_iters_epilogue;

   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
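
The effect of the patch in comment #4 can be sketched as two standalone C functions, one for the old clamp and one for the new one. The function names are hypothetical; only `vf`, `peel_iters_prologue` and `peel_iters_epilogue` mirror identifiers from the diff. The sample inputs used in the note below come from the cost dump in comment #1 and are for illustration only: the peel counts at the patched revision may differ, so the result need not match the threshold of 6 quoted above.

```c
#include <assert.h>

/* Old behaviour: only require one vector iteration's worth of work.  */
unsigned
min_iters_old (unsigned min_profitable_iters, unsigned vf)
{
  assert (vf > 0);
  return min_profitable_iters < vf ? vf : min_profitable_iters;
}

/* Patched behaviour: also count the scalar iterations spent in the
   peeled prologue and epilogue, so that the vector loop body is
   reached at least once when the threshold is met.  */
unsigned
min_iters_new (unsigned min_profitable_iters, unsigned vf,
               unsigned peel_iters_prologue, unsigned peel_iters_epilogue)
{
  assert (vf > 0);
  unsigned floor = vf + peel_iters_prologue + peel_iters_epilogue;
  return min_profitable_iters < floor ? floor : min_profitable_iters;
}
```

With a calculated minimum of 5, `vf` of 4 and 2 peeled iterations on each side, the old clamp leaves the minimum at 5 while the new one raises it to 8, so a loop that runs 5 times no longer passes the runtime check.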
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-07-12
                 CC|                            |wilco at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Wilco ---
Confirmed, on AArch64 bwaves is ~20% slower in SPEC2006 and ~30% slower in
SPEC2017. There are twice as many spills (outside the inner loop) and the
vectors are created in an inefficient way:

        ldr     d4, [x5,x27]
        ld1r    {v6.2d}, [x5]
        mov     v6.d[1], v4.d[0]
        add     x5, x5, x26
        fmla    v1.2d, v20.2d, v6.2d
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #2 from Richard Biener ---
Without peeling for alignment the numbers improve but we still regress from
176s to 205s. The innermost (unrolled) loop is:

.L11:
        vmovsd  (%rdi,%r15,2), %xmm2
        vmovsd  (%rsi,%r15,2), %xmm1
        movq    -56(%rbp), %rbx
        vmovhpd (%rdi,%r14), %xmm2, %xmm0
        vmovsd  (%rdi), %xmm2
        vmovhpd (%rsi,%r14), %xmm1, %xmm6
        vmovsd  (%rsi), %xmm1
        vmovhpd (%rdi,%r15), %xmm2, %xmm2
        vmovhpd (%rsi,%r15), %xmm1, %xmm1
        addq    %r11, %rdi
        addq    %r11, %rsi
        vinsertf128     $0x1, %xmm0, %ymm2, %ymm2
        vmovsd  (%rcx,%r15,2), %xmm0
        vinsertf128     $0x1, %xmm6, %ymm1, %ymm1
        vmulpd  (%rbx,%rax), %ymm2, %ymm3
        movq    -72(%rbp), %rbx
        vmovapd %ymm1, %ymm2
        vmovhpd (%rcx,%r14), %xmm0, %xmm6
        vmovsd  (%rcx), %xmm0
        vfmadd132pd     (%rbx,%rax), %ymm3, %ymm2
        movq    -88(%rbp), %rbx
        vmovhpd (%rcx,%r15), %xmm0, %xmm0
        addq    %r11, %rcx
        vinsertf128     $0x1, %xmm6, %ymm0, %ymm0
        vmulpd  (%rbx,%rax), %ymm0, %ymm3
        vmovsd  (%rdx,%r15,2), %xmm0
        movq    -64(%rbp), %rbx
        vmovhpd (%rdx,%r14), %xmm0, %xmm6
        vmovsd  (%rdx), %xmm0
        vmovhpd (%rdx,%r15), %xmm0, %xmm0
        addq    %r11, %rdx
        vinsertf128     $0x1, %xmm6, %ymm0, %ymm0
        vfmadd132pd     (%rbx,%rax), %ymm3, %ymm0
        vaddpd  %ymm0, %ymm2, %ymm1
        vmovsd  (%r9,%r15,2), %xmm0
        vmovhpd (%r9,%r14), %xmm0, %xmm3
        vmovsd  (%r9), %xmm0
        vmovhpd (%r9,%r15), %xmm0, %xmm0
        addq    %r11, %r9
        vinsertf128     $0x1, %xmm3, %ymm0, %ymm0
        vmulpd  (%r12,%rax), %ymm0, %ymm2
        vmovsd  (%r8,%r15,2), %xmm0
        vmovhpd (%r8,%r14), %xmm0, %xmm3
        vmovsd  (%r8), %xmm0
        vmovhpd (%r8,%r15), %xmm0, %xmm0
        movq    -80(%rbp), %rbx
        addq    %r11, %r8
        vinsertf128     $0x1, %xmm3, %ymm0, %ymm0
        vfmadd132pd     (%rbx,%rax), %ymm2, %ymm0
        vaddpd  %ymm0, %ymm1, %ymm0
        vmovsd  (%r10,%r15,2), %xmm1
        vmovhpd (%r10,%r14), %xmm1, %xmm2
        vmovsd  (%r10), %xmm1
        vmovhpd (%r10,%r15), %xmm1, %xmm1
        addq    %r11, %r10
        vinsertf128     $0x1, %xmm2, %ymm1, %ymm1
        vfmadd231pd     0(%r13,%rax), %ymm1, %ymm4
        addq    $32, %rax
        vaddpd  %ymm4, %ymm0, %ymm4
        cmpq    -96(%rbp), %rax
        jne     .L11

vs

.L10:
        vmovsd  (%rax,%rbx,8), %xmm0
        vmulsd  (%r15,%rdx), %xmm0, %xmm0
        vmovsd  (%r8,%rdx), %xmm1
        vfmadd132sd     (%rax,%r11,8), %xmm0, %xmm1
        vmovsd  (%rax,%rsi,8), %xmm0
        vmulsd  (%r12,%rdx), %xmm0, %xmm0
        vmovsd  0(%rbp,%rdx), %xmm4
        vfmadd231sd     (%rax), %xmm4, %xmm0
        vmovsd  (%r14,%rdx), %xmm5
        vmovsd  (%rdi,%rdx), %xmm6
        vfmadd231sd     (%rax,%r9,8), %xmm6, %xmm2
        vaddsd  %xmm0, %xmm1, %xmm0
        vmovsd  (%rax,%r10,8), %xmm1
        vmulsd  0(%r13,%rdx), %xmm1, %xmm1
        vfmadd231sd     (%rax,%rcx,8), %xmm5, %xmm1
        addq    -112(%rsp), %rdx
        addq    $8, %rax
        vaddsd  %xmm2, %xmm1, %xmm2
        vaddsd  %xmm2, %xmm0, %xmm2
        cmpq    -120(%rsp), %rax
        jne     .L10

Looks like register pressure is high and IVO doesn't do the best job either.
The vectorized loop might also run into CPU arch limits with respect to
loop cache (it's 310 bytes long).
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #1 from Richard Biener ---
Looks like we peel for alignment, which for this loop is quite pointless as it
only runs 5 times, so for AVX256 we're likely running into peel for alignment,
no vector iteration, epilogue. Need to tame down that damn alignment peeling
more ...

It peels 'x' btw.

block_solver.f:178:0: note: Cost model analysis:
  Vector inside of loop cost: 76
  Vector prologue cost: 61
  Vector epilogue cost: 62
  Scalar iteration cost: 28
  Scalar outside cost: 7
  Vector outside cost: 123
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5
block_solver.f:178:0: note:   Runtime profitability threshold = 4
block_solver.f:178:0: note:   Static estimate profitability threshold = 5

But that doesn't take into account that we may spend up to 3 scalar iterations
in the alignment prologue, and thus with niter < 7 we may never enter the
vector loop. The static estimate is similarly affected by this.
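
The arithmetic behind "with niter < 7 we never enter the vector loop" can be made concrete with a small C sketch. This is a simplified model, not GCC code: the function name and parameters are hypothetical, and it assumes a vectorization factor of 4 (four doubles per AVX256 vector) with up to 3 scalar iterations peeled for alignment, as stated in comment #1.

```c
#include <assert.h>

/* Simplified model: after `prologue` scalar iterations are peeled for
   alignment, count how many full vector iterations of width `vf` fit
   into the remaining trip count.  Leftovers go to the scalar epilogue.  */
unsigned
vector_iterations (unsigned niter, unsigned prologue, unsigned vf)
{
  assert (vf > 0);
  if (niter <= prologue)
    return 0;
  return (niter - prologue) / vf;
}
```

Under these assumptions, a loop with 5 iterations and a 3-iteration prologue executes zero vector iterations, and the first trip count that reaches the vector body is 3 + 4 = 7. A runtime threshold of 4 therefore lets the 5-iteration loop into the vectorized path even though it only ever runs prologue and epilogue code there.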
[Bug tree-optimization/81303] [8 Regression] 410.bwaves regression caused by r249919
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.0