https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
First of all, thanks a lot for the reproducer! Here are times with the vectorizer enabled, disabled, and with the cost model disabled:

 Performance counter stats for './a.out-vect 21 1000000' (10 runs):

     4588.055614  task-clock:u (msec)   # 1.000 CPUs utilized          ( +- 0.49% )
               0  context-switches:u    # 0.000 K/sec
               0  cpu-migrations:u      # 0.000 K/sec
              88  page-faults:u         # 0.019 K/sec                  ( +- 0.44% )
  14,911,755,271  cycles:u              # 3.250 GHz                    ( +- 0.37% )
  52,564,741,152  instructions:u        # 3.53 insn per cycle          ( +- 0.00% )
   4,073,206,037  branches:u            # 887.785 M/sec                ( +- 0.00% )
      18,106,857  branch-misses:u       # 0.44% of all branches        ( +- 0.30% )

     4.589172192 seconds time elapsed                                  ( +- 0.50% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 21 1000000

 Performance counter stats for './a.out-novect 21 1000000' (10 runs):

     3549.651576  task-clock:u (msec)   # 1.000 CPUs utilized          ( +- 0.65% )
               0  context-switches:u    # 0.000 K/sec
               0  cpu-migrations:u      # 0.000 K/sec
              88  page-faults:u         # 0.025 K/sec                  ( +- 0.42% )
  11,563,811,687  cycles:u              # 3.258 GHz                    ( +- 0.61% )
  39,259,740,624  instructions:u        # 3.40 insn per cycle          ( +- 0.00% )
   3,061,205,511  branches:u            # 862.396 M/sec                ( +- 0.00% )
      11,774,836  branch-misses:u       # 0.38% of all branches        ( +- 0.36% )

     3.550955730 seconds time elapsed                                  ( +- 0.65% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-nocost 21 1000000

 Performance counter stats for './a.out-nocost 21 1000000' (10 runs):

     4621.515923  task-clock:u (msec)   # 1.000 CPUs utilized          ( +- 0.31% )
               0  context-switches:u    # 0.000 K/sec
               0  cpu-migrations:u      # 0.000 K/sec
              87  page-faults:u         # 0.019 K/sec                  ( +- 0.35% )
  14,965,340,896  cycles:u              # 3.238 GHz                    ( +- 0.30% )
  52,817,740,929  instructions:u        # 3.53 insn per cycle          ( +- 0.00% )
   4,326,205,814  branches:u            # 936.101 M/sec                ( +- 0.00% )
      16,615,805  branch-misses:u       # 0.38% of all branches        ( +- 0.10% )

     4.622600700 seconds time elapsed                                  ( +- 0.31% )

So vectorization hurts both in time and in instruction count.
There are two loops to vectorize. The first:

  _34 = _74 + S.2_106;
  _35 = _34 * _121;
  _36 = _35 + _124;
  _38 = _36 * _37;
  _39 = (sizetype) _38;
  _40 = _72 + _39;
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];
  *_40 = _41;
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
    goto <bb 9>; [15.00%]
  else
    goto loopback; [85.00%]

  Vector inside of loop cost: 76
  Vector prologue cost: 24
  Vector epilogue cost: 48
  Scalar iteration cost: 24
  Scalar outside cost: 24
  Vector outside cost: 72
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3
tonto.f90:26:0: note: Runtime profitability threshold = 4
tonto.f90:26:0: note: Static estimate profitability threshold = 11

and the second:

  _18 = S.2_105 + 1;
  _19 = _18 * _61;
  _2 = _19 - _61;
  _21 = _2 * _3;
  _22 = (sizetype) _21;
  _23 = _11 + _22;
  _24 = *_23;
  _25 = _70 + S.2_105;
  _26 = _25 * _117;
  _27 = _26 + _120;
  _29 = _27 * _28;
  _30 = (sizetype) _29;
  _31 = _68 + _30;
  _32 = *_31;
  _33 = _24 * _32;
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;
  if (_18 > _77)
    goto <bb 7>; [15.00%]
  else
    goto loopback; [85.00%]

  Vector inside of loop cost: 176
  Vector prologue cost: 24
  Vector epilogue cost: 112
  Scalar iteration cost: 48
  Scalar outside cost: 24
  Vector outside cost: 136
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
  Static estimate profitability threshold = 18

Both loops iterate about 9 times, so the iteration counts are close to the thresholds below which the vectorized body is never executed. The slowdown therefore seems to be just collateral damage of adding a vectorized loop for something that is not executed often enough.
This is how we handle the first loop:

  _34 = _74 + S.2_106;                               irrelevant
  _35 = _34 * _121;                                  irrelevant
  _36 = _35 + _124;                                  irrelevant
  _38 = _36 * _37;                                   irrelevant
  _39 = (sizetype) _38;                              irrelevant
  _40 = _72 + _39;                                   irrelevant
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];  accounted as an unaligned
                                                     vector load, inside_cost = 12
  *_40 = _41;                                        inside_cost = 64, accounted
                                                     as 4 stores (4*12 = 48) and
                                                     4 vec_to_scalar (4*4 = 16)
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
    goto <bb 9>; [15.00%]
  else
    goto loopback; [85.00%]

The scalar code is costed as load + store = 12 + 12 = 24. The accounted costs should match the latencies multiplied by 4.

Now the real code, with unrolling/scheduling disabled:

        movq    %rdx, %rsi
        salq    $5, %rsi
        vmovupd (%r8,%rsi), %ymm0               # vector load

Here the cost model is correct: the load is 3 cycles and we multiply by 4.

        vmovlpd %xmm0, (%rax)                   # vector store
        vmovhpd %xmm0, (%rax,%rcx)
        vextractf128 $0x1, %ymm0, %xmm0
        vmovlpd %xmm0, (%rax,%rcx,2)
        vmovhpd %xmm0, (%rax,%rdi)

Here the cost model is somewhat pessimistic, because it accounts 4*1 cycles for unpacking the vector to scalars on top of the cycles for the stores, while the unpacking is actually hidden in the vmov* instructions (which the cost model does not know about) and the vextractf128, which has latency 3.
        incq    %rdx
        addq    %r9, %rax
        cmpq    %rdx, %r13
        jne     .L10

Second loop:

  _18 = S.2_105 + 1;     irrelevant
  _19 = _18 * _61;       irrelevant
  _2 = _19 - _61;        irrelevant
  _21 = _2 * _3;         irrelevant
  _22 = (sizetype) _21;  irrelevant
  _23 = _11 + _22;       irrelevant
  _24 = *_23;            strided load: 4*12 (4 loads) + 4*4 (4 vec_construct)
                         + 48 (vec construct) = 72
  _25 = _70 + S.2_105;   irrelevant
  _26 = _25 * _117;      irrelevant
  _27 = _26 + _120;      irrelevant
  _29 = _27 * _28;       irrelevant
  _30 = (sizetype) _29;  irrelevant
  _31 = _68 + _30;       irrelevant
  _32 = *_31;            strided load: 72 again
  _33 = _24 * _32;       vector multiply: 4*5 (5-cycle multiply)
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;   vector store: 12
  if (_18 > _77)
    goto <bb 7>; [15.00%]
  else
    goto loopback; [85.00%]

The generated code is:

        vmovsd  (%rsi), %xmm1
        vmovhpd (%rsi,%r15), %xmm1, %xmm2
        vmovsd  (%rcx), %xmm0
        vmovhpd (%rcx,%r15), %xmm0, %xmm0
        vinsertf128 $0x1, %xmm2, %ymm0, %ymm1   # strided load
        vmovsd  (%rdi), %xmm0
        vmovhpd (%rdi,%r9), %xmm0, %xmm2
        vmovsd  (%rax), %xmm0
        vmovhpd (%rax,%r9), %xmm0, %xmm0
        vinsertf128 $0x1, %xmm2, %ymm0, %ymm0   # strided load
        vmulpd  %ymm0, %ymm1, %ymm0             # vector multiply
        movq    %rdx, %r11
        salq    $5, %r11
        vmovupd %ymm0, (%r8,%r11)               # vector store
        incq    %rdx
        addq    %r10, %rax
        addq    %rbx, %rcx
        addq    %rbx, %rsi
        addq    %r10, %rdi
        cmpq    %rdx, %r13
        jne     .L8

So it seems that both loop bodies are faster in the vector version and the cost model is actually on the conservative side. Again, the real strided-load latency is much shorter: 3+3+3 = 9 cycles instead of the 18 we estimate. The slowdown comes from outside the loop bodies - we end up spilling to the stack a lot.
This can be checked by increasing the trip count of the loops from approximately 10 up:

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 1000 10
jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-vect 1000 10

 Performance counter stats for './a.out-vect 1000 10' (10 runs):

     3331.057127  task-clock:u (msec)   # 1.000 CPUs utilized          ( +- 1.12% )
               0  context-switches:u    # 0.000 K/sec
               0  cpu-migrations:u      # 0.000 K/sec
         470,018  page-faults:u         # 0.141 M/sec                  ( +- 0.02% )
   9,387,354,503  cycles:u              # 2.818 GHz                    ( +- 1.26% )
  18,782,578,729  instructions:u        # 2.00 insn per cycle          ( +- 0.00% )
     522,121,680  branches:u            # 156.744 M/sec                ( +- 0.00% )
         560,198  branch-misses:u       # 0.11% of all branches        ( +- 0.55% )

     3.332128337 seconds time elapsed                                  ( +- 1.12% )

 Performance counter stats for './a.out-novect 1000 10' (10 runs):

     3576.167709  task-clock:u (msec)   # 0.999 CPUs utilized          ( +- 0.69% )
               0  context-switches:u    # 0.000 K/sec
               0  cpu-migrations:u      # 0.000 K/sec
         470,222  page-faults:u         # 0.131 M/sec                  ( +- 0.02% )
   9,938,129,346  cycles:u              # 2.779 GHz                    ( +- 0.84% )
  22,317,980,606  instructions:u        # 2.25 insn per cycle          ( +- 0.00% )
     692,632,083  branches:u            # 193.680 M/sec                ( +- 0.00% )
         626,346  branch-misses:u       # 0.09% of all branches        ( +- 3.29% )

     3.577986466 seconds time elapsed                                  ( +- 0.68% )

So I am not sure how much we can fix in the cost model itself (we should consider accounting for throughputs rather than latencies, but that would only make vectorization happen more often). Because the upper bound on the loop iteration count is not known, we can probably try to make the vectorizer less harmful. I will look into the profile of the non-vectorized path and see if I can account for it better. Register allocation is pretty bad: since the internal loop consumes all registers, we end up doing almost everything outside the main loop on the stack. Another possible improvement would be to unswitch the outer loop on the two conditionals, since the number of iterations of the vectorized loop is determined by the outermost loop.
Of course we only do unswitching before vectorization, which kills this possibility, and I think we do it only for innermost loops. Perhaps ivopts can also be improved to consume fewer registers in the vectorized loop body. Richi, any extra ideas?