https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82862

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
First of all, thanks a lot for the reproducer!

Here are the times with the vectorizer enabled, disabled, and with the cost
model disabled:
 Performance counter stats for './a.out-vect 21 1000000' (10 runs):

       4588.055614      task-clock:u (msec)       #    1.000 CPUs utilized     
      ( +-  0.49% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                88      page-faults:u             #    0.019 K/sec             
      ( +-  0.44% )
    14,911,755,271      cycles:u                  #    3.250 GHz               
      ( +-  0.37% )
    52,564,741,152      instructions:u            #    3.53  insn per cycle    
      ( +-  0.00% )
     4,073,206,037      branches:u                #  887.785 M/sec             
      ( +-  0.00% )
        18,106,857      branch-misses:u           #    0.44% of all branches   
      ( +-  0.30% )

       4.589172192 seconds time elapsed                                        
 ( +-  0.50% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 21 1000000

 Performance counter stats for './a.out-novect 21 1000000' (10 runs):

       3549.651576      task-clock:u (msec)       #    1.000 CPUs utilized     
      ( +-  0.65% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                88      page-faults:u             #    0.025 K/sec             
      ( +-  0.42% )
    11,563,811,687      cycles:u                  #    3.258 GHz               
      ( +-  0.61% )
    39,259,740,624      instructions:u            #    3.40  insn per cycle    
      ( +-  0.00% )
     3,061,205,511      branches:u                #  862.396 M/sec             
      ( +-  0.00% )
        11,774,836      branch-misses:u           #    0.38% of all branches   
      ( +-  0.36% )

       3.550955730 seconds time elapsed                                        
 ( +-  0.65% )

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-nocost 21 1000000

 Performance counter stats for './a.out-nocost 21 1000000' (10 runs):

       4621.515923      task-clock:u (msec)       #    1.000 CPUs utilized     
      ( +-  0.31% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                87      page-faults:u             #    0.019 K/sec             
      ( +-  0.35% )
    14,965,340,896      cycles:u                  #    3.238 GHz               
      ( +-  0.30% )
    52,817,740,929      instructions:u            #    3.53  insn per cycle    
      ( +-  0.00% )
     4,326,205,814      branches:u                #  936.101 M/sec             
      ( +-  0.00% )
        16,615,805      branch-misses:u           #    0.38% of all branches   
      ( +-  0.10% )

       4.622600700 seconds time elapsed                                        
 ( +-  0.31% )

So vectorization hurts both time and instruction count: the vectorized binary
executes about 34% more instructions and takes about 29% more cycles than the
non-vectorized one, and disabling the cost model makes it slightly worse still.


There are two loops to vectorize.

  _34 = _74 + S.2_106;
  _35 = _34 * _121;
  _36 = _35 + _124;
  _38 = _36 * _37;
  _39 = (sizetype) _38;
  _40 = _72 + _39;
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];
  *_40 = _41;
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
    goto <bb 9>; [15.00%]
  else
    goto loopback; [85.00%]

  Vector inside of loop cost: 76
  Vector prologue cost: 24
  Vector epilogue cost: 48
  Scalar iteration cost: 24
  Scalar outside cost: 24
  Vector outside cost: 72
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3
tonto.f90:26:0: note:   Runtime profitability threshold = 4
tonto.f90:26:0: note:   Static estimate profitability threshold = 11

and

  _18 = S.2_105 + 1;
  _19 = _18 * _61;
  _2 = _19 - _61;
  _21 = _2 * _3;
  _22 = (sizetype) _21;
  _23 = _11 + _22;
  _24 = *_23;
  _25 = _70 + S.2_105;
  _26 = _25 * _117;
  _27 = _26 + _120;
  _29 = _27 * _28;
  _30 = (sizetype) _29;
  _31 = _68 + _30;
  _32 = *_31;
  _33 = _24 * _32;
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;
  if (_18 > _77)
    goto <bb 7>; [15.00%]
  else
    loopback; [85.00%]


  Vector inside of loop cost: 176
  Vector prologue cost: 24
  Vector epilogue cost: 112
  Scalar iteration cost: 48
  Scalar outside cost: 24
  Vector outside cost: 136
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
  Static estimate profitability threshold = 18
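
These thresholds are consistent with the arithmetic done by
vect_estimate_min_profitable_iters (tree-vect-loop.c).  A quick cross-check in
C - a sketch of the formulas, not a copy of the code - assuming VF = 4 (four
doubles per 256-bit vector) with 0 prologue and 2 epilogue iterations:

#include <stdio.h>

/* Sketch of the profitability arithmetic; the constants passed from main ()
   are the numbers from the two cost dumps above.  */
static void
thresholds (int vec_inside, int vec_outside, int scalar_iter,
	    int scalar_outside, int vf, int epilogue_iters)
{
  int denom = scalar_iter * vf - vec_inside;
  int min_iters = ((vec_outside - scalar_outside) * vf
		   - vec_inside * epilogue_iters) / denom;
  /* Round up: bump by one if at MIN_ITERS the vectorized variant is not yet
     strictly cheaper.  */
  if (scalar_iter * vf * min_iters
      <= vec_inside * min_iters + vec_outside - scalar_outside)
    min_iters++;
  /* The runtime threshold additionally requires at least one full vector
     iteration.  */
  int runtime_th = min_iters < vf ? vf : min_iters;
  /* The static estimate charges the scalar outside cost on the vector side
     instead of subtracting it.  */
  int static_th = ((vec_outside + scalar_outside) * vf
		   - vec_inside * epilogue_iters) / denom;
  printf ("min iters %d, runtime threshold %d, static estimate %d\n",
	  min_iters, runtime_th, static_th);
}

int
main (void)
{
  thresholds (76, 72, 24, 24, 4, 2);    /* first loop: 3, 4, 11 */
  thresholds (176, 136, 48, 24, 4, 2);  /* second loop: dump shows 7 and 18 */
  return 0;
}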

Both loops iterate about 9 times, so the trip counts are close to the
thresholds below which the vectorized copies would never be executed.  The
slowdown therefore seems to be just collateral damage from adding a vectorized
loop for something that is not executed often enough.
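
Concretely, with a runtime threshold of 4 the emitted code is roughly shaped
like the following C sketch (hypothetical, not the actual output): a guard on
the iteration count, a vector body doing four elements per iteration, and a
scalar loop serving as both epilogue and fallback.

/* Hypothetical C sketch of the shape the vectorized copy takes with a
   runtime profitability threshold of 4 and VF = 4.  */
void
copy (double *restrict a, const double *restrict b, long niters)
{
  long i = 0;
  if (niters >= 4)                      /* runtime threshold / VF guard */
    for (; i + 4 <= niters; i += 4)     /* vector body, 4 doubles at a time */
      {
	a[i] = b[i];
	a[i + 1] = b[i + 1];
	a[i + 2] = b[i + 2];
	a[i + 3] = b[i + 3];
      }
  for (; i < niters; i++)               /* scalar epilogue / fallback */
    a[i] = b[i];
}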

This is how we handle the first loop:


  _34 = _74 + S.2_106;                                 irrelevant
  _35 = _34 * _121;                                    irrelevant
  _36 = _35 + _124;                                    irrelevant
  _38 = _36 * _37;                                     irrelevant
  _39 = (sizetype) _38;                                irrelevant
  _40 = _72 + _39;                                     irrelevant
  _41 = MEM[(real(kind=8)[0:] *)A.14_116][S.2_106];    This is accounted as
                                                       an unaligned vector load
      inside_cost = 12
  *_40 = _41;
      inside_cost = 64
      This is accounted as 4 stores (12*4=48) and 4 vec_to_scalar (4*4=16)
  S.2_88 = S.2_106 + 1;
  if (_77 < S.2_88)
    goto <bb 9>; [15.00%]
  else
    goto loopback; [85.00%]

Scalar code is counted as load+store = 12+12 = 24 per iteration, i.e. 4*24 = 96
for the four iterations one vector iteration replaces, versus the vector inside
cost of 76, which is why the body looks profitable to the model.
The accounted costs should match latencies multiplied by 4.

Now the real code, with unrolling/scheduling disabled:

        movq    %rdx, %rsi
        salq    $5, %rsi

 vector load
        vmovupd (%r8,%rsi), %ymm0

          Here the cost model is correct: the load has a 3-cycle latency and we
          multiply by 4.

 vector store
        vmovlpd %xmm0, (%rax)
        vmovhpd %xmm0, (%rax,%rcx)
        vextractf128    $0x1, %ymm0, %xmm0
        vmovlpd %xmm0, (%rax,%rcx,2)
        vmovhpd %xmm0, (%rax,%rdi)

          Here the cost model is somewhat pessimistic because it accounts 4*1
          cycles for unpacking vectors to scalars and 3*1 cycles for stores,
          while in the real code the unpacking is hidden in the vmov* stores,
          which the cost model does not know about, plus one vextractf128,
          which has latency 3.

        incq    %rdx
        addq    %r9, %rax
        cmpq    %rdx, %r13
        jne     .L10


Second loop:
  _18 = S.2_105 + 1;                                 irrelevant
  _19 = _18 * _61;                                   irrelevant
  _2 = _19 - _61;                                    irrelevant
  _21 = _2 * _3;                                     irrelevant
  _22 = (sizetype) _21;                              irrelevant
  _23 = _11 + _22;                                   irrelevant
  _24 = *_23;                                        Strided load
        4*12 (4 loads) + 4*4 (4 vec_construct) + 48 (vec construct) = 72

  _25 = _70 + S.2_105;                               irrelevant
  _26 = _25 * _117;                                  irrelevant
  _27 = _26 + _120;                                  irrelevant
  _29 = _27 * _28;                                   irrelevant
  _30 = (sizetype) _29;                              irrelevant
  _31 = _68 + _30;                                   irrelevant
  _32 = *_31;                                        Strided load
        72 again
  _33 = _24 * _32;                                   vector multiply
        4*5 (5 cycles mult)
  MEM[(real(kind=8)[0:] *)A.14_116][S.2_105] = _33;  Vector store
        12
  if (_18 > _77)
    goto <bb 7>; [15.00%]
  else
    loopback; [85.00%]

Generated code is:

        vmovsd  (%rsi), %xmm1
        vmovhpd (%rsi,%r15), %xmm1, %xmm2
        vmovsd  (%rcx), %xmm0
        vmovhpd (%rcx,%r15), %xmm0, %xmm0
        vinsertf128     $0x1, %xmm2, %ymm0, %ymm1
           strided load

        vmovsd  (%rdi), %xmm0
        vmovhpd (%rdi,%r9), %xmm0, %xmm2
        vmovsd  (%rax), %xmm0
        vmovhpd (%rax,%r9), %xmm0, %xmm0
        vinsertf128     $0x1, %xmm2, %ymm0, %ymm0
           strided load

        vmulpd  %ymm0, %ymm1, %ymm0
           vector mult

        movq    %rdx, %r11
        salq    $5, %r11
        vmovupd %ymm0, (%r8,%r11)
           vector store

        incq    %rdx
        addq    %r10, %rax
        addq    %rbx, %rcx
        addq    %rbx, %rsi
        addq    %r10, %rdi
        cmpq    %rdx, %r13
        jne     .L8

So it seems that both loop bodies are actually faster in the vectorized version
and the cost model is, if anything, on the conservative side.  Again the real
strided load latency is much shorter: the critical path is 3+3+3 = 9 cycles
instead of the 18 cycles (cost 72, divided by 4) that we estimate.

The slowdown comes from outside of the loop body - we end up spilling to the
stack a lot.  This can be checked by increasing the trip count of the loops
from approximately 10 upwards:
jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-novect 1000 10

jan@skylake:~/trunk/build/tonto> perf stat --repeat 10 ./a.out-vect 1000 10

 Performance counter stats for './a.out-vect 1000 10' (10 runs):

       3331.057127      task-clock:u (msec)       #    1.000 CPUs utilized     
      ( +-  1.12% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           470,018      page-faults:u             #    0.141 M/sec             
      ( +-  0.02% )
     9,387,354,503      cycles:u                  #    2.818 GHz               
      ( +-  1.26% )
    18,782,578,729      instructions:u            #    2.00  insn per cycle    
      ( +-  0.00% )
       522,121,680      branches:u                #  156.744 M/sec             
      ( +-  0.00% )
           560,198      branch-misses:u           #    0.11% of all branches   
      ( +-  0.55% )

       3.332128337 seconds time elapsed                                        
 ( +-  1.12% )

 Performance counter stats for './a.out-novect 1000 10' (10 runs):

       3576.167709      task-clock:u (msec)       #    0.999 CPUs utilized     
      ( +-  0.69% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           470,222      page-faults:u             #    0.131 M/sec             
      ( +-  0.02% )
     9,938,129,346      cycles:u                  #    2.779 GHz               
      ( +-  0.84% )
    22,317,980,606      instructions:u            #    2.25  insn per cycle    
      ( +-  0.00% )
       692,632,083      branches:u                #  193.680 M/sec             
      ( +-  0.00% )
           626,346      branch-misses:u           #    0.09% of all branches   
      ( +-  3.29% )

       3.577986466 seconds time elapsed                                        
 ( +-  0.68% )

So I am not sure how much we can fix in the cost model itself (we should
consider accounting for throughputs rather than latencies, but that would only
make vectorization happen more often).

Because the upper bound on the loop iteration count is not known, we can
probably try to make the vectorizer less harmful.

I will look into the profile of the non-vectorized path and see if I can
account for it better.  Register allocation is pretty bad: since the inner loop
consumes all registers, we end up doing almost everything outside the main loop
on the stack.

Another possible improvement would be to unswitch the outer loop on the two
conditionals.  The number of iterations of the vectorized loop is determined by
the outermost loop.
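
For illustration, what I have in mind is roughly the following transformation
(a hypothetical C analogue, not the actual tonto.f90 code): the condition is
invariant in the outer loop, so it can be tested once and the loop nest
duplicated into two specialized copies with straight-line bodies.

/* Hypothetical example of unswitching an outer loop on an invariant
   condition; "flag" does not change inside the nest.  */
void
before (double *a, const double *b, int n, int flag)
{
  for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
      {
	if (flag)                       /* invariant, tested every iteration */
	  a[j * n + i] = b[i] * 2.0;
	else
	  a[j * n + i] = b[i];
      }
}

void
after (double *a, const double *b, int n, int flag)
{
  if (flag)                             /* tested once, outside the nest */
    for (int j = 0; j < n; j++)
      for (int i = 0; i < n; i++)
	a[j * n + i] = b[i] * 2.0;
  else
    for (int j = 0; j < n; j++)
      for (int i = 0; i < n; i++)
	a[j * n + i] = b[i];
}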

Of course we only do unswitching before vectorization, which kills this
possibility, and I think we only do it for innermost loops.
Perhaps ivopts could also be improved to use fewer registers in the vectorized
loop body.

Richi, any extra ideas?
