https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365
--- Comment #4 from n8tm at aol dot com ---
On 11/16/2015 7:13 AM, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365
>
> Richard Biener <rguenth at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|WAITING                     |NEW
>
> --- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Hmm, there are many loops here.  I looked at the following (assuming the
> interesting loops are marked with safelen(1))
>
>       subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
>       use lcd_mod
> C
> C     linear dependence testing
> C     no dependence - vectorizable
> C
>       integer ntimes,ld,n,i,nl
>       real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
>       real t1,t2,chksum,ctime,dtime,cs1d
>       call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
>       call forttime(t1)
>       do nl= 1,2*ntimes
> #ifndef __MIC__
> !$omp simd safelen(1)
> #endif
>       do i= 2,n,2
>          a(i)= a(i-1)+b(i)
>       enddo
>       call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
>       enddo
>       call forttime(t2)
>
> and current trunk doesn't consider this profitable unless -mavx is given
> (it needs the larger vector size for profitability it seems).
>
> Because of the step 2 it ends up using strided stores.  Instead of
> doing interleaving on the loads and stores we could have just operated
> on all elements (rather than only even ones) and then use a masked
> store.  That would waste half of the vector bandwidth but save all the
> shuffles.
>
> .L8:
>         vmovups      (%rdx), %xmm0
>         addl         $1, %r9d
>         addq         $64, %rdx
>         addq         $64, %r11
>         vmovups      -32(%rdx), %xmm2
>         vinsertf128  $0x1, -48(%rdx), %ymm0, %ymm1
>         vmovups      -64(%r11), %xmm9
>         vinsertf128  $0x1, -16(%rdx), %ymm2, %ymm3
>         vmovups      -32(%r11), %xmm11
>         vinsertf128  $0x1, -48(%r11), %ymm9, %ymm10
>         vinsertf128  $0x1, -16(%r11), %ymm11, %ymm12
>         vshufps      $136, %ymm3, %ymm1, %ymm4
>         vshufps      $136, %ymm12, %ymm10, %ymm13
>         vperm2f128   $3, %ymm4, %ymm4, %ymm5
>         vperm2f128   $3, %ymm13, %ymm13, %ymm14
>         vshufps      $68, %ymm5, %ymm4, %ymm6
>         vshufps      $238, %ymm5, %ymm4, %ymm7
>         vshufps      $68, %ymm14, %ymm13, %ymm15
>         vshufps      $238, %ymm14, %ymm13, %ymm0
>         vinsertf128  $1, %xmm7, %ymm6, %ymm8
>         vinsertf128  $1, %xmm0, %ymm15, %ymm1
>         vaddps       %ymm1, %ymm8, %ymm2
>         vextractf128 $0x1, %ymm2, %xmm4
>         vmovss       %xmm2, -60(%rdx)
>         vextractps   $1, %xmm2, -52(%rdx)
>         vextractps   $2, %xmm2, -44(%rdx)
>         vextractps   $3, %xmm2, -36(%rdx)
>         vmovss       %xmm4, -28(%rdx)
>         vextractps   $1, %xmm4, -20(%rdx)
>         vextractps   $2, %xmm4, -12(%rdx)
>         vextractps   $3, %xmm4, -4(%rdx)
>         cmpl         %r9d, %ecx
>         ja           .L8
>
> what we fail to realize here is that cross-lane interleaving isn't working
> with AVX256 and thus the interleave for the loads is very much more expensive
> than we think.
>
> That's a general vectorizer cost model issue:
>
>       /* Uses an even and odd extract operations or shuffle operations
>          for each needed permute.  */
>       int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>       inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
>                                       stmt_info, 0, vect_body);
>
> which 1) doesn't consider single-element interleaving differently,
> 2) simply uses vec_perm cost which heavily depends on the actual
> (constant) permutation used

Thanks for the interesting analysis.  icc/icpc take safelen(1) as preventing
vectorization for this case, but I found another stride-2 case where they
still perform the unprofitable AVX vectorization.  Maybe I'll submit an
Intel PR (IPS).
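
For reference, a rough sketch (mine, not the bug's test case and not what GCC
emits) of the masked-store strategy comment #3 suggests, written with AVX
intrinsics against the 0-based C equivalent of the s111 loop.  The function
name, the mask choice and the scalar remainder handling are only illustrative:

#include <immintrin.h>

/* Sketch only: the s111 inner loop a(i) = a(i-1) + b(i), i = 2,n,2,
   rewritten 0-based (a[k] = a[k-1] + b[k] for odd k).  All eight
   consecutive lanes are computed; vmaskmovps then writes back only the
   lanes that the scalar loop actually stores, so no cross-lane shuffles
   are needed.  */
static void
s111_masked (float *a, const float *b, int n)
{
  /* Store mask: lanes 0,2,4,6 relative to &a[k] are the written elements. */
  const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
  int k = 1;

  for (; k + 7 < n; k += 8)
    {
      __m256 prev = _mm256_loadu_ps (&a[k - 1]);  /* a[k-1] .. a[k+6] */
      __m256 bv   = _mm256_loadu_ps (&b[k]);      /* b[k]   .. b[k+7] */
      __m256 sum  = _mm256_add_ps (prev, bv);     /* compute all lanes */
      _mm256_maskstore_ps (&a[k], mask, sum);     /* store even lanes only */
    }
  for (; k < n; k += 2)                           /* scalar remainder */
    a[k] = a[k - 1] + b[k];
}

Half of the arithmetic lanes are thrown away, which is the bandwidth trade-off
comment #3 describes, but all the vinsertf128/vshufps/vperm2f128 traffic and
the element-wise vextractps stores disappear.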