https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365
--- Comment #4 from n8tm at aol dot com ---
On 11/16/2015 7:13 AM, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365
>
> Richard Biener <rguenth at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|WAITING                     |NEW
>
> --- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Hmm, there are many loops here.  I looked at the following (assuming the
> interesting loops are marked with safelen(1))
>
>       subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
>       use lcd_mod
> C
> C     linear dependence testing
> C     no dependence - vectorizable
> C
>       integer ntimes,ld,n,i,nl
>       real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
>       real t1,t2,chksum,ctime,dtime,cs1d
>       call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
>       call forttime(t1)
>       do nl= 1,2*ntimes
> #ifndef __MIC__
> !$omp simd safelen(1)
> #endif
>       do i= 2,n,2
>          a(i)= a(i-1)+b(i)
>       enddo
>       call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
>       enddo
>       call forttime(t2)
>
> and current trunk doesn't consider this profitable unless -mavx is given
> (it needs the larger vector size for profitability it seems).
>
> Because of the step 2 it ends up using strided stores.  Instead of
> doing interleaving on the loads and stores we could have just operated
> on all elements (rather than only even ones) and then use a masked
> store.  That would waste half of the vector bandwidth but save all the
> shuffles.
>
> .L8:
>         vmovups      (%rdx), %xmm0
>         addl         $1, %r9d
>         addq         $64, %rdx
>         addq         $64, %r11
>         vmovups      -32(%rdx), %xmm2
>         vinsertf128  $0x1, -48(%rdx), %ymm0, %ymm1
>         vmovups      -64(%r11), %xmm9
>         vinsertf128  $0x1, -16(%rdx), %ymm2, %ymm3
>         vmovups      -32(%r11), %xmm11
>         vinsertf128  $0x1, -48(%r11), %ymm9, %ymm10
>         vinsertf128  $0x1, -16(%r11), %ymm11, %ymm12
>         vshufps      $136, %ymm3, %ymm1, %ymm4
>         vshufps      $136, %ymm12, %ymm10, %ymm13
>         vperm2f128   $3, %ymm4, %ymm4, %ymm5
>         vperm2f128   $3, %ymm13, %ymm13, %ymm14
>         vshufps      $68, %ymm5, %ymm4, %ymm6
>         vshufps      $238, %ymm5, %ymm4, %ymm7
>         vshufps      $68, %ymm14, %ymm13, %ymm15
>         vshufps      $238, %ymm14, %ymm13, %ymm0
>         vinsertf128  $1, %xmm7, %ymm6, %ymm8
>         vinsertf128  $1, %xmm0, %ymm15, %ymm1
>         vaddps       %ymm1, %ymm8, %ymm2
>         vextractf128 $0x1, %ymm2, %xmm4
>         vmovss       %xmm2, -60(%rdx)
>         vextractps   $1, %xmm2, -52(%rdx)
>         vextractps   $2, %xmm2, -44(%rdx)
>         vextractps   $3, %xmm2, -36(%rdx)
>         vmovss       %xmm4, -28(%rdx)
>         vextractps   $1, %xmm4, -20(%rdx)
>         vextractps   $2, %xmm4, -12(%rdx)
>         vextractps   $3, %xmm4, -4(%rdx)
>         cmpl         %r9d, %ecx
>         ja           .L8
>
> what we fail to realize here is that cross-lane interleaving isn't working
> with AVX256 and thus the interleave for the loads is very much more expensive
> than we think.
>
> That's a general vectorizer cost model issue:
>
>       /* Uses an even and odd extract operations or shuffle operations
>          for each needed permute.  */
>       int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>       inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
>                                       stmt_info, 0, vect_body);
>
> which 1) doesn't consider single-element interleaving differently,
> 2) simply uses vec_perm cost which heavily depends on the actual
> (constant) permutation used

Thanks for the interesting analysis.  icc/icpc take safelen(1) as preventing
vectorization for this case, but I found another stride-2 case where they
still perform the unprofitable AVX vectorization.  Maybe I'll submit an
Intel PR (IPS).
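
For reference, a rough sketch (mine, not the bug's test case and not what GCC
emits) of the masked-store strategy comment #3 suggests, written with AVX
intrinsics against the 0-based C equivalent of the s111 loop.  The function
name, the mask choice and the scalar remainder handling are only illustrative:

#include <immintrin.h>

/* Sketch only: the s111 inner loop a(i) = a(i-1) + b(i), i = 2,n,2,
   rewritten 0-based (a[k] = a[k-1] + b[k] for odd k).  All eight
   consecutive lanes are computed; vmaskmovps then writes back only the
   lanes that the scalar loop actually stores, so no cross-lane shuffles
   are needed.  */
static void
s111_masked (float *a, const float *b, int n)
{
  /* Store mask: lanes 0,2,4,6 relative to &a[k] are the written elements. */
  const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
  int k = 1;

  for (; k + 7 < n; k += 8)
    {
      __m256 prev = _mm256_loadu_ps (&a[k - 1]);  /* a[k-1] .. a[k+6] */
      __m256 bv   = _mm256_loadu_ps (&b[k]);      /* b[k]   .. b[k+7] */
      __m256 sum  = _mm256_add_ps (prev, bv);     /* compute all lanes */
      _mm256_maskstore_ps (&a[k], mask, sum);     /* store even lanes only */
    }
  for (; k < n; k += 2)                           /* scalar remainder */
    a[k] = a[k - 1] + b[k];
}

Half of the arithmetic lanes are thrown away, which is the bandwidth trade-off
comment #3 describes, but all the vinsertf128/vshufps/vperm2f128 traffic and
the element-wise vextractps stores disappear.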