On Thu, Jun 28, 2012 at 08:57:23AM -0700, Richard Henderson wrote:
> On 2012-06-28 07:05, Jakub Jelinek wrote:
> > Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems
> > to pessimize the generated code for the gcc.dg/vect/pr51581-3.c
> > testcase (at least with -O3 -mavx) compared to when the hooks aren't
> > present, because i?86 has more natural support for widen mult lo/hi
> > compared to widen mult even/odd, but I assume that on powerpc it is the
> > other way around.  So, how should I find out, if both VEC_WIDEN_MULT_*_EXPR
> > and builtin_mul_widen_* are possible for the particular vectype, which one
> > will be cheaper?
>
> I would assume that if the builtin exists, then it is cheaper.
>
> I disagree about "x86 has more natural support for hi/lo".  The basic sse2
> multiplication is even.  One shift per input is needed to generate odd.
> On the other hand, one interleave per input is required for both hi/lo.
> So 4 setup insns for hi/lo, and 2 setup insns for even/odd.  And on top of
> all that, XOP includes multiply odd at least for signed V4SI.
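To make those setup insn counts concrete, the two lowerings for unsigned
V4SI could be written with SSE2 intrinsics roughly like this (the
mul_widen_* helpers are just for illustration, not what the vectorizer
actually emits):

#include <emmintrin.h>

/* Even elements (0 and 2): pmuludq already multiplies the low 32 bits
   of each 64-bit lane, so no setup insns are needed.  */
static inline __m128i
mul_widen_even (__m128i a, __m128i b)
{
  return _mm_mul_epu32 (a, b);
}

/* Odd elements (1 and 3): one shift per input to move them into the
   even positions first, i.e. 2 setup insns for the even/odd pair.  */
static inline __m128i
mul_widen_odd (__m128i a, __m128i b)
{
  return _mm_mul_epu32 (_mm_srli_epi64 (a, 32), _mm_srli_epi64 (b, 32));
}

/* Elements 0 and 1: one interleave per input, and likewise for
   elements 2 and 3 below, i.e. 4 setup insns for the lo/hi pair.  */
static inline __m128i
mul_widen_lo (__m128i a, __m128i b)
{
  return _mm_mul_epu32 (_mm_unpacklo_epi32 (a, a),
                        _mm_unpacklo_epi32 (b, b));
}

/* Elements 2 and 3.  */
static inline __m128i
mul_widen_hi (__m128i a, __m128i b)
{
  return _mm_mul_epu32 (_mm_unpackhi_epi32 (a, a),
                        _mm_unpackhi_epi32 (b, b));
}

The multiplies themselves cost the same either way; the difference is the
setup above and how the partial results are put back together afterwards.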
Perhaps the problem is then that the permutation is much more expensive
for even/odd.  With even/odd the f2 routine is:
        vmovdqa d(%rip), %xmm2
        vmovdqa .LC1(%rip), %xmm0
        vpsrlq  $32, %xmm2, %xmm4
        vmovdqa d+16(%rip), %xmm1
        vpmuludq        %xmm0, %xmm2, %xmm5
        vpsrlq  $32, %xmm0, %xmm3
        vpmuludq        %xmm3, %xmm4, %xmm4
        vpmuludq        %xmm0, %xmm1, %xmm0
        vmovdqa .LC2(%rip), %xmm2
        vpsrlq  $32, %xmm1, %xmm1
        vpmuludq        %xmm3, %xmm1, %xmm3
        vmovdqa .LC3(%rip), %xmm1
        vpshufb %xmm2, %xmm5, %xmm5
        vpshufb %xmm1, %xmm4, %xmm4
        vpshufb %xmm2, %xmm0, %xmm2
        vpshufb %xmm1, %xmm3, %xmm1
        vpor    %xmm4, %xmm5, %xmm4
        vpor    %xmm1, %xmm2, %xmm1
        vpsrld  $1, %xmm4, %xmm4
        vmovdqa %xmm4, c(%rip)
        vpsrld  $1, %xmm1, %xmm1
        vmovdqa %xmm1, c+16(%rip)
        ret
and with lo/hi it is:
        vmovdqa d(%rip), %xmm2
        vpunpckhdq      %xmm2, %xmm2, %xmm3
        vpunpckldq      %xmm2, %xmm2, %xmm2
        vmovdqa .LC1(%rip), %xmm0
        vpmuludq        %xmm0, %xmm3, %xmm3
        vmovdqa d+16(%rip), %xmm1
        vpmuludq        %xmm0, %xmm2, %xmm2
        vshufps $221, %xmm2, %xmm3, %xmm2
        vpsrld  $1, %xmm2, %xmm2
        vmovdqa %xmm2, c(%rip)
        vpunpckhdq      %xmm1, %xmm1, %xmm2
        vpunpckldq      %xmm1, %xmm1, %xmm1
        vpmuludq        %xmm0, %xmm2, %xmm2
        vpmuludq        %xmm0, %xmm1, %xmm0
        vshufps $221, %xmm0, %xmm2, %xmm0
        vpsrld  $1, %xmm0, %xmm0
        vmovdqa %xmm0, c+16(%rip)
        ret

        Jakub