"H. J. Lu" <[EMAIL PROTECTED]> wrote on 24/04/2007 01:03:25: ... > > There are > > [EMAIL PROTECTED] vect]$ cat pmovzxbw.c > typedef unsigned char vec_t; > typedef unsigned short vecx_t; > > in > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 >
By the way, this PR says "Integer externsions aren't vectorized" - but I
think the testcase you are referring to does get vectorized, only not as
efficiently as you would want it to (right?).

> > * Also I wonder what the gcc code looks like when complete unrolling is
> > applied (did you use -funroll-loops?). (like the point above, this is just
> > so that we compare apples with apples).
>
> It is similar. I am enclosing it at the end.
>

thanks

> > * I don't entirely follow the code that gcc generates
> > what's that for exactly?:
> >       pxor    %xmm2, %xmm2
> >       movdqa  %xmm2, %xmm1
> >       pcmpgtb %xmm0, %xmm1
> > Is this part of the vec_unpack_hi, and if so - I wonder if there's a better
> > way to model the vec_unpack_hi using the new sse4 instructions?
>
> That is for signed extension. I tried to model vec_unpack_hi with SSE4. It
> isn't easy to move N/2 high elements to N/2 low elements.

just curious - why is it difficult? (couldn't you use a psrldq? is it too
expensive?)

dorit

> The best way
> to do it is to combine
>
>       movdqa   x(%rip), %xmm9
>       pmovsxbw %xmm9, %xmm11
>
> into
>
>       pmovsxbw x(%rip),%xmm11
>
> and repeat it for N/2 elements. Of course, we should only do it if
> vec_unpack_lo is a single instruction.
>
> However, I think we need a more general approach based on the number
> of elements in the resulting vector to handle vec_extend, like
>
> V4QI -> V4SI
> V2QI -> V2DI
> V2HI -> V2DI
>
> They should be independent of vec_unpack.
>
>
> H.J.
> ----
>       .file   "pmovsxbw.c"
>       .text
>       .p2align 4,,15
> .globl foo
>       .type   foo, @function
> foo:
>       pxor    %xmm2, %xmm2
>       movdqa  x(%rip), %xmm9
>       movdqa  x+16(%rip), %xmm6
>       movdqa  %xmm2, %xmm10
>       movdqa  %xmm2, %xmm7
>       movdqa  x+32(%rip), %xmm3
>       movdqa  %xmm2, %xmm4
>       pmovsxbw        %xmm9, %xmm11
>       movdqa  x+48(%rip), %xmm0
>       pcmpgtb %xmm9, %xmm10
>       pcmpgtb %xmm6, %xmm7
>       pmovsxbw        %xmm6, %xmm8
>       pcmpgtb %xmm3, %xmm4
>       pmovsxbw        %xmm3, %xmm5
>       pcmpgtb %xmm0, %xmm2
>       pmovsxbw        %xmm0, %xmm1
>       punpckhbw       %xmm10, %xmm9
>       punpckhbw       %xmm7, %xmm6
>       punpckhbw       %xmm4, %xmm3
>       punpckhbw       %xmm2, %xmm0
>       movdqa  %xmm11, y(%rip)
>       movdqa  %xmm9, y+16(%rip)
>       movdqa  %xmm8, y+32(%rip)
>       movdqa  %xmm6, y+48(%rip)
>       movdqa  %xmm5, y+64(%rip)
>       movdqa  %xmm3, y+80(%rip)
>       movdqa  %xmm1, y+96(%rip)
>       movdqa  %xmm0, y+112(%rip)
>       ret
>       .size   foo, .-foo
>       .ident  "GCC: (GNU) 4.3.0 20070423 (experimental) [trunk revision 124056]"
>       .section        .note.GNU-stack,"",@progbits
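
To make the two ways of producing the high half concrete, the following is a portable scalar model in C of the instruction semantics involved. It is only a sketch: it models lanes with plain arrays instead of using intrinsics, and all type and function names (`bytes16`, `words8`, `sign_mask`, and so on) are hypothetical. It compares the sequence in the listing above (a sign mask from pxor/pcmpgtb, then punpckhbw) with the sequence suggested in the reply (psrldq by 8 bytes, then pmovsxbw). It checks that both give the same result and says nothing about their relative cost.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types: one 128-bit register viewed as lanes. */
typedef struct { int8_t b[16]; } bytes16;
typedef struct { uint16_t w[8]; } words8;

/* pxor + pcmpgtb against zero: 0xFF in every lane whose byte is negative. */
static bytes16 sign_mask(bytes16 src)
{
    bytes16 m;
    for (int i = 0; i < 16; i++)
        m.b[i] = (src.b[i] < 0) ? -1 : 0;
    return m;
}

/* punpckhbw: interleave the high 8 bytes of data (low half of each word)
   with the high 8 bytes of mask (high half of each word). */
static words8 unpack_hi_bw(bytes16 data, bytes16 mask)
{
    words8 r;
    for (int i = 0; i < 8; i++) {
        unsigned lo = (uint8_t)data.b[8 + i];
        unsigned hi = (uint8_t)mask.b[8 + i];
        r.w[i] = (uint16_t)(lo | (hi << 8));
    }
    return r;
}

/* psrldq $8: shift the whole register right by 8 bytes, zero-filling. */
static bytes16 shift_right_8(bytes16 src)
{
    bytes16 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (i < 8) ? src.b[i + 8] : 0;
    return r;
}

/* pmovsxbw: sign-extend the low 8 bytes to 8 words. */
static words8 sign_extend_lo(bytes16 src)
{
    words8 r;
    for (int i = 0; i < 8; i++)
        r.w[i] = (uint16_t)(int16_t)src.b[i];
    return r;
}

/* Check that both sequences sign-extend the high 8 bytes identically. */
void check_unpack_hi_variants(void)
{
    bytes16 x;
    for (int i = 0; i < 16; i++)  /* mix of positive and negative bytes */
        x.b[i] = (int8_t)((i % 2) ? -(i * 8) - 1 : i * 8);

    words8 a = unpack_hi_bw(x, sign_mask(x));     /* as in the listing */
    words8 b = sign_extend_lo(shift_right_8(x));  /* psrldq + pmovsxbw */

    for (int i = 0; i < 8; i++) {
        uint16_t want = (uint16_t)(int16_t)x.b[8 + i];
        assert(a.w[i] == want);
        assert(b.w[i] == want);
    }
}
```

In this model the low half needs only `sign_extend_lo`, which corresponds to the single pmovsxbw that the listing already uses for the vec_unpack_lo part.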