On Tue, Apr 24, 2007 at 12:15:05AM +0300, Dorit Nuzman wrote: > point is clear though which sse4 insns can be modeled using existing idioms > that the vectorizer can use? (I think it is cause I think you already > included it in your patch?)
Yes, but not optimized. > > > > > > > We currently don't have idioms that represent the other forms. > > > > > > By the way, the vectorizer will not be able to make use of these > > > vec_unpacku/s_hi_* insns if you don't define the corresponding > > > vec_unpacku/s_lo_* patterns (although I think these are already defined > in > > > sse.md, though maybe less efficiently than the way sse4 can support > them?). > > > > With my SSE4.1 patch applied, for > > > > typedef char vec_t; > > typedef short vecx_t; > > > > extern __attribute__((aligned(16))) vec_t x [64]; > > extern __attribute__((aligned(16))) vecx_t y [64]; > > > > void > > foo () > > { > > int i; > > > > for (i = 0; i < 64; i++) > > y [i] = x [i]; > > } > > > > I got > > > > movdqa x(%rip), %xmm0 > > movl $16, %eax > > pxor %xmm2, %xmm2 > > pmovsxbw %xmm0, %xmm1 > > movdqa %xmm1, y(%rip) > > movdqa %xmm2, %xmm1 > > pcmpgtb %xmm0, %xmm1 > > punpckhbw %xmm1, %xmm0 > > movdqa %xmm0, y+16(%rip) > > > > When extention is a single instruction, it is better to extend one low > > element at a time: > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 > > > > So I haven't had a chance to look at your patch, but judging from the code > that was generated above I'm assuming that: > - the code above was generated using the vectorizer, > - the above is a body of a loop that iterates 4 times, > - the vectorizer used the vec_unpack_lo/hi idioms to vectorize this loop, > - with your patch, sse.md models vec_unpack_lo_* using the new pmovsx* > insns, > - sse.md still models vec_unpack_hi_* using the old mechanisms. > > (correct so far?) Yes. > > IIUC, icc used a vectorization factor 8, and in each vector iteration > expanded 8 chars to 8 shorts, using the new pmovzx* insn, and then > completely unrolled the loop. gcc OTOH used a vectorization factor 16, and > in each vector iteration expanded 16 chars to 16 shorts, using a > combination of the new and old sse insns for vector unpacking. > > In order for us to vectorize like icc, i.e. - using VF=8, I think the > vectorizer needs to be extended to work with multiple vector sizes at once > (a vector of 8 bytes to hold 8 chars, and a vector of 16 bytes to hold 8 > shorts). We currently assume a single VS (16 in this case) and determine > the VF according to the smallest elements in the loop (chars in this case). > > Note however, that while the code generated by icc looks much more compact > indeed, it does generate twice as many loads as gcc (each of the 8 pmovzx* > directly operates on memory), and in addition every other load is > unaligned. gcc OTOH generated only loads, all aligned. pmovzxbw only needs to be aligned for m64, which is 8 byte. All loads generated by icc are aligned. Since you only need one instruction to extend 8 elements, pmovzxbw should be a win. > > So, in short, the two important notes are that (1) the vectorizer could to > be extended to support multiple vector-sizes or anyhow the way the VF is > determined could be extended to consider other alternatives, and (2) from > the vectorizer's prespective, it's not obvious that working with VF=8 like > icc is necessarily better than working with a VF=16, and a cost model would > hopefully help gcc make the right choice... If extendv8qiv8hi2 exists, vectorizer should use it. > > A few minor comments: > * The icc code you show operates on unsigned types, and the gcc code you > show operates on signed types. I don't know if it makes much of a > difference, just for the record. There are [EMAIL PROTECTED] vect]$ cat pmovzxbw.c typedef unsigned char vec_t; typedef unsigned short vecx_t; in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 > * Also I wonder how the gcc code looks like when complete unrolling is > applied (did you use -funoll-loops?). (like the point above, this is just > so that we aompre apples w apples). It is similar. I am enclosing it at the end. > * I don't entirely follow the code that gcc generates > what's that for exactly?: > pxor %xmm2, %xmm2 > movdqa %xmm2, %xmm1 > pcmpgtb %xmm0, %xmm1 > Is this part of the vec_unpack_hi, and if so - I wonder if there's a better > way to model the vec_unpack_hi using the new sse4 instructions? That is for signed extension. I tried to model vec_unpack_hi with SSE4. It isn't easy to move N/2 high elemenets to N/2 low elemenets. The best way to do it is to combine movdqa x(%rip), %xmm9 pmovsxbw %xmm9, %xmm11 into pmovsxbw x(%rip),%xmm11 and repeat it for N/2 elements. Of cause, we should only do it if vec_unpack_lo is a single instructions. However, I think we need a more general approach based on the number of elements in the resulting vector to handle, vec_extend, like, V4QI -> V4SI V2QI -> V2DI V2HI -> V2DI They should be independent of vec_unpack. H.J. ---- .file "pmovsxbw.c" .text .p2align 4,,15 .globl foo .type foo, @function foo: pxor %xmm2, %xmm2 movdqa x(%rip), %xmm9 movdqa x+16(%rip), %xmm6 movdqa %xmm2, %xmm10 movdqa %xmm2, %xmm7 movdqa x+32(%rip), %xmm3 movdqa %xmm2, %xmm4 pmovsxbw %xmm9, %xmm11 movdqa x+48(%rip), %xmm0 pcmpgtb %xmm9, %xmm10 pcmpgtb %xmm6, %xmm7 pmovsxbw %xmm6, %xmm8 pcmpgtb %xmm3, %xmm4 pmovsxbw %xmm3, %xmm5 pcmpgtb %xmm0, %xmm2 pmovsxbw %xmm0, %xmm1 punpckhbw %xmm10, %xmm9 punpckhbw %xmm7, %xmm6 punpckhbw %xmm4, %xmm3 punpckhbw %xmm2, %xmm0 movdqa %xmm11, y(%rip) movdqa %xmm9, y+16(%rip) movdqa %xmm8, y+32(%rip) movdqa %xmm6, y+48(%rip) movdqa %xmm5, y+64(%rip) movdqa %xmm3, y+80(%rip) movdqa %xmm1, y+96(%rip) movdqa %xmm0, y+112(%rip) ret .size foo, .-foo .ident "GCC: (GNU) 4.3.0 20070423 (experimental) [trunk revision 124056]" .section .note.GNU-stack,"",@progbits