"H. J. Lu" <[EMAIL PROTECTED]> wrote on 23/04/2007 21:56:37:
... > > > > > > so this looks like a vec_unpacku_hi_v4si (or _lo?), i.e. what is > > now > > > > > > modeled as follows in sse.md: > > > > > > > > > > > > (define_expand "vec_unpacku_hi_v4si" > > > > > > [(match_operand:V2DI 0 "register_operand" "") > > > > > > (match_operand:V4SI 1 "register_operand" "")] > > > > > > "TARGET_SSE2" > > > > > > { > > > > > > ix86_expand_sse_unpack (operands, true, true); > > > > > > DONE; > > > > > > }) > > > > > > > > > > > > > > > > I am not sure if they are the same since SSE4.1 instructions > > > > > extend the first 2 elements in the vector, not the high/low > > > > > parts. > > > > > > > > > > > > > unpack high/low means the high/low elements of the vector > > > > > > > > > > SSE4.1 has > > > > > > 1. The first 8 elements of V16QI zero/sign externd to V8HI. > > > > This one is equivalent to vec_unpacku/s_hi_v16qi. > > Did you mean vec_unpacku/s_lo_v16qi? > > > > > > 2. The first 4 elements of V16QI/V8HI zero/sign externd to V4SI. > > > > The second of these two - "first 4 elements of V8HI zero/sign externd to > > V4SI" - is equivalent to vec_unpacku/s_hi_v8hi. > > Did you mean vec_unpacku/s_lo_v8hi? > > > > > > 2. The first 2 elements of V16QI/V8HI/V4SI zero/sign externd to V2DI. > > > > The last of these three - "first 2 elements of V4SI zero/sign extend to > > V2DI" - is equivalent to vec_unpacku/s_hi_v4si. > > Did you mean vec_unpacku/s_lo_v4si? > yes, I might be confusing high and low on x86, sorry about that. I hope the point is clear though which sse4 insns can be modeled using existing idioms that the vectorizer can use? (I think it is cause I think you already included it in your patch?) > > > > We currently don't have idioms that represent the other forms. > > > > By the way, the vectorizer will not be able to make use of these > > vec_unpacku/s_hi_* insns if you don't define the corresponding > > vec_unpacku/s_lo_* patterns (although I think these are already defined in > > sse.md, though maybe less efficiently than the way sse4 can support them?). > > With my SSE4.1 patch applied, for > > typedef char vec_t; > typedef short vecx_t; > > extern __attribute__((aligned(16))) vec_t x [64]; > extern __attribute__((aligned(16))) vecx_t y [64]; > > void > foo () > { > int i; > > for (i = 0; i < 64; i++) > y [i] = x [i]; > } > > I got > > movdqa x(%rip), %xmm0 > movl $16, %eax > pxor %xmm2, %xmm2 > pmovsxbw %xmm0, %xmm1 > movdqa %xmm1, y(%rip) > movdqa %xmm2, %xmm1 > pcmpgtb %xmm0, %xmm1 > punpckhbw %xmm1, %xmm0 > movdqa %xmm0, y+16(%rip) > > When extention is a single instruction, it is better to extend one low > element at a time: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 > So I haven't had a chance to look at your patch, but judging from the code that was generated above I'm assuming that: - the code above was generated using the vectorizer, - the above is a body of a loop that iterates 4 times, - the vectorizer used the vec_unpack_lo/hi idioms to vectorize this loop, - with your patch, sse.md models vec_unpack_lo_* using the new pmovsx* insns, - sse.md still models vec_unpack_hi_* using the old mechanisms. (correct so far?) IIUC, icc used a vectorization factor 8, and in each vector iteration expanded 8 chars to 8 shorts, using the new pmovzx* insn, and then completely unrolled the loop. gcc OTOH used a vectorization factor 16, and in each vector iteration expanded 16 chars to 16 shorts, using a combination of the new and old sse insns for vector unpacking. In order for us to vectorize like icc, i.e. - using VF=8, I think the vectorizer needs to be extended to work with multiple vector sizes at once (a vector of 8 bytes to hold 8 chars, and a vector of 16 bytes to hold 8 shorts). We currently assume a single VS (16 in this case) and determine the VF according to the smallest elements in the loop (chars in this case). Note however, that while the code generated by icc looks much more compact indeed, it does generate twice as many loads as gcc (each of the 8 pmovzx* directly operates on memory), and in addition every other load is unaligned. gcc OTOH generated only loads, all aligned. So, in short, the two important notes are that (1) the vectorizer could to be extended to support multiple vector-sizes or anyhow the way the VF is determined could be extended to consider other alternatives, and (2) from the vectorizer's prespective, it's not obvious that working with VF=8 like icc is necessarily better than working with a VF=16, and a cost model would hopefully help gcc make the right choice... A few minor comments: * The icc code you show operates on unsigned types, and the gcc code you show operates on signed types. I don't know if it makes much of a difference, just for the record. * Also I wonder how the gcc code looks like when complete unrolling is applied (did you use -funoll-loops?). (like the point above, this is just so that we aompre apples w apples). * I don't entirely follow the code that gcc generates what's that for exactly?: pxor %xmm2, %xmm2 movdqa %xmm2, %xmm1 pcmpgtb %xmm0, %xmm1 Is this part of the vec_unpack_hi, and if so - I wonder if there's a better way to model the vec_unpack_hi using the new sse4 instructions? dorit > > H.J.