On Tue, Apr 24, 2007 at 12:15:05AM +0300, Dorit Nuzman wrote:
> point is clear though which sse4 insns can be modeled using existing idioms
> that the vectorizer can use? (I think it is cause I think you already
> included it in your patch?)
Yes, but not optimized.
>
> > >
> > > We currently don't have idioms that represent the other forms.
> > >
> > > By the way, the vectorizer will not be able to make use of these
> > > vec_unpacku/s_hi_* insns if you don't define the corresponding
> > > vec_unpacku/s_lo_* patterns (although I think these are already defined
> in
> > > sse.md, though maybe less efficiently than the way sse4 can support
> them?).
> >
> > With my SSE4.1 patch applied, for
> >
> > typedef char vec_t;
> > typedef short vecx_t;
> >
> > extern __attribute__((aligned(16))) vec_t x [64];
> > extern __attribute__((aligned(16))) vecx_t y [64];
> >
> > void
> > foo ()
> > {
> > int i;
> >
> > for (i = 0; i < 64; i++)
> > y [i] = x [i];
> > }
> >
> > I got
> >
> > movdqa x(%rip), %xmm0
> > movl $16, %eax
> > pxor %xmm2, %xmm2
> > pmovsxbw %xmm0, %xmm1
> > movdqa %xmm1, y(%rip)
> > movdqa %xmm2, %xmm1
> > pcmpgtb %xmm0, %xmm1
> > punpckhbw %xmm1, %xmm0
> > movdqa %xmm0, y+16(%rip)
> >
> > When extention is a single instruction, it is better to extend one low
> > element at a time:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667
> >
>
> So I haven't had a chance to look at your patch, but judging from the code
> that was generated above I'm assuming that:
> - the code above was generated using the vectorizer,
> - the above is a body of a loop that iterates 4 times,
> - the vectorizer used the vec_unpack_lo/hi idioms to vectorize this loop,
> - with your patch, sse.md models vec_unpack_lo_* using the new pmovsx*
> insns,
> - sse.md still models vec_unpack_hi_* using the old mechanisms.
>
> (correct so far?)
Yes.
>
> IIUC, icc used a vectorization factor 8, and in each vector iteration
> expanded 8 chars to 8 shorts, using the new pmovzx* insn, and then
> completely unrolled the loop. gcc OTOH used a vectorization factor 16, and
> in each vector iteration expanded 16 chars to 16 shorts, using a
> combination of the new and old sse insns for vector unpacking.
>
> In order for us to vectorize like icc, i.e. - using VF=8, I think the
> vectorizer needs to be extended to work with multiple vector sizes at once
> (a vector of 8 bytes to hold 8 chars, and a vector of 16 bytes to hold 8
> shorts). We currently assume a single VS (16 in this case) and determine
> the VF according to the smallest elements in the loop (chars in this case).
>
> Note however, that while the code generated by icc looks much more compact
> indeed, it does generate twice as many loads as gcc (each of the 8 pmovzx*
> directly operates on memory), and in addition every other load is
> unaligned. gcc OTOH generated only loads, all aligned.
pmovzxbw only needs to be aligned for m64, which is 8 byte. All loads
generated by icc are aligned. Since you only need one instruction to
extend 8 elements, pmovzxbw should be a win.
>
> So, in short, the two important notes are that (1) the vectorizer could to
> be extended to support multiple vector-sizes or anyhow the way the VF is
> determined could be extended to consider other alternatives, and (2) from
> the vectorizer's prespective, it's not obvious that working with VF=8 like
> icc is necessarily better than working with a VF=16, and a cost model would
> hopefully help gcc make the right choice...
If extendv8qiv8hi2 exists, vectorizer should use it.
>
> A few minor comments:
> * The icc code you show operates on unsigned types, and the gcc code you
> show operates on signed types. I don't know if it makes much of a
> difference, just for the record.
There are
[EMAIL PROTECTED] vect]$ cat pmovzxbw.c
typedef unsigned char vec_t;
typedef unsigned short vecx_t;
in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667
> * Also I wonder how the gcc code looks like when complete unrolling is
> applied (did you use -funoll-loops?). (like the point above, this is just
> so that we aompre apples w apples).
It is similar. I am enclosing it at the end.
> * I don't entirely follow the code that gcc generates
> what's that for exactly?:
> pxor %xmm2, %xmm2
> movdqa %xmm2, %xmm1
> pcmpgtb %xmm0, %xmm1
> Is this part of the vec_unpack_hi, and if so - I wonder if there's a better
> way to model the vec_unpack_hi using the new sse4 instructions?
That is for signed extension. I tried to model vec_unpack_hi with SSE4. It
isn't easy to move N/2 high elemenets to N/2 low elemenets. The best way
to do it is to combine
movdqa x(%rip), %xmm9
pmovsxbw %xmm9, %xmm11
into
pmovsxbw x(%rip),%xmm11
and repeat it for N/2 elements. Of cause, we should only do it if
vec_unpack_lo is a single instructions.
However, I think we need a more general approach based on the number
of elements in the resulting vector to handle, vec_extend, like,
V4QI -> V4SI
V2QI -> V2DI
V2HI -> V2DI
They should be independent of vec_unpack.
H.J.
----
.file "pmovsxbw.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
pxor %xmm2, %xmm2
movdqa x(%rip), %xmm9
movdqa x+16(%rip), %xmm6
movdqa %xmm2, %xmm10
movdqa %xmm2, %xmm7
movdqa x+32(%rip), %xmm3
movdqa %xmm2, %xmm4
pmovsxbw %xmm9, %xmm11
movdqa x+48(%rip), %xmm0
pcmpgtb %xmm9, %xmm10
pcmpgtb %xmm6, %xmm7
pmovsxbw %xmm6, %xmm8
pcmpgtb %xmm3, %xmm4
pmovsxbw %xmm3, %xmm5
pcmpgtb %xmm0, %xmm2
pmovsxbw %xmm0, %xmm1
punpckhbw %xmm10, %xmm9
punpckhbw %xmm7, %xmm6
punpckhbw %xmm4, %xmm3
punpckhbw %xmm2, %xmm0
movdqa %xmm11, y(%rip)
movdqa %xmm9, y+16(%rip)
movdqa %xmm8, y+32(%rip)
movdqa %xmm6, y+48(%rip)
movdqa %xmm5, y+64(%rip)
movdqa %xmm3, y+80(%rip)
movdqa %xmm1, y+96(%rip)
movdqa %xmm0, y+112(%rip)
ret
.size foo, .-foo
.ident "GCC: (GNU) 4.3.0 20070423 (experimental) [trunk revision
124056]"
.section .note.GNU-stack,"",@progbits