On Tue, Apr 24, 2007 at 12:15:05AM +0300, Dorit Nuzman wrote:

> The point is clear though: which sse4 insns can be modeled using existing
> idioms that the vectorizer can use? (I think it is clear, because I think
> you already included it in your patch?)

Yes, but not optimized.

> 
> > >
> > > We currently don't have idioms that represent the other forms.
> > >
> > > By the way, the vectorizer will not be able to make use of these
> > > vec_unpacku/s_hi_* insns if you don't define the corresponding
> > > vec_unpacku/s_lo_* patterns (although I think these are already defined
> in
> > > sse.md, though maybe less efficiently than the way sse4 can support
> them?).
> >
> > With my SSE4.1 patch applied, for
> >
> > typedef char vec_t;
> > typedef short vecx_t;
> >
> > extern __attribute__((aligned(16))) vec_t x [64];
> > extern __attribute__((aligned(16))) vecx_t y [64];
> >
> > void
> > foo ()
> > {
> >   int i;
> >
> >   for (i = 0; i < 64; i++)
> >     y [i]  = x [i];
> > }
> >
> > I got
> >
> >         movdqa  x(%rip), %xmm0
> >         movl    $16, %eax
> >         pxor    %xmm2, %xmm2
> >         pmovsxbw        %xmm0, %xmm1
> >         movdqa  %xmm1, y(%rip)
> >         movdqa  %xmm2, %xmm1
> >         pcmpgtb %xmm0, %xmm1
> >         punpckhbw       %xmm1, %xmm0
> >         movdqa  %xmm0, y+16(%rip)
> >
> > When extension is a single instruction, it is better to extend one low
> > element at a time:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667
> >
> 
> So I haven't had a chance to look at your patch, but judging from the code
> that was generated above I'm assuming that:
> - the code above was generated using the vectorizer,
> - the above is a body of a loop that iterates 4 times,
> - the vectorizer used the vec_unpack_lo/hi idioms to vectorize this loop,
> - with your patch, sse.md models vec_unpack_lo_* using the new pmovsx*
> insns,
> - sse.md still models vec_unpack_hi_* using the old mechanisms.
> 
> (correct so far?)

Yes.

> 
> IIUC, icc used a vectorization factor of 8, and in each vector iteration
> expanded 8 chars to 8 shorts, using the new pmovzx* insn, and then
> completely unrolled the loop. gcc OTOH used a vectorization factor of 16,
> and in each vector iteration expanded 16 chars to 16 shorts, using a
> combination of the new and old sse insns for vector unpacking.
> 
> In order for us to vectorize like icc, i.e. using VF=8, I think the
> vectorizer needs to be extended to work with multiple vector sizes at once
> (a vector of 8 bytes to hold 8 chars, and a vector of 16 bytes to hold 8
> shorts). We currently assume a single VS (16 in this case) and determine
> the VF according to the smallest elements in the loop (chars in this case).
> 
> Note however, that while the code generated by icc looks much more compact
> indeed, it does generate twice as many loads as gcc (each of the 8 pmovzx*
> directly operates on memory), and in addition every other load is
> unaligned. gcc OTOH generated half as many loads, all of them aligned.

pmovzxbw's memory operand is m64, which is 8 bytes, so it only needs
8-byte alignment. All the loads generated by icc are aligned. Since you
only need one instruction to extend 8 elements, pmovzxbw should be a win.
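
Something like the following hand-written intrinsics sketch (an
illustration of the per-8-element scheme, not actual vectorizer output;
smmintrin.h is the SSE4.1 intrinsics header):

#include <smmintrin.h>

unsigned char x [64] __attribute__((aligned(16)));
unsigned short y [64] __attribute__((aligned(16)));

void
foo ()
{
  int i;

  /* One 8-byte load plus one pmovzxbw per 8 elements; every load
     is 8-byte aligned.  */
  for (i = 0; i < 64; i += 8)
    {
      __m128i v = _mm_loadl_epi64 ((const __m128i *) (x + i));
      _mm_store_si128 ((__m128i *) (y + i), _mm_cvtepu8_epi16 (v));
    }
}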

> 
> So, in short, the two important notes are that (1) the vectorizer could
> be extended to support multiple vector-sizes, or at least the way the VF
> is determined could be extended to consider other alternatives, and (2)
> from the vectorizer's perspective, it's not obvious that working with
> VF=8 like icc is necessarily better than working with VF=16, and a cost
> model would hopefully help gcc make the right choice...

If extendv8qiv8hi2 exists, the vectorizer should use it.

> 
> A few minor comments:
> * The icc code you show operates on unsigned types, and the gcc code you
> show operates on signed types. I don't know if it makes much of a
> difference, just for the record.

There are

[EMAIL PROTECTED] vect]$ cat pmovzxbw.c
typedef unsigned char vec_t;
typedef unsigned short vecx_t;

in

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667

> * Also I wonder what the gcc code looks like when complete unrolling is
> applied (did you use -funroll-loops?). (Like the point above, this is just
> so that we compare apples with apples.)

It is similar. I am enclosing it at the end.

> * I don't entirely follow the code that gcc generates. What is this for
> exactly?:
>       pxor    %xmm2, %xmm2
>       movdqa  %xmm2, %xmm1
>       pcmpgtb %xmm0, %xmm1
> Is this part of the vec_unpack_hi, and if so, I wonder if there's a better
> way to model the vec_unpack_hi using the new sse4 instructions?

That is for sign extension.  I tried to model vec_unpack_hi with SSE4. It
isn't easy to move the N/2 high elements to the N/2 low elements. The best
way to do it is to combine

        movdqa  x(%rip), %xmm9
        pmovsxbw   %xmm9, %xmm11

into

        pmovsxbw x(%rip),%xmm11

and repeat it for N/2 elements. Of course, we should only do it if
vec_unpack_lo is a single instruction.
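
For reference, the pxor/pcmpgtb/punpckhbw sequence above is equivalent
to this SSE2 intrinsics sketch (the function name is just for
illustration):

#include <emmintrin.h>

/* pcmpgtb against zero yields 0xff for negative bytes and 0x00
   otherwise; punpckhbw interleaves those sign bytes above the high
   8 source bytes, giving 8 sign-extended shorts.  */
__m128i
unpack_hi_s8_to_s16 (__m128i v)
{
  __m128i sign = _mm_cmpgt_epi8 (_mm_setzero_si128 (), v);
  return _mm_unpackhi_epi8 (v, sign);
}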

However, I think we need a more general approach, based on the number of
elements in the resulting vector, to handle vec_extend, like:

V4QI -> V4SI
V2QI -> V2DI
V2HI -> V2DI

They should be independent of vec_unpack.
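
To illustrate the V4QI -> V4SI case with the new insns (a sketch; the
function name and the 4-byte load through _mm_cvtsi32_si128 are just for
illustration, and p is assumed 4-byte aligned):

#include <smmintrin.h>

/* pmovsxbd widens 4 signed chars to 4 ints in one instruction,
   with no hi/lo unpacking at all.  */
__m128i
extend_v4qi_v4si (const signed char *p)
{
  __m128i v = _mm_cvtsi32_si128 (*(const int *) p);
  return _mm_cvtepi8_epi32 (v);
}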


H.J.
----
        .file   "pmovsxbw.c"
        .text
        .p2align 4,,15
.globl foo
        .type   foo, @function
foo:
        pxor    %xmm2, %xmm2
        movdqa  x(%rip), %xmm9
        movdqa  x+16(%rip), %xmm6
        movdqa  %xmm2, %xmm10
        movdqa  %xmm2, %xmm7
        movdqa  x+32(%rip), %xmm3
        movdqa  %xmm2, %xmm4
        pmovsxbw        %xmm9, %xmm11
        movdqa  x+48(%rip), %xmm0
        pcmpgtb %xmm9, %xmm10
        pcmpgtb %xmm6, %xmm7
        pmovsxbw        %xmm6, %xmm8
        pcmpgtb %xmm3, %xmm4
        pmovsxbw        %xmm3, %xmm5
        pcmpgtb %xmm0, %xmm2
        pmovsxbw        %xmm0, %xmm1
        punpckhbw       %xmm10, %xmm9
        punpckhbw       %xmm7, %xmm6
        punpckhbw       %xmm4, %xmm3
        punpckhbw       %xmm2, %xmm0
        movdqa  %xmm11, y(%rip)
        movdqa  %xmm9, y+16(%rip)
        movdqa  %xmm8, y+32(%rip)
        movdqa  %xmm6, y+48(%rip)
        movdqa  %xmm5, y+64(%rip)
        movdqa  %xmm3, y+80(%rip)
        movdqa  %xmm1, y+96(%rip)
        movdqa  %xmm0, y+112(%rip)
        ret
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.3.0 20070423 (experimental) [trunk revision 
124056]"
        .section        .note.GNU-stack,"",@progbits
