Re: Does vectorizer support extension?

Dorit Nuzman Mon, 23 Apr 2007 14:13:34 -0700

"H. J. Lu" <[EMAIL PROTECTED]> wrote on 23/04/2007 21:56:37:


...
> > > > > > so this looks like a vec_unpacku_hi_v4si (or _lo?), i.e. what
is
> > now
> > > > > > modeled as follows in sse.md:
> > > > > >
> > > > > > (define_expand "vec_unpacku_hi_v4si"
> > > > > >   [(match_operand:V2DI 0 "register_operand" "")
> > > > > >    (match_operand:V4SI 1 "register_operand" "")]
> > > > > >   "TARGET_SSE2"
> > > > > > {
> > > > > >   ix86_expand_sse_unpack (operands, true, true);
> > > > > >   DONE;
> > > > > > })
> > > > > >
> > > > >
> > > > > I am not sure if they are the same since SSE4.1 instructions
> > > > > extend the first 2 elements in the vector, not the high/low
> > > > > parts.
> > > > >
> > > >
> > > > unpack high/low means the high/low elements of the vector
> > > >
> > >
> > > SSE4.1 has
> > >
> > > 1.  The first 8 elements of V16QI zero/sign externd to V8HI.
> >
> > This one is equivalent to vec_unpacku/s_hi_v16qi.
>
> Did you mean vec_unpacku/s_lo_v16qi?
>
> >
> > > 2.  The first 4 elements of V16QI/V8HI zero/sign externd to V4SI.
> >
> > The second of these two - "first 4 elements of V8HI zero/sign externd
to
> > V4SI" - is equivalent to vec_unpacku/s_hi_v8hi.
>
> Did you mean vec_unpacku/s_lo_v8hi?
>
> >
> > > 2.  The first 2 elements of V16QI/V8HI/V4SI zero/sign externd to
V2DI.
> >
> > The last of these three - "first 2 elements of V4SI zero/sign extend to
> > V2DI" - is equivalent to vec_unpacku/s_hi_v4si.
>
> Did you mean vec_unpacku/s_lo_v4si?
>

yes, I might be confusing high and low on x86, sorry about that. I hope the
point is clear though which sse4 insns can be modeled using existing idioms
that the vectorizer can use? (I think it is cause I think you already
included it in your patch?)

> >
> > We currently don't have idioms that represent the other forms.
> >
> > By the way, the vectorizer will not be able to make use of these
> > vec_unpacku/s_hi_* insns if you don't define the corresponding
> > vec_unpacku/s_lo_* patterns (although I think these are already defined
in
> > sse.md, though maybe less efficiently than the way sse4 can support
them?).
>
> With my SSE4.1 patch applied, for
>
> typedef char vec_t;
> typedef short vecx_t;
>
> extern __attribute__((aligned(16))) vec_t x [64];
> extern __attribute__((aligned(16))) vecx_t y [64];
>
> void
> foo ()
> {
>   int i;
>
>   for (i = 0; i < 64; i++)
>     y [i]  = x [i];
> }
>
> I got
>
>         movdqa  x(%rip), %xmm0
>         movl    $16, %eax
>         pxor    %xmm2, %xmm2
>         pmovsxbw        %xmm0, %xmm1
>         movdqa  %xmm1, y(%rip)
>         movdqa  %xmm2, %xmm1
>         pcmpgtb %xmm0, %xmm1
>         punpckhbw       %xmm1, %xmm0
>         movdqa  %xmm0, y+16(%rip)
>
> When extention is a single instruction, it is better to extend one low
> element at a time:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667
>

So I haven't had a chance to look at your patch, but judging from the code
that was generated above I'm assuming that:
- the code above was generated using the vectorizer,
- the above is a body of a loop that iterates 4 times,
- the vectorizer used the vec_unpack_lo/hi idioms to vectorize this loop,
- with your patch, sse.md models vec_unpack_lo_* using the new pmovsx*
insns,
- sse.md still models vec_unpack_hi_* using the old mechanisms.

(correct so far?)

IIUC, icc used a vectorization factor 8, and in each vector iteration
expanded 8 chars to 8 shorts, using the new pmovzx* insn, and then
completely unrolled the loop. gcc OTOH used a vectorization factor 16, and
in each vector iteration expanded 16 chars to 16 shorts, using a
combination of the new and old sse insns for vector unpacking.

In order for us to vectorize like icc, i.e. - using VF=8, I think the
vectorizer needs to be extended to work with multiple vector sizes at once
(a vector of 8 bytes to hold 8 chars, and a vector of 16 bytes to hold 8
shorts). We currently assume a single VS (16 in this case) and determine
the VF according to the smallest elements in the loop (chars in this case).

Note however, that while the code generated by icc looks much more compact
indeed, it does generate twice as many loads as gcc (each of the 8 pmovzx*
directly operates on memory), and in addition every other load is
unaligned. gcc OTOH generated only loads, all aligned.

So, in short, the two important notes are that (1) the vectorizer could to
be extended to support multiple vector-sizes or anyhow the way the VF is
determined could be extended to consider other alternatives, and (2) from
the vectorizer's prespective, it's not obvious that working with VF=8 like
icc is necessarily better than working with a VF=16, and a cost model would
hopefully help gcc make the right choice...

A few minor comments:
* The icc code you show operates on unsigned types, and the gcc code you
show operates on signed types. I don't know if it makes much of a
difference, just for the record.
* Also I wonder how the gcc code looks like when complete unrolling is
applied (did you use -funoll-loops?). (like the point above, this is just
so that we aompre apples w apples).
* I don't entirely follow the code that gcc generates
what's that for exactly?:
      pxor    %xmm2, %xmm2
      movdqa  %xmm2, %xmm1
      pcmpgtb %xmm0, %xmm1
Is this part of the vec_unpack_hi, and if so - I wonder if there's a better
way to model the vec_unpack_hi using the new sse4 instructions?

dorit

>
> H.J.

Re: Does vectorizer support extension?

Reply via email to