Re: [RFC] Vectorization of indexed elements

Vidya Praveen Wed, 04 Dec 2013 08:11:03 -0800

Hi Richi,

Apologies for the late response. I was on vacation.


On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> On Fri, 11 Oct 2013, Vidya Praveen wrote:
> 
> > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > > 
> > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > > [...]
> > > > > > > > > > I can't really insist on the single lane load.. something 
> > > > > > > > > > like:
> > > > > > > > > > 
> > > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > > > > 
> > > > > > > > > > Or is there any other way to do this?
> > > > > > > > > 
> > > > > > > > > Can you elaborate on "I can't really insist on the single 
> > > > > > > > > lane load"?
> > > > > > > > > What's the single lane load in your example? 
> > > > > > > > 
> > > > > > > > Loading just one lane of the vector like this:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > > 
> > > > > > > > or 
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > > 
> > > > > > > > is what I meant by single lane load. In this example:
> > > > > > > > 
> > > > > > > > t = c[2] 
> > > > > > > > ...
> > > > > > > > vb:v4si = b[0:3] 
> > > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > > va:v4si = vb:v4si <op> vc:v4si 
> > > > > > > > 
> > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at 
> > > > > > > > vec_init, I cannot
> > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] 
> > > > > > > > (which could be 
> > > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > > 
> > > > > > > > > I'd expect the instruction
> > > > > > > > > pattern as quoted to just work (and I hope we expand an 
> > > > > > > > > uniform
> > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > > 
> > > > > > > > As much as I went through the code, this is only done using 
> > > > > > > > vect_init. It is
> > > > > > > > not expanded as vec_duplicate from, for example, 
> > > > > > > > store_constructor() of expr.c
> > > > > > > 
> > > > > > > Do you see any issues if we expand such constructor as 
> > > > > > > vec_duplicate directly 
> > > > > > > instead of going through vect_init way? 
> > > > > > 
> > > > > > Sorry, that was a bad question.
> > > > > > 
> > > > > > But here's what I would like to propose as a first step. Please 
> > > > > > tell me if this
> > > > > > is acceptable or if it makes sense:
> > > > > > 
> > > > > > - Introduce standard pattern names 
> > > > > > 
> > > > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > > > 
> > > > > > Example:
> > > > > > 
> > > > > > (define_insn "vmuliv4si4"
> > > > > >    [set (match_operand:V4SI 0 "register_operand")
> > > > > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > > >                   (vec_duplicate:V4SI
> > > > > >                     (vec_select:SI
> > > > > >                       (match_operand:V4SI 2 "register_operand")
> > > > > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > > > > >  ...
> > > > > > )
> > > > > 
> > > > > We could factor this with providing a standard pattern name for
> > > > > 
> > > > > (define_insn "vdupi<mode>"
> > > > >   [set (match_operand:<mode> 0 "register_operand")
> > > > >        (vec_duplicate:<mode>
> > > > >           (vec_select:<scalarmode>
> > > > >              (match_operand:<mode> 1 "register_operand")
> > > > >              (match_operand:SI 2 "immediate_operand))))]
> > > > 
> > > > This is good. I did think about this but then I thought of avoiding the 
> > > > need
> > > > for combiner patterns :-) 
> > > > 
> > > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > > 
> > > The specific mul pattern?  As said, consider factoring to vdupi to
> > > avoid an explosion in required special optabs.
> > > 
> > > > > (you use V4SI for the immediate?  
> > > > 
> > > > Sorry typo again!! It should've been SI.
> > > > 
> > > > > Ideally vdupi has another custom
> > > > > mode for the vector index).
> > > > > 
> > > > > Note that this factored pattern is already available as 
> > > > > vec_perm_const!
> > > > > It is simply (vec_perm_const:V4SI <source> <source> 
> > > > > <immediate-selector>).
> > > > > 
> > > > > Which means that on the GIMPLE level we should try to combine
> > > > > 
> > > > > el_4 = BIT_FIELD_REF <v_3, ...>;
> > > > > v_5 = { el_4, el_4, ... };
> > > > 
> > > > I don't think we reach this state at all for the scenarios in 
> > > > discussion.
> > > > what we generally have is:
> > > > 
> > > >  el_4 = MEM_REF < array + index*size >
> > > >  v_5 = { el_4, ... }
> > > > 
> > > > Or am I missing something?
> > > 
> > > Well, but in that case I doubt it is profitable (or even valid!) to
> > > turn this into a vector lane load from the array.  If it is profitable
> > > to perform a vector read (because we're going to use the other elements
> > > of the vector as well) then the vectorizer should produce a vector
> > > load and materialize the uniform vector from one of its elements.
> > > 
> > > Maybe at this point you should show us a compilable C testcase
> > > with a loop that should be vectorized using your instructions in
> > > the end?
> > 
> > Here's a compilable example:
> > 
> > void 
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c[2];
> > }
> > 
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument 
> > but
> > use only one element (by using the same value for all the lanes) of the 
> > vector register.
> > 
> > Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
> >     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
> > 
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector 
> > register (what I meant as "lane specific load"). The benefit of doing this 
> > is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> > 
> > void
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> >     a[i] = b[i] * c;
> > }
> 
> Both cases can be handled by patterns that match
> 
>   (mul:VXSI (reg:VXSI
>              (vec_duplicate:VXSI reg:SI)))

How do I arrive at this pattern in the first place? Assuming vec_init with
uniform values are expanded as vec_duplicate, it will still be two expressions.

That is,

(set reg:VXSI (vec_duplicate:VXSI (reg:SI)))
(set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI)))

> You'd then "consume" the vec_duplicate and implement it as
> load scalar into element zero of the vector and use index mult
> with index zero.

If I understand this correctly, you are suggesting to leave the scalar
load from memory as it is but treat the 

(mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI)))

as 

load reg:VXSI[0], reg:SI
mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register 
perhaps

either by generating instructions directly or by using define_split. Am I right?

If I'm right, then my concern is that it may be possible to simplify this 
further
by loading directly to a indexed vector register from memory but it's too late 
at
this point for such simplification to be possible.

Please let me know what am I not understanding.

> This handling can be either implemented at combine level (this
> is a two-instruction combine only!) or via TER.
>
> > Another example which can take advantage of the target feature:
> > 
> > void 
> > foo (int *__restrict__ a,
> >      int *__restrict__ b,
> >      int *__restrict__ c)
> > {
> >   int i,j;
> > 
> >   for (i = 0; i < 8; i++)
> >     for (j = 0; j < 8; j++)
> >       a[j] += b[j] * c[i];
> > }
> > 
> > This scenario we discussed this earlier (you suggested handling this at 
> > TER).
> 
> Yeah, though for optimal code you'd want to load c[0:8] once with a
> vector load.  Eventually the vectorizer outer loop vectorization support
> can be massaged to handle this optimally.

I agree. 

Thanks
VP.

Re: [RFC] Vectorization of indexed elements

Reply via email to